The BioImage Model Zoo uses Zenodo for storage, so it is natural to use Zenodo's download statistics. However, we don't know exactly how Zenodo counts downloads. The Zenodo documentation on user statistics only vaguely describes the difference between a download and a unique download:
What is a download?
A user (human or machine) downloading a file from a record, excluding double-clicks and robots. If a record has multiple files and you download all files, each file counts as one download.
What is a unique download?
A unique download is defined as one or more file downloads from files of a single record by a user within a 1-hour time-window. This means that if one or more files of the same record were downloaded multiple times by the same user within the same time-window, we consider it as one unique download.
This description seems to distinguish three categories: humans, machines, and robots. Humans and machines are grouped together as "users", so the download statistics include humans and machines but not robots.
With @FynnBe, we dug a bit deeper and found out what Zenodo actually does under the hood (open source rocks here!).
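To make the two definitions quoted above concrete, here is a minimal sketch of how the two counters could differ. It is not Zenodo's implementation: the function name `count_downloads`, the event tuple layout, and the reading of the "1-hour time-window" as fixed clock-hour buckets are all assumptions for illustration.

```python
from datetime import datetime


def count_downloads(events):
    """Return (downloads, unique_downloads) for a list of download events.

    Each event is a (user_id, record_id, file_name, timestamp) tuple.
    Assumption: the 1-hour window is a fixed clock-hour bucket; the real
    implementation may define the window differently.
    """
    downloads = 0
    unique_keys = set()
    for user_id, record_id, _file_name, timestamp in events:
        # Every file download counts once.
        downloads += 1
        # Same user + same record + same hour collapses to one unique download.
        hour_bucket = timestamp.replace(minute=0, second=0, microsecond=0)
        unique_keys.add((user_id, record_id, hour_bucket))
    return downloads, len(unique_keys)


events = [
    ("alice", "record-1", "weights.pt", datetime(2023, 3, 1, 10, 5)),
    ("alice", "record-1", "rdf.yaml", datetime(2023, 3, 1, 10, 20)),
    ("alice", "record-1", "weights.pt", datetime(2023, 3, 1, 12, 0)),
    ("bob", "record-1", "weights.pt", datetime(2023, 3, 1, 10, 30)),
]
downloads, unique_downloads = count_downloads(events)
```

In this made-up example, the four file downloads collapse into three unique downloads, because the first two events share the same user, record, and hour.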
Here is what we found:
In Zenodo's statistics architecture, a module named invenio_stats computes the download statistics. invenio_stats in turn uses a Python package named counter-robots, which does the actual detection of human vs. machine vs. robot.
Basically, any HTTP client (a browser, a Python script, or a web crawler) sends a header named User-Agent with each request. It contains a string that the server uses to tell which browser is being used, or whether the client is a Python script. For example, Chrome's User-Agent will be Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36, and the User-Agent of Python's requests library will be something like python-requests/2.28.2.
When the Zenodo server receives a request, it inspects the User-Agent and checks it against two known lists: robots and machines. It excludes robots from the download statistics but includes downloads by machines and browsers.
This means that any download made through our bioimageio.core library or the bioimage.io downloader, for example, is counted in the download statistics. It also means that our CI downloads are counted!
However, by setting the User-Agent to a robot value (e.g. User-Agent=bot) in the bioimageio.core or bioimageio.spec library, we can easily label our CI scripts as robots so that they are excluded from the download statistics. Here is an example that shows how you can set the user agent in Python.
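The check described above can be sketched roughly as follows. This is only an illustration of the idea, not the counter-robots code: the pattern lists are a few made-up examples (the real lists are much longer and may match differently), and `classify_user_agent` and `counts_as_download` are hypothetical names.

```python
import re

# Assumption: tiny illustrative pattern lists, NOT the real robot/machine lists.
ROBOT_PATTERNS = [r"bot", r"crawl", r"spider"]
MACHINE_PATTERNS = [r"python-requests", r"curl", r"wget"]


def classify_user_agent(user_agent):
    """Classify a User-Agent string as 'robot', 'machine', or 'human'."""
    ua = user_agent or ""
    # Robots are checked first, so a robot is never counted as a machine.
    if any(re.search(p, ua, re.IGNORECASE) for p in ROBOT_PATTERNS):
        return "robot"
    if any(re.search(p, ua, re.IGNORECASE) for p in MACHINE_PATTERNS):
        return "machine"
    # Anything else (including a missing header) is treated as human here.
    return "human"


def counts_as_download(user_agent):
    """Only robots are excluded; humans and machines are both counted."""
    return classify_user_agent(user_agent) != "robot"


chrome = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
)
```

With these illustrative lists, the Chrome string above is classified as human, python-requests/2.28.2 as machine, and a plain bot as robot, so only the last one is left out of the statistics.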
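As a minimal sketch of setting the header with only the Python standard library (the URL below is a placeholder, and `build_request` and `download` are hypothetical helper names, not part of bioimageio.core or bioimageio.spec):

```python
import urllib.request


def build_request(url, user_agent="bot"):
    """Create a request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, user_agent="bot"):
    """Download the given URL while identifying as `user_agent`."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()


# Placeholder URL for illustration; nothing is sent until download() is called.
request = build_request("https://example.org/some-file.zip", user_agent="bot")
```

With the requests library, the equivalent would be passing `headers={"User-Agent": "bot"}` to `requests.get`.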
We can automatically label bioimageio.core as a robot by detecting the CI environment. For that, there seems to be a common "CI" environment variable: https://stackoverflow.com/a/75223617
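A rough sketch of that idea, assuming the CI system sets a `CI` environment variable to some non-empty, non-false value (not every CI system is guaranteed to do so). The names `is_ci`, `get_user_agent`, and the default agent string are hypothetical, not existing bioimageio.core API.

```python
import os


def is_ci(environ=None):
    """Guess whether we run in CI, based on the conventional CI env var."""
    env = os.environ if environ is None else environ
    value = env.get("CI", "")
    # Treat unset, empty, "0", and "false" as "not in CI".
    return value.strip().lower() not in ("", "0", "false")


def get_user_agent(base="bioimageio-example/0.0", environ=None):
    """Return a User-Agent that marks CI runs as a robot.

    Assumption: a User-Agent containing "bot" is matched by the robot list.
    """
    if is_ci(environ):
        return f"bot ({base}; CI)"
    return base
```

The returned string would then be sent as the User-Agent header for every download the library makes.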
cc @fjug @akreshuk @FynnBe @constantinpape