bioimage-io / collection-bioimage-io

(deprecated in favor of bioimage-io/collection) RDF collection for BioImage.IO

Improve download statistics #627

Open oeway opened 1 year ago

oeway commented 1 year ago

In order to improve the statistics on the website, a quick fix might be to normalize the current numbers by file count. A major reason the download numbers for models are exaggerated is that Zenodo adds up the download counts for all the files in one deposit. There are two things we can do to make the numbers more realistic:

  1. We can divide the total downloaded volume by the actual total size of the files in the deposit. Alternatively (likely less accurate), divide the unique download count by the number of files in the deposit (see more info here: https://help.zenodo.org/faq/#statistics). Note that CI downloads should already be filtered out.
  2. Cache the cover image on gh-pages, i.e. fetch the cover image and resize it to a small thumbnail, so the website won't pull the image from zenodo every time it opens. This will also make the website faster.
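
Option 1 could be sketched roughly like this (a minimal sketch; the function and argument names are illustrative, not Zenodo's actual API):

```python
# Rough sketch of option 1: estimate "real" downloads by dividing the
# reported download volume by the deposit's total file size.
# All names here are illustrative assumptions, not Zenodo's API.

def estimated_downloads(download_volume_bytes: float, file_sizes_bytes: list) -> int:
    """Approximate full-deposit downloads from volume statistics."""
    total_size = sum(file_sizes_bytes)
    if total_size == 0:
        return 0
    return round(download_volume_bytes / total_size)

# e.g. a deposit with 42.8 MB of weights and a 0.8 MB cover that
# reports 436 MB of downloaded volume:
print(estimated_downloads(436_000_000, [42_800_000, 800_000]))  # 10
```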

@FynnBe what do you think?

FynnBe commented 1 year ago

I think 2. is a good idea to speed up the website (please specify some implementation details, like the desired file size and extension), but what became of the idea to track download counts through the core libraries with a third-party service? Otherwise I'm thinking:

  • normalize by size: if someone specifies more than one set of weights, download volume might be skewed due to selective weights-format downloads. Almost a disincentive to upload multiple weights formats.
  • normalize by file count: same issue; a well-documented description with multiple cover images, maybe multiple input tensors, gets undercounted.

But from https://help.zenodo.org/faq/#statistics I take it that we wouldn't have to divide the unique download count by the number of files, as all files downloaded within a 1-hour window count as one unique download. So maybe we just take that in combination with cover image caching and call it good enough for now?

oeway commented 1 year ago

> I think 2. is a good idea to speed up the website (please specify some implementation details, like the desired file size and extension), but what became of the idea to track download counts through the core libraries with a third-party service?

Yes, using the core library could be more accurate; however, I don't like the fact that we would require every client (core, the packager, deepimagej, AVIA, qupath, etc.) to actively report downloads. It would make the statistics system much more complicated to maintain: every update would need to propagate to all the clients, implemented in different languages, and if anything goes wrong we lose download counts. There might also be GDPR implications for collecting such data. If we can get the download statistics from zenodo directly, it will be much more maintainable and effortless.

> Otherwise I'm thinking:
>
>   • normalize by size: if someone specifies more than one set of weights, download volume might be skewed due to selective weights-format downloads. Almost a disincentive to upload multiple weights formats.
>   • normalize by file count: same issue; a well-documented description with multiple cover images, maybe multiple input tensors, gets undercounted.

> But from https://help.zenodo.org/faq/#statistics I take it that we wouldn't have to divide the unique download count by the number of files, as all files downloaded within a 1-hour window count as one unique download. So maybe we just take that in combination with cover image caching and call it good enough for now?

Yes, if that's the case, that would be great! Otherwise, I think we don't need a perfect solution; if we can manage to improve on our current method, we can put a disclaimer when displaying these statistics, making it transparent where the error can come from. That should be good enough for now.

For the cover image: the actual displayed size is W: 296, H: 167. I am thinking if we make it twice that, say 600x340, it should be good enough.

Then we can use PNG as the format.
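
A thumbnail step along those lines might look like this (a sketch, not the actual implementation; the pure helper computes a bounded size without upscaling, and the filenames are placeholders):

```python
# Sketch: fit a cover image into a 600x340 bounding box without
# upscaling or distorting it. Filenames and sizes are illustrative.

def thumbnail_size(width: int, height: int, max_w: int = 600, max_h: int = 340):
    """Return the largest size <= (max_w, max_h) preserving aspect ratio."""
    scale = min(max_w / width, max_h / height, 1.0)  # cap at 1.0: never upscale
    return round(width * scale), round(height * scale)

# The actual resize could then use Pillow (if available):
# from PIL import Image
# im = Image.open("cover.png")
# im.thumbnail((600, 340))        # in-place, preserves aspect ratio
# im.save("cover.thumbnail.png", format="PNG")
```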

FynnBe commented 1 year ago

> If we can get the download statistics from zenodo directly, it will be much more maintainable and effortless.

Yes, definitely 👍

Alright, I'll implement it in the coming weeks.

oeway commented 1 year ago

Hi @FynnBe, would you have some time to look into this one? We are gathering download numbers for a report; it would be great if we could access more realistic download numbers.

FynnBe commented 1 year ago

We are using zenodo's unique_download count already: https://github.com/bioimage-io/collection-bioimage-io/blob/2e29ed107770b35d8a47b86961fc438b9ecc9114/scripts/update_external_resources.py#L199

oeway commented 1 year ago

Great! The numbers still seem a bit too high; fixing the caching of the cover images is definitely the way to go. But for now, would the numbers be more realistic if we divided the download volume by the total file size? That would fix the existing stats for the models. What do you think?

FynnBe commented 1 year ago

Sure, I'll calculate that for all current download counts and add it as an offset to zenodo's reported download count.
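
The offset could be derived roughly like this (a hedged sketch; the names are made up here, not taken from the collection scripts):

```python
# Sketch: one-time correction offset so that displayed counts become
# a volume-based estimate plus future Zenodo unique downloads.
# Names are illustrative, not from update_external_resources.py.

def download_count_offset(reported_unique: int, volume_bytes: float,
                          total_size_bytes: float) -> int:
    """Offset to add to Zenodo's reported unique download count."""
    volume_based_estimate = round(volume_bytes / total_size_bytes)
    return volume_based_estimate - reported_unique

# displayed count = zenodo unique downloads + offset
print(download_count_offset(100, 436_000_000, 43_600_000))  # -90
```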

oeway commented 1 year ago

@FynnBe This is fantastic! Thanks a lot!

FynnBe commented 1 year ago

Still need to add caching soon... currently we create thumbnails on every CI run.

oeway commented 1 year ago

For the record, I took a screenshot of the current download numbers for page 3.

Let's check how the numbers change tomorrow. One thing to verify is whether the CI increases the download numbers.

Screenshot 2023-09-13 at 17 31 16
oeway commented 1 year ago

@FynnBe This is how it looks today:

Screenshot 2023-09-18 at 22 24 12

From what I can see, some models didn't change while others increased their numbers. What do you think?

FynnBe commented 1 year ago

That's very good! So bioimage.io visits no longer trigger download count bumps. We should keep in mind, though, that the collection CI has been failing these last few days: https://github.com/bioimage-io/collection-bioimage-io/actions/workflows/auto_update_main.yaml. I'll take a look into that...

FynnBe commented 1 year ago

Opened a PR to fix the broken deepimagej manifest: https://github.com/deepimagej/models/pull/51. Moving forward, https://github.com/bioimage-io/collection-bioimage-io/pull/635 should keep the CI from bumping download counts multiple times, for now.

oeway commented 1 year ago

Cross reference: https://github.com/bioimage-io/bioimage.io/issues/353

oeway commented 1 year ago

@FynnBe It looks like we are counting the CI downloads; in the history, every model increases its download count by 3: https://github.com/bioimage-io/collection-bioimage-io/commit/f8b4156caa144d47002fdeb082d8308c89570b50

Is this because we downloaded the yaml file? Based on the definition of a unique download, if we download any file in the deposit within a 1-hour window, it counts as one unique download -- this is different from what we want.

Maybe it's better to use the total volume / total size after all.

The model "3D UNet Arabidopsis Apical " for example: https://bioimage.io/#/?id=10.5281%2Fzenodo.6346511

The pytorch package size is 42.8 MB, while the two cover images are 0.8 MB and the yaml file 3 KB; this means the two cover images would have to be downloaded 52 times, or the RDF yaml file 14,758 times, to add up to the volume of 1 full download.

FynnBe commented 1 year ago

> @FynnBe It looks like we are counting the CI downloads; in the history, every model increases its download count by 3: f8b4156
>
> Is this because we downloaded the yaml file? Based on the definition of a unique download, if we download any file in the deposit within a 1-hour window, it counts as one unique download -- this is different from what we want.
>
> Maybe it's better to use the total volume / total size after all.
>
> The model "3D UNet Arabidopsis Apical " for example: https://bioimage.io/#/?id=10.5281%2Fzenodo.6346511
>
> The pytorch package size is 42.8 MB, while the two cover images are 0.8 MB and the yaml file 3 KB; this means the two cover images would have to be downloaded 52 times, or the RDF yaml file 14,758 times, to add up to the volume of 1 full download.

The issue with the download volume is the case where two or more weights formats are specified. We encourage downloads of the preferred weights format only, so each real download would only count as approx. 1/2 or 1/3...

Let's update the bioimageio packages to check the "CI" env var and adapt our requests accordingly... The factor of 3 is weird though; I'll try to investigate. Nothing should request all RDFs from zenodo... Our CI only downloads during PRs and on bioimageio version bumps. (So it might be others' CI doing the regular bumping; we need to amend the request headers.)
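
The env-var check could look something like this (a sketch under assumptions: the helper name and User-Agent format are hypothetical; `CI` is the variable GitHub Actions and most CI services set):

```python
import os

def bioimageio_user_agent(version: str = "0.0.0") -> str:
    """Build a User-Agent that lets server-side filtering spot CI traffic.

    Hypothetical helper; the name and format are illustrative only.
    """
    ua = f"bioimageio.core/{version}"
    if os.environ.get("CI"):  # set by GitHub Actions and most CI services
        ua += " (CI)"
    return ua

# e.g. with requests (if available):
# requests.get(url, headers={"User-Agent": bioimageio_user_agent("0.5.0")})
```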

oeway commented 1 year ago

@FynnBe I think the issue is that we cannot rely on no file being downloaded from the deposit. I just checked: our yaml file contains the zenodo links to the cover images and test inputs/outputs. This means that when a user opens a model card, or does a test run, files are pulled from zenodo, which increases the download number.

Maybe we should replace the cover image link within the RDF file too, so that opening the model card won't count as a unique download.

For a more conservative download count, maybe a model's download number should be computed from the volume. Or present it as is, labeled as a download count by volume; and when we compute the total size, we can average the weights file sizes so we get closer to reality.

We can perhaps use the unique download number and rename it to a user interaction count, so any interaction, either a download or running the test input/output, counts as 1 user interaction.

Thanks for the CI configuration, that helps a lot.

What do you think?

FynnBe commented 1 year ago

> @FynnBe I think the issue is that we cannot rely on no file being downloaded from the deposit. I just checked: our yaml file contains the zenodo links to the cover images and test inputs/outputs. This means that when a user opens a model card, or does a test run, files are pulled from zenodo, which increases the download number.

The collection.json that the website should use does not contain any zenodo links: https://github.com/bioimage-io/collection-bioimage-io/blob/gh-pages/collection.json. But the collection.json does not contain links to test inputs/outputs... (We could opt to cache them just like the cover images, though at some point that might be a lot to deploy to gh-pages...)

> Maybe we should replace the cover image link within the RDF file too, so that opening the model card won't count as a unique download.

We could, but for the website you might as well read the covers from the collection.json file. I'd like to keep the reference to the original, non-resized cover stored at zenodo in the RDF and have the thumbnail in collection.json. Or is the thumbnail not big enough for the expanded model card?

> For a more conservative download count, maybe a model's download number should be computed from the volume. Or present it as is, labeled as a download count by volume; and when we compute the total size, we can average the weights file sizes so we get closer to reality.

I think we'll manage to make the unique download count a meaningful measure. Once our website, our CIs, and the core Python packages (and Java libraries) all account for use in CI, this should be the best measure. Until then we can update the offset using the download-volume estimate. But a rogue CI using zenodo links directly will also increase the download volume more than we want any CI to, so we can only encourage users and developers to interact with the model zoo through our libraries, which set the "User-Agent" accordingly...

> We can perhaps use the unique download number and rename it to a user interaction count, so any interaction, either a download or running the test input/output, counts as 1 user interaction.

That's a good proposal 👍 (Then we also don't need to worry about caching test inputs/outputs.)

> Thanks for the CI configuration, that helps a lot.

we are getting there...