FynnBe commented 1 year ago

Here are some notes that hopefully --- through some more discussion --- will turn into an overview of coming changes. Special thanks to @oeway and @k-dominik for discussions so far.

Shortcomings of current system

Unclear source of truth --- descriptions on zenodo are patched in this GH repo
Users may hit rate limits that are out of our control (set by Zenodo or GitHub)
Lengthy loop of (1) proposing new descriptions (currently by uploading to Zenodo), (2) testing them on our side, and (3) updating the proposal, i.e. going back to (1), which results in unusable versions.

Currently ruled out, potential ways to address shortcomings

don't patch
- con: published Zenodo records are often in need of patching
cache to an S3 storage under our control
- con: with multiple sources of truth and descriptions contributed by partners directly via GH, keeping a valid cache is challenging and itself relies on access to GH/Zenodo
Use Zenodo sandbox for description proposals
- con: may disappear if proposal proceeds too slowly, storage not under our control

The currently most promising way to address shortcomings

S3 first approach:
- Proposals get a bioimageio internal id right away; once they are accepted we publish them on Zenodo and add the concept DOI and version DOI [^1] (maybe we make the version field mandatory from now on, so semantic versions can be mapped to DOIs?).
- Description updates get a bioimageio internal id right away (maybe 'update-' + their id?); once the update is accepted we publish it on Zenodo and get a new version DOI.
  1. The S3 first approach makes sure that we are in control of any rate limits
  2. The S3 first approach allows for immediate evaluation of user uploads
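
To make the id and DOI bookkeeping above a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `ResourceRecord` class, the helper names, the use of a random hex string as internal id, and the placeholder DOIs are made up. Only the 'update-' + id idea and the semantic version to DOI mapping come from the notes above.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ResourceRecord:
    """Hypothetical bookkeeping entry for one resource description."""

    internal_id: str
    concept_doi: Optional[str] = None  # unknown until first published on Zenodo
    # maps semantic version -> version DOI (only possible if `version` is mandatory)
    version_dois: Dict[str, str] = field(default_factory=dict)


def new_internal_id() -> str:
    """Assign an internal id right away on upload (random hex is an assumption)."""
    return uuid.uuid4().hex


def update_id(internal_id: str) -> str:
    """Id for a pending update, following the 'update-' + id idea."""
    return "update-" + internal_id


def record_publication(
    record: ResourceRecord,
    version: str,
    version_doi: str,
    concept_doi: Optional[str] = None,
) -> None:
    """Store the DOIs obtained after an accepted proposal/update was published."""
    if version in record.version_dois:
        raise ValueError(f"version {version} already has a DOI")
    if record.concept_doi is None:
        record.concept_doi = concept_doi
    record.version_dois[version] = version_doi


record = ResourceRecord(internal_id=new_internal_id())
pending = update_id(record.internal_id)
# placeholder DOIs, not real Zenodo records
record_publication(
    record, "0.1.0", "10.0000/placeholder.2", concept_doi="10.0000/placeholder.1"
)
```

The point of the mapping is that a client asking for a semantic version can be pointed at the matching version DOI without querying Zenodo.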

Cons of "S3 first"

not free
renders Zenodo's download statistics for bioimageio descriptions meaningless (we need our own solution; @oeway proposed a light-weight proxy service that keeps track of accesses. Alternatively, maybe there are some "built-in" mechanisms for this, something in the direction of https://docs.aws.amazon.com/AmazonS3/latest/userguide/aws-usage-report.html)
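
As a rough illustration of the proxy idea, the core could be as small as the sketch below: count the access, then hand back the storage URL to redirect to. The class name, the in-memory counter and the example URL are all assumptions; a real service would sit behind an HTTP handler (answering with a redirect) and persist the counts somewhere.

```python
from collections import Counter


class AccessTracker:
    """Hypothetical core of a light-weight proxy that counts accesses."""

    def __init__(self, storage_base_url: str) -> None:
        self.storage_base_url = storage_base_url.rstrip("/")
        self.counts: Counter = Counter()  # object key -> number of accesses

    def resolve(self, key: str) -> str:
        """Record one access to `key` and return the URL to redirect to."""
        key = key.lstrip("/")
        self.counts[key] += 1
        return f"{self.storage_base_url}/{key}"
```

With this, download statistics per description would come from `counts` instead of from Zenodo.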

Still unclear (to me) about "S3 first"

replacement/update of the current resource description review process including the generated PR that serves as a space to have a chat between contributor and bioimageio maintainers.
- Maybe we can use https://gitter.im/ ? Apparently there is a matrix.org based API, so we could create a channel for each resource description.
- looking into gitter brings me to our AI4Life matrix ...

Details in need of further discussion/thought

Use of S3 object redirects to realize the concept of a resource pointing to its latest version (the alternative is of course simple duplication)
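
For reference, a sketch of what the redirect variant could look like. Amazon S3 supports per-object redirects through the `x-amz-website-redirect-location` metadata, which boto3 exposes as the `WebsiteRedirectLocation` parameter of `put_object`; as far as I know it is only honored when the bucket is served through a static website endpoint, and other S3-compatible stores may not support it at all. The key layout (`<id>/<version>/rdf.yaml`, `<id>/latest/rdf.yaml`) and the function name are assumptions for illustration.

```python
from typing import Any, Dict


def latest_redirect_request(bucket: str, resource_id: str, version: str) -> Dict[str, Any]:
    """Build put_object arguments for an empty 'latest' object that redirects
    to the given version (hypothetical key layout)."""
    target = f"/{resource_id}/{version}/rdf.yaml"  # must start with "/" or a URL scheme
    return {
        "Bucket": bucket,
        "Key": f"{resource_id}/latest/rdf.yaml",
        "Body": b"",
        "WebsiteRedirectLocation": target,
    }


request = latest_redirect_request("my-bucket", "some-model", "0.1.0")
# With boto3 installed and credentials configured this would be sent as:
#   boto3.client("s3").put_object(**request)
```

Publishing a new version then only means overwriting the small 'latest' object, whereas the duplication variant has to copy every file of the new version.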

[^1]: note that one can reserve a DOI and then, e.g. include it in files in that record, see "Can I know the DOI of my record before publishing, so that I can include it in the paper or dataset?"

FynnBe commented 1 year ago

Thanks for the discussion, @jmetz

Our idea:

use GitLab

resource (model) contributors have an account on our GitLab server
upload to S3 creates repo (under their own user account)
merge request to the general collection
testing etc with CI (and possibly on GPUs)

We need

GitLab instance
- Test CI (rather simple as per resource)
- figure out a bunch of details!

jmetz commented 1 year ago

Also, as GitLab can be configured to use S3 for all of its storage, this might even simplify things further.

bioimage-io / collection-bioimage-io

Redesign of bioimage.io collection #659