bioimage-io / collection-bioimage-io

(deprecated in favor of bioimage-io/collection) RDF collection for BioImage.IO

Redesign of bioimage.io collection #659

Open FynnBe opened 1 year ago

FynnBe commented 1 year ago

Here are some notes that hopefully --- through some more discussion --- will turn into an overview of coming changes. Special thanks to @oeway and @k-dominik for discussions so far.

Shortcomings of current system

  1. Unclear source of truth --- descriptions on Zenodo are patched in this GH repo
  2. Users may suffer from rate limits set out of our control (by Zenodo or GitHub)
  3. Lengthy loop of (1) proposing new descriptions (currently by uploading to Zenodo), (2) testing them on our side, and (3) updating the proposal, which goes back to (1) and leaves unusable versions behind.

Potential ways to address shortcomings that are currently ruled out

  1. don't patch
    • con: published Zenodo records are often in need of patching
  2. cache to an S3 storage under our control
    • con: with multiple sources of truth and descriptions contributed by partners directly via GH, keeping a valid cache is challenging and itself relies on access to GH/Zenodo
  3. Use Zenodo sandbox for description proposals
    • con: may disappear if proposal proceeds too slowly, storage not under our control

The currently most promising way to address shortcomings

  1. S3 first approach:
    • Proposals get a bioimageio internal id right away; once they are accepted, we publish them on Zenodo and add the concept DOI and version DOI[^1] (maybe we make the version field mandatory from now on, so semantic versions can be mapped to DOIs?).
    • Description updates get a bioimageio internal id right away (maybe 'update-' + their id?); once the update is accepted, we publish it on Zenodo and get a new version DOI.
      1. The S3 first approach makes sure that we are in control of any rate limits
      2. The S3 first approach allows for immediate evaluation of user uploads
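
To make the proposed lifecycle easier to discuss, here is a minimal Python sketch of the bookkeeping it implies. It is not an implementation: `Description`, `new_proposal`, `propose_update` and `accept` are hypothetical names, the S3 upload and the Zenodo publishing step are left out entirely, and the DOIs are placeholders. It assumes the version field is mandatory and that update ids are formed as 'update-' + the original id, both of which are still open questions above.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Description:
    """Hypothetical record for a resource description kept in our own storage."""

    internal_id: str
    version: str  # semantic version, assumed mandatory in this sketch
    status: str = "proposed"  # becomes "accepted" once published
    concept_doi: Optional[str] = None
    version_dois: Dict[str, str] = field(default_factory=dict)  # version -> DOI


def new_proposal(version: str) -> Description:
    """A new proposal gets an internal id right away, before any DOI exists."""
    return Description(internal_id=uuid.uuid4().hex, version=version)


def propose_update(base: Description, version: str) -> Description:
    """An update gets its own internal id, here 'update-' + the base id."""
    return Description(
        internal_id="update-" + base.internal_id,
        version=version,
        concept_doi=base.concept_doi,
        version_dois=dict(base.version_dois),
    )


def accept(
    desc: Description, version_doi: str, concept_doi: Optional[str] = None
) -> None:
    """Record the DOIs obtained from publishing (publishing itself is out of scope)."""
    if desc.concept_doi is None:
        if concept_doi is None:
            raise ValueError("first accepted version needs a concept DOI")
        desc.concept_doi = concept_doi
    desc.version_dois[desc.version] = version_doi
    desc.status = "accepted"


# Placeholder DOIs, for illustration only.
first = new_proposal("0.1.0")
accept(first, version_doi="10.0000/placeholder.2", concept_doi="10.0000/placeholder.1")
update = propose_update(first, "0.1.1")
accept(update, version_doi="10.0000/placeholder.3")
```

Because the internal id exists from the moment of upload, testing can start immediately, and the `version_dois` mapping only grows when an accepted version is actually published.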

Cons of "S3 first"

Still unclear (to me) about "S3 first"

Details in need of further discussion/thought

[^1]: Note that one can reserve a DOI and then, e.g., include it in files in that record; see "Can I know the DOI of my record before publishing, so that I can include it in the paper or dataset?"

FynnBe commented 1 year ago

Thanks for the discussion, @jmetz

Our idea:

use GitLab

We need

jmetz commented 1 year ago

Also, as GitLab can be configured to use S3 for all of its storage, this might even simplify things further.