dockstore / dockstore

Our VM/Docker sharing infrastructure and management component
https://dockstore.org/
Apache License 2.0

Harvest DOIs from Zenodo Proof of Concept #5880

Open coverbeck opened 1 month ago

coverbeck commented 1 month ago

Description

This is very drafty; I'm looking for feedback on the overall concept, and whether we should go down this route. FWIW, I think we should.

The code looks for Zenodo DOIs associated with the GitHub repos of published workflows. It found DOIs for 89 repos that have registered workflows in Dockstore. 19 of those repos have more than one workflow, so we would not be able to tell which workflow(s) the DOI applies to.
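The grouping step that produces the per-repo results shown below could be sketched roughly like this (Python as a stand-in for brevity; the extraction of `(doi_url, repo)` pairs from the Zenodo API response is elided, and the helper name is hypothetical):

```python
from collections import defaultdict

def group_dois_by_repo(hits):
    """Group Zenodo search hits into the {"dois": [...], "repo": ...} shape.

    `hits` is assumed to be an iterable of (doi_url, repo) pairs already
    extracted from the Zenodo API response; fetching and parsing that
    response is not shown here.
    """
    by_repo = defaultdict(list)
    for doi_url, repo in hits:
        # Dedupe DOIs per repo while preserving first-seen order.
        if doi_url not in by_repo[repo]:
            by_repo[repo].append(doi_url)
    return [{"dois": dois, "repo": repo} for repo, dois in by_repo.items()]
```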

To complete this PR:

  1. Fetch all the DOI versions related to a single DOI (another Zenodo call)
  2. Assign the DOIs to workflow versions
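The two remaining steps might look roughly like the sketch below (Python as a stand-in; the `/versions` endpoint shape and the assumption that `metadata.version` carries the git tag are unverified assumptions, not code from this PR):

```python
import json
import urllib.request

def fetch_doi_versions(record_id):
    """Step 1: fetch all version records for a Zenodo concept record.

    Assumes a versions listing endpoint of this shape; the exact Zenodo
    call used by the PR is not shown in the discussion.
    """
    url = f"https://zenodo.org/api/records/{record_id}/versions"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["hits"]["hits"]

def assign_dois_to_versions(version_records, workflow_version_names):
    """Step 2: match Zenodo version records to workflow versions by git tag.

    Assumes each record's metadata.version holds the tag name, which is
    typically how the Zenodo/GitHub integration populates it.
    """
    assignments = {}
    for record in version_records:
        tag = record.get("metadata", {}).get("version")
        if tag in workflow_version_names:
            assignments[tag] = record.get("doi_url") or record.get("doi")
    return assignments
```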

Things to figure out

Some of these also come up in #5879

Review Instructions

Issue dockstore/dockstore#5745

Security and Privacy

If there are any concerns that require extra attention from the security team, highlight them here and check the box when complete.

e.g. Does this change...

Please make sure that you've checked the following before submitting your pull request. Thanks!

coverbeck commented 1 month ago

Ran it on all our workflows. Found 71 repos referenced by DOIs that have only 1 workflow in Dockstore, and an additional 18 repos referenced by DOIs that have more than 1 workflow.

coverbeck commented 1 month ago

Here is the result for repos with 1 workflow:

[
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10901674"
    ],
    "repo": "nf-core/airrflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4208836"
    ],
    "repo": "h3abionet/TADA"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10846111"
    ],
    "repo": "nf-core/funcscan"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10687430"
    ],
    "repo": "nf-core/eager"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10986616"
    ],
    "repo": "nf-core/riboseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10463781"
    ],
    "repo": "nf-core/methylseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10643212"
    ],
    "repo": "nf-core/circdna"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10911752"
    ],
    "repo": "nf-core/metatdenovo"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7643948"
    ],
    "repo": "nf-core/phyloplace"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10707294"
    ],
    "repo": "nf-core/epitopeprediction"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8164980"
    ],
    "repo": "nf-core/viralintegration"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5104871"
    ],
    "repo": "denis-yuen/galaxy-workflow-dockstore-example-2"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7764938",
      "https://doi.org/10.5281/zenodo.3746584"
    ],
    "repo": "nf-core/viralrecon"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4723017"
    ],
    "repo": "nf-core/clipseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10651816"
    ],
    "repo": "nf-core/mag"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8427707"
    ],
    "repo": "nf-core/mhcquant"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7220729"
    ],
    "repo": "nf-core/hlatyping"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6515313"
    ],
    "repo": "nf-core/hicar"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10696391"
    ],
    "repo": "nf-core/smrnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7139814"
    ],
    "repo": "nf-core/chipseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4106005"
    ],
    "repo": "nf-core/proteomicslfq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10554425"
    ],
    "repo": "nf-core/scrnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10471647"
    ],
    "repo": "nf-core/rnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.11126488"
    ],
    "repo": "nf-core/sarek"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10650749"
    ],
    "repo": "nf-core/molkart"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10783110"
    ],
    "repo": "nf-core/nascent"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10952554"
    ],
    "repo": "nf-core/rnafusion"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6669637"
    ],
    "repo": "nf-core/rnavar"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4469317"
    ],
    "repo": "gatk-workflows/gatk4-data-processing"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6403063"
    ],
    "repo": "kathy-t/workflow-dockstore-yml"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5104898"
    ],
    "repo": "Richard-Hansen/dockstore-whalesay"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7994878"
    ],
    "repo": "nf-core/hic"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10036158"
    ],
    "repo": "nf-core/bacass"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10209675"
    ],
    "repo": "nf-core/differentialabundance"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7716033"
    ],
    "repo": "nf-core/nanoseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7689178"
    ],
    "repo": "denis-yuen/test-workflows-and-tools"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10728509"
    ],
    "repo": "nf-core/fetchngs"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10124950"
    ],
    "repo": "kathy-t/SRANWRP"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7941033"
    ],
    "repo": "nf-core/hgtseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5039442"
    ],
    "repo": "dockstore-personal-testing/gatk4-exome-analysis-pipeline-flat"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4304953"
    ],
    "repo": "Richard-Hansen/hello_world"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.3508160",
      "https://doi.org/10.5281/zenodo.3928817",
      "https://doi.org/10.5281/zenodo.3401699"
    ],
    "repo": "wshands/hmmer-docker"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7080256"
    ],
    "repo": "david4096/autopotato-attack"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8222875"
    ],
    "repo": "nf-core/atacseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.2628872"
    ],
    "repo": "ICGC-TCGA-PanCancer/Seqware-BWA-Workflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6141389"
    ],
    "repo": "Richard-Hansen/dockstore-tool-helloworld"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5104873"
    ],
    "repo": "garyluu/example_cwl_workflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.1491630"
    ],
    "repo": "nf-core/deepvariant"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.3571864"
    ],
    "repo": "nf-core/neutronstar"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.2582812"
    ],
    "repo": "SciLifeLab/Sarek"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4536530"
    ],
    "repo": "Richard-Hansen/dockstore-whalesay2"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10668725"
    ],
    "repo": "ENCODE-DCC/atac-seq-pipeline"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10718449"
    ],
    "repo": "nf-core/demultiplex"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8354570"
    ],
    "repo": "iwc-workflows/rnaseq-pe"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10067429"
    ],
    "repo": "nf-core/quantms"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7629996"
    ],
    "repo": "nf-core/proteinfold"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10622411"
    ],
    "repo": "nf-core/readsimulator"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10868876"
    ],
    "repo": "nf-core/raredisease"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10912278"
    ],
    "repo": "nf-core/ampliseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10527467"
    ],
    "repo": "nf-core/nanostring"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8159051"
    ],
    "repo": "nf-core/marsseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10634361"
    ],
    "repo": "nf-core/taxprofiler"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8354480"
    ],
    "repo": "iwc-workflows/rnaseq-sr"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6403310"
    ],
    "repo": "kathy-t/hello-wdl-workflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10895229"
    ],
    "repo": "nf-core/pixelator"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10606804"
    ],
    "repo": "nf-core/cutandrun"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10877148"
    ],
    "repo": "nf-core/detaxizer"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4540719"
    ],
    "repo": "nf-core/dualrnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10406093"
    ],
    "repo": "nf-core/metaboigniter"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8414663"
    ],
    "repo": "nf-core/bamtofastq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10696998"
    ],
    "repo": "nf-core/rnasplice"
  }
]
denis-yuen commented 1 month ago

> Do we snapshot a version before assigning the DOI? I would argue no.

I think no too. I think I'm leaning toward @svonworl 's (?) idea to have more database structure around how we track DOIs, and to track user-generated, auto-generated, and GitHub-harvested(?) DOIs separately.

This is my preference at least for the first iteration of this.

Works for me; also OK with "let the user choose". For most purposes, I'd think that "let the user choose from a list of likely suspects" would be OK as a first/second pass too, just not a final pass.

> We probably need to track the "source" of the DOI. For example, Kathy is working on code to issue editing URLs for DOIs created by Dockstore with the Dockstore Zenodo account; that won't make sense for these DOIs.

Kinda feel like they should just be separately tracked/different classes.

> We invoke the above endpoint on a schedule, e.g., nightly. How do we do that?

Seems familiar, @kathy-t. More for a periodic listener, a registration method for cron-like events, etc. (along with topic generation, metric aggregation, etc.), unless we do it all externally like with the ECS cron.

kathy-t commented 1 month ago

> I think I'm leaning toward @svonworl 's (?) idea to have more database structure around how we track DOIs and track user-generated, auto-generated, and github harvested(?) DOIs separately.

I'm working on implementing this in my PR to allow users to generate their own DOIs for a workflow that already has Dockstore DOIs.

> We invoke the above endpoint on a schedule, e.g., nightly. How do we do that? Seems familiar @kathy-t More for a periodic listener, a registration method for cron-like events, etc. (along with topic generation, metric aggregation, etc.) Unless we do it all externally like with the ECS cron

Perhaps we can set up some type of AWS queue that holds the tags that need to be checked, and we can have a scheduled Lambda that pulls from the queue (AWS doc).
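The queue-plus-scheduled-Lambda idea could look roughly like the following (the message shape and the `check_tag` callback are hypothetical; the SQS client is passed in so the loop can be exercised without AWS):

```python
import json

def drain_and_check(sqs_client, queue_url, check_tag):
    """Sketch of a scheduled Lambda body: pull pending tag-check messages
    from an SQS queue and run the DOI check on each.

    `sqs_client` is a boto3-style SQS client (injected for testability);
    `check_tag(repo, tag)` is a hypothetical callback doing the Zenodo lookup.
    """
    checked = []
    while True:
        resp = sqs_client.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            body = json.loads(msg["Body"])
            check_tag(body["repo"], body["tag"])
            checked.append((body["repo"], body["tag"]))
            # Only delete after a successful check, so failures are retried.
            sqs_client.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return checked
```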

denis-yuen commented 1 month ago

> Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue

May be overkill; just like with topics, it should be possible to compute which tags are eligible for DOIs and process them without needing to keep any extra state around.
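The stateless version could be as simple as recomputing the work list on every run, with no queue at all (the tuple shape here is a hypothetical stand-in for a db query result):

```python
def tags_needing_dois(workflows):
    """Recompute, on each scheduled run, which workflow versions still lack
    a DOI. `workflows` is an iterable of
    (workflow_name, version_names, versions_with_dois) tuples, standing in
    for what a periodic database query would return.
    """
    todo = []
    for name, versions, with_dois in workflows:
        for v in versions:
            if v not in with_dois:
                todo.append((name, v))
    return todo
```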

svonworl commented 1 month ago

> 19 of those repos have more than 1 workflow, we would not be able to tell which workflow(s) the DOI applies to.

The "GitHub" DOIs seem to reference the repo, rather than a particular entry within. So, maybe said DOIs reference all entries in the repo? [Postscript: just noticed you mentioned this possibility farther down in your description...]

svonworl commented 1 month ago

> Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue (AWS doc)

Random idea (with its own set of pros and cons):

An "updater" Java application that accesses the db via our Hibernate interfaces and runs alongside the webservice. It figures out what needs to be updated (via a queue or maybe a periodic db query that returns what's been recently changed, etc) and then updates it. Could update AI topics, collect DOIs, etc. Could be a single monolithic updater with plugins, or separate updaters specialized for each type of update. Good for asynchronous updates that might produce tardy responses if we tried to do the updates in the webservice request handlers. Not as scalable as lambdas, but probably easier to code.

svonworl commented 1 month ago

> Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue (AWS doc)
>
> Random idea (with its own set of pros and cons):
>
> An "updater" Java application that accesses the db via our Hibernate interfaces [...]

A variant is that, instead of a separate application, the "updater" is a pool of background priority threads that runs in the webservice application itself. It pulls tasks from a pool and executes them (where a task is something like "update the AI topic for this entry" or "collect the GitHub DOIs for this entry"). The main request thread handler can queue tasks up before it returns, and they'll run asynchronously, later, in their own database session. And/or, tasks can be queued by a thread that inspects the db to determine what needs updates. Or periodically...
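That in-process variant might look like the following (Python stand-in for a Java background thread pool; the task shape, a plain callable, is hypothetical):

```python
import queue
import threading

class BackgroundTaskPool:
    """Sketch of the in-webservice variant: a pool of daemon threads pulls
    tasks from a shared queue and runs each asynchronously, e.g. "update
    the AI topic for this entry" or "collect the GitHub DOIs for this entry".
    Request handlers submit tasks and return immediately; each task would
    open its own database session when it runs."""

    def __init__(self, workers=2):
        self.tasks = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            task = self.tasks.get()
            try:
                task()
            finally:
                self.tasks.task_done()

    def submit(self, task):
        self.tasks.put(task)

    def join(self):
        # Block until every submitted task has finished.
        self.tasks.join()
```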