dockstore / dockstore

Our VM/Docker sharing infrastructure and management component
https://dockstore.org/
Apache License 2.0

Harvest DOIs from Zenodo Proof of Concept #5880

Open coverbeck opened 1 month ago

coverbeck commented 1 month ago

Description

This is very drafty; I'm looking for feedback on the overall concept, and whether we should go down this route. FWIW, I think we should.

The code looks for Zenodo DOIs associated with the GitHub repos of published workflows. It found DOIs for 89 repos that have registered workflows in Dockstore. 19 of those repos have more than one workflow, so we would not be able to tell which workflow(s) the DOI applies to.
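The grouping step that produces the per-repo results shown below could be sketched roughly like this (Python as a stand-in for brevity; the extraction of `(doi_url, repo)` pairs from the Zenodo API response is elided, and the helper name is hypothetical):

```python
from collections import defaultdict

def group_dois_by_repo(hits):
    """Group Zenodo search hits into the {"dois": [...], "repo": ...} shape.

    `hits` is assumed to be an iterable of (doi_url, repo) pairs already
    extracted from the Zenodo API response; fetching and parsing that
    response is not shown here.
    """
    by_repo = defaultdict(list)
    for doi_url, repo in hits:
        # Dedupe DOIs per repo while preserving first-seen order.
        if doi_url not in by_repo[repo]:
            by_repo[repo].append(doi_url)
    return [{"dois": dois, "repo": repo} for repo, dois in by_repo.items()]
```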

To complete this PR:

  1. Fetch all the DOI versions related to a single DOI (another Zenodo call)
  2. Assign the DOIs to workflow versions
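The two remaining steps might look roughly like the sketch below (Python as a stand-in; the `/versions` endpoint shape and the assumption that `metadata.version` carries the git tag are unverified assumptions, not code from this PR):

```python
import json
import urllib.request

def fetch_doi_versions(record_id):
    """Step 1: fetch all version records for a Zenodo concept record.

    Assumes a versions listing endpoint of this shape; the exact Zenodo
    call used by the PR is not shown in the discussion.
    """
    url = f"https://zenodo.org/api/records/{record_id}/versions"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["hits"]["hits"]

def assign_dois_to_versions(version_records, workflow_version_names):
    """Step 2: match Zenodo version records to workflow versions by git tag.

    Assumes each record's metadata.version holds the tag name, which is
    typically how the Zenodo/GitHub integration populates it.
    """
    assignments = {}
    for record in version_records:
        tag = record.get("metadata", {}).get("version")
        if tag in workflow_version_names:
            assignments[tag] = record.get("doi_url") or record.get("doi")
    return assignments
```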

Things to figure out

Some of these also come up in #5879

Review Instructions

Issue dockstore/dockstore#5745

Security and Privacy

If there are any concerns that require extra attention from the security team, highlight them here and check the box when complete.

e.g. Does this change...

Please make sure that you've checked the following before submitting your pull request. Thanks!

coverbeck commented 1 month ago

Ran it on all our workflows. Found 71 repos referenced by DOIs that have only 1 workflow in Dockstore, and an additional 18 repos referenced by DOIs that have more than 1 workflow.

coverbeck commented 1 month ago

Here is the result for repos with 1 workflow:

[
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10901674"
    ],
    "repo": "nf-core/airrflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4208836"
    ],
    "repo": "h3abionet/TADA"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10846111"
    ],
    "repo": "nf-core/funcscan"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10687430"
    ],
    "repo": "nf-core/eager"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10986616"
    ],
    "repo": "nf-core/riboseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10463781"
    ],
    "repo": "nf-core/methylseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10643212"
    ],
    "repo": "nf-core/circdna"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10911752"
    ],
    "repo": "nf-core/metatdenovo"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7643948"
    ],
    "repo": "nf-core/phyloplace"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10707294"
    ],
    "repo": "nf-core/epitopeprediction"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8164980"
    ],
    "repo": "nf-core/viralintegration"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5104871"
    ],
    "repo": "denis-yuen/galaxy-workflow-dockstore-example-2"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7764938",
      "https://doi.org/10.5281/zenodo.3746584"
    ],
    "repo": "nf-core/viralrecon"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4723017"
    ],
    "repo": "nf-core/clipseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10651816"
    ],
    "repo": "nf-core/mag"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8427707"
    ],
    "repo": "nf-core/mhcquant"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7220729"
    ],
    "repo": "nf-core/hlatyping"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6515313"
    ],
    "repo": "nf-core/hicar"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10696391"
    ],
    "repo": "nf-core/smrnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7139814"
    ],
    "repo": "nf-core/chipseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4106005"
    ],
    "repo": "nf-core/proteomicslfq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10554425"
    ],
    "repo": "nf-core/scrnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10471647"
    ],
    "repo": "nf-core/rnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.11126488"
    ],
    "repo": "nf-core/sarek"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10650749"
    ],
    "repo": "nf-core/molkart"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10783110"
    ],
    "repo": "nf-core/nascent"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10952554"
    ],
    "repo": "nf-core/rnafusion"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6669637"
    ],
    "repo": "nf-core/rnavar"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4469317"
    ],
    "repo": "gatk-workflows/gatk4-data-processing"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6403063"
    ],
    "repo": "kathy-t/workflow-dockstore-yml"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5104898"
    ],
    "repo": "Richard-Hansen/dockstore-whalesay"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7994878"
    ],
    "repo": "nf-core/hic"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10036158"
    ],
    "repo": "nf-core/bacass"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10209675"
    ],
    "repo": "nf-core/differentialabundance"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7716033"
    ],
    "repo": "nf-core/nanoseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7689178"
    ],
    "repo": "denis-yuen/test-workflows-and-tools"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10728509"
    ],
    "repo": "nf-core/fetchngs"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10124950"
    ],
    "repo": "kathy-t/SRANWRP"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7941033"
    ],
    "repo": "nf-core/hgtseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5039442"
    ],
    "repo": "dockstore-personal-testing/gatk4-exome-analysis-pipeline-flat"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4304953"
    ],
    "repo": "Richard-Hansen/hello_world"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.3508160",
      "https://doi.org/10.5281/zenodo.3928817",
      "https://doi.org/10.5281/zenodo.3401699"
    ],
    "repo": "wshands/hmmer-docker"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7080256"
    ],
    "repo": "david4096/autopotato-attack"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8222875"
    ],
    "repo": "nf-core/atacseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.2628872"
    ],
    "repo": "ICGC-TCGA-PanCancer/Seqware-BWA-Workflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6141389"
    ],
    "repo": "Richard-Hansen/dockstore-tool-helloworld"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5104873"
    ],
    "repo": "garyluu/example_cwl_workflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.1491630"
    ],
    "repo": "nf-core/deepvariant"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.3571864"
    ],
    "repo": "nf-core/neutronstar"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.2582812"
    ],
    "repo": "SciLifeLab/Sarek"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4536530"
    ],
    "repo": "Richard-Hansen/dockstore-whalesay2"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10668725"
    ],
    "repo": "ENCODE-DCC/atac-seq-pipeline"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10718449"
    ],
    "repo": "nf-core/demultiplex"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8354570"
    ],
    "repo": "iwc-workflows/rnaseq-pe"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10067429"
    ],
    "repo": "nf-core/quantms"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7629996"
    ],
    "repo": "nf-core/proteinfold"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10622411"
    ],
    "repo": "nf-core/readsimulator"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10868876"
    ],
    "repo": "nf-core/raredisease"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10912278"
    ],
    "repo": "nf-core/ampliseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10527467"
    ],
    "repo": "nf-core/nanostring"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8159051"
    ],
    "repo": "nf-core/marsseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10634361"
    ],
    "repo": "nf-core/taxprofiler"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8354480"
    ],
    "repo": "iwc-workflows/rnaseq-sr"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6403310"
    ],
    "repo": "kathy-t/hello-wdl-workflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10895229"
    ],
    "repo": "nf-core/pixelator"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10606804"
    ],
    "repo": "nf-core/cutandrun"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10877148"
    ],
    "repo": "nf-core/detaxizer"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4540719"
    ],
    "repo": "nf-core/dualrnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10406093"
    ],
    "repo": "nf-core/metaboigniter"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8414663"
    ],
    "repo": "nf-core/bamtofastq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10696998"
    ],
    "repo": "nf-core/rnasplice"
  }
]
denis-yuen commented 1 month ago

> Do we snapshot a version before assigning the DOI? I would argue no.

I think no too. I think I'm leaning toward @svonworl 's (?) idea to have more database structure around how we track DOIs, and to track user-generated, auto-generated, and GitHub-harvested(?) DOIs separately.

This is my preference at least for the first iteration of this.

Works for me; also OK with "let the user choose". For most purposes, I'd think that "let the user choose from a list of likely suspects" would be OK as a first/second pass too, just not a final pass.

> We probably need to track the "source" of the DOI. For example, Kathy is working on code to issue editing URLs for DOIs created by Dockstore with the Dockstore Zenodo account; that won't make sense for these DOIs.

Kinda feel like they should just be separately tracked/different classes.

> We invoke the above endpoint on a schedule, e.g., nightly. How do we do that?

Seems familiar, @kathy-t. More for a periodic listener, a registration method for cron-like events, etc. (along with topic generation, metric aggregation, etc.), unless we do it all externally like with the ECS cron.

kathy-t commented 1 month ago

> I think I'm leaning toward @svonworl 's (?) idea to have more database structure around how we track DOIs and track user-generated, auto-generated, and github harvested(?) DOIs separately.

I'm working on implementing this in my PR to allow users to generate their own DOIs for a workflow that already has Dockstore DOIs.

> We invoke the above endpoint on a schedule, e.g., nightly. How do we do that? Seems familiar @kathy-t More for a periodic listener, a registration method for cron-like events, etc. (along with topic generation, metric aggregation, etc.) Unless we do it all externally like with the ECS cron

Perhaps we can set up some type of AWS queue that holds the tags that need to be checked, and we can have a scheduled Lambda that pulls from the queue (AWS doc).
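The queue-plus-scheduled-Lambda idea could look roughly like the following (the message shape and the `check_tag` callback are hypothetical; the SQS client is passed in so the loop can be exercised without AWS):

```python
import json

def drain_and_check(sqs_client, queue_url, check_tag):
    """Sketch of a scheduled Lambda body: pull pending tag-check messages
    from an SQS queue and run the DOI check on each.

    `sqs_client` is a boto3-style SQS client (injected for testability);
    `check_tag(repo, tag)` is a hypothetical callback doing the Zenodo lookup.
    """
    checked = []
    while True:
        resp = sqs_client.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            body = json.loads(msg["Body"])
            check_tag(body["repo"], body["tag"])
            checked.append((body["repo"], body["tag"]))
            # Only delete after a successful check, so failures are retried.
            sqs_client.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return checked
```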

denis-yuen commented 1 month ago

> Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue

May be overkill; just like with topics, it should be possible to compute which tags are eligible for DOIs and process them without needing to keep any extra state around.
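The stateless version could be as simple as recomputing the work list on every run, with no queue at all (the tuple shape here is a hypothetical stand-in for a db query result):

```python
def tags_needing_dois(workflows):
    """Recompute, on each scheduled run, which workflow versions still lack
    a DOI. `workflows` is an iterable of
    (workflow_name, version_names, versions_with_dois) tuples, standing in
    for what a periodic database query would return.
    """
    todo = []
    for name, versions, with_dois in workflows:
        for v in versions:
            if v not in with_dois:
                todo.append((name, v))
    return todo
```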

svonworl commented 1 month ago

> 19 of those repos have more than 1 workflow, we would not be able to tell which workflow(s) the DOI applies to.

The "GitHub" DOIs seem to reference the repo, rather than a particular entry within. So, maybe said DOIs reference all entries in the repo? [Postscript: just noticed you mentioned this possibility farther down in your description...]

svonworl commented 1 month ago

> Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue (AWS doc)

Random idea (with its own set of pros and cons):

An "updater" Java application that accesses the db via our Hibernate interfaces and runs alongside the webservice. It figures out what needs to be updated (via a queue or maybe a periodic db query that returns what's been recently changed, etc) and then updates it. Could update AI topics, collect DOIs, etc. Could be a single monolithic updater with plugins, or separate updaters specialized for each type of update. Good for asynchronous updates that might produce tardy responses if we tried to do the updates in the webservice request handlers. Not as scalable as lambdas, but probably easier to code.

svonworl commented 1 month ago

> Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue (AWS doc)
>
> Random idea (with its own set of pros and cons):
>
> An "updater" Java application that accesses the db via our Hibernate interfaces [...]

A variant is that, instead of a separate application, the "updater" is a pool of background priority threads that runs in the webservice application itself. It pulls tasks from a pool and executes them (where a task is something like "update the AI topic for this entry" or "collect the GitHub DOIs for this entry"). The main request thread handler can queue tasks up before it returns, and they'll run asynchronously, later, in their own database session. And/or, tasks can be queued by a thread that inspects the db to determine what needs updates. Or periodically...
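That in-process variant might look like the following (Python stand-in for a Java background thread pool; the task shape, a plain callable, is hypothetical):

```python
import queue
import threading

class BackgroundTaskPool:
    """Sketch of the in-webservice variant: a pool of daemon threads pulls
    tasks from a shared queue and runs each asynchronously, e.g. "update
    the AI topic for this entry" or "collect the GitHub DOIs for this entry".
    Request handlers submit tasks and return immediately; each task would
    open its own database session when it runs."""

    def __init__(self, workers=2):
        self.tasks = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            task = self.tasks.get()
            try:
                task()
            finally:
                self.tasks.task_done()

    def submit(self, task):
        self.tasks.put(task)

    def join(self):
        # Block until every submitted task has finished.
        self.tasks.join()
```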