geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Create a "non-data" full release pipeline (ontology metadata curation only) #382

Open. kltm opened this issue 1 month ago.

kltm commented 1 month ago

In order to support the joint pipeline with GOA, we want to create a high-frequency, high-success-rate pipeline that produces all of the data products that GOA needs to complete their parts of the pipeline.

We want to produce:

These products should be delivered in such a way as to enable easy pickup and signalling for GOA.
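As a sketch of what "easy pickup and signalling" could look like, here is one possible mechanism, assuming the products land in an S3 bucket that GOA polls; the bucket name, paths, and flag-file convention are all hypothetical, not anything already decided:

```groovy
// Hypothetical sketch: upload the products, then drop a flag file last,
// so that its presence signals a complete, pickup-ready release to GOA.
pipeline {
    agent any
    stages {
        stage('Publish products for GOA') {
            steps {
                // Copy the data products up first (bucket and paths are placeholders).
                sh 'aws s3 cp products/ s3://example-raw-data-bucket/products/ --recursive'
                // Write the flag file only after the upload succeeds.
                sh 'date -u +%Y-%m-%dT%H:%M:%SZ > RELEASE_DONE'
                sh 'aws s3 cp RELEASE_DONE s3://example-raw-data-bucket/RELEASE_DONE'
            }
        }
    }
}
```

GOA's side would then only need to poll for `RELEASE_DONE` (or subscribe to an S3 event notification) rather than guessing when a run has finished.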

kltm commented 1 month ago

Current discussion is looking at:

kltm commented 1 month ago

Tagging @pgaudet

kltm commented 1 month ago

Considering doing a full "raw data" release, including Zenodo and a CF endpoint. This may actually be easiest, as it mirrors what we already do: basically the first stage of snapshot, plus the second stage's "publish" step. I'll want to check the size and whether Zenodo can digest it; we want this fully automated and on smooth rails. Maybe skip Zenodo, as we will still have the full release there.
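As a rough sketch of that shape (the stage names and make targets here are illustrative placeholders, not the pipeline's actual code):

```groovy
// Illustrative shape only: a cut-down run that mirrors snapshot's first
// stage (fetch upstreams, build first-order products) and then jumps
// straight to the publish step.
pipeline {
    agent any
    stages {
        stage('Fetch and build raw data') {
            steps {
                sh 'make fetch-upstreams'    // hypothetical make target
            }
        }
        stage('Publish') {
            steps {
                sh 'make publish-raw-data'   // hypothetical make target
            }
        }
    }
}
```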

kltm commented 1 month ago

Basing around "raw-data"

kltm commented 3 weeks ago

I'm doing some exploring with a partial run. Looking at what I have, I expect all raw upstreams and first-order products (excluding blazegraph and solr) to run about 10 GB. This puts us well under typical limits for Zenodo and well under our usual publications (which clock in at nearly 50 GB). If working weekly, this would allow us to use a monthly buffer (or an S3 lifecycle policy) or Zenodo as transport without incurring too much overhead or cost.
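If the S3 lifecycle route is taken, a rule along these lines would give a rolling monthly buffer; the rule name and the 35-day window are assumptions for illustration, not decisions:

```json
{
  "Rules": [
    {
      "ID": "expire-weekly-raw-data",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 35 }
    }
  ]
}
```

Applied with `aws s3api put-bucket-lifecycle-configuration --bucket <bucket> --lifecycle-configuration file://lifecycle.json`, this would expire each weekly drop after roughly a month, keeping storage bounded at four or five runs.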

kltm commented 2 weeks ago

From talking to @pgaudet, I think I'll move raw-data.geneontology.org a little closer to where we want it to be by removing "annotations/" and "blazegraph/".
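Assuming raw-data.geneontology.org is backed by an S3 bucket populated by a sync step (the bucket name and source path below are placeholders), the trim could look something like:

```groovy
// Hypothetical sketch: drop the already-published prefixes, then keep
// them out of future syncs with excludes.
pipeline {
    agent any
    stages {
        stage('Trim raw-data site') {
            steps {
                sh 'aws s3 rm s3://example-raw-data-bucket/annotations/ --recursive'
                sh 'aws s3 rm s3://example-raw-data-bucket/blazegraph/ --recursive'
                sh 'aws s3 sync release/ s3://example-raw-data-bucket/ --exclude "annotations/*" --exclude "blazegraph/*"'
            }
        }
    }
}
```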

kltm commented 2 weeks ago

TBD after talking to Alex: the best way to package and communicate our data for remote processing and re-ingest.
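One candidate for the packaging side, pending that conversation: ship a checksummed manifest next to the products so the remote side can verify a complete pickup before re-ingest. A minimal sketch, with the directory layout assumed:

```groovy
// Hypothetical sketch: hash every product file into a single manifest
// that the consumer can verify before re-ingest.
pipeline {
    agent any
    stages {
        stage('Build manifest') {
            steps {
                dir('products') {
                    sh 'find . -type f ! -name manifest.sha256 -print0 | xargs -0 sha256sum > manifest.sha256'
                }
            }
        }
    }
}
```

The consumer would then run `sha256sum -c manifest.sha256` after download and only proceed with re-ingest on a clean check.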