geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Create a "non-data" full release pipeline (ontology metadata curation only) #382

Open kltm opened 4 months ago

kltm commented 4 months ago

In order to support the joint pipeline with GOA, we want to create a high-frequency, high-success pipeline that produces all of the data products that GOA needs to complete their parts of the pipeline.

We want to produce:

In such a way as to enable easy pickup and signalling for GOA (one possible signalling mechanism is sketched below).
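
As a sketch of what "easy pickup and signalling" could look like, a final pipeline stage might publish a small manifest object once all products have landed, giving GOA a single key to poll. This is hypothetical: the stage name, bucket, and manifest format below are illustrative placeholders, not anything the pipeline currently does.

```groovy
// Hypothetical sketch only: signal completion to GOA by publishing a small
// manifest object after all data products have landed. The stage name,
// bucket, and manifest format are placeholders.
stage('Signal GOA') {
    steps {
        sh '''
            # Record what this run produced and when it finished.
            printf '{"release": "%s", "status": "complete"}' "$(date -u +%Y-%m-%d)" > release-manifest.json
            # Upload the manifest last, so its presence implies that all
            # products uploaded before it are in place; GOA can poll this
            # single key to know when to start its half of the pipeline.
            aws s3 cp release-manifest.json s3://example-go-raw-data/release-manifest.json
        '''
    }
}
```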

kltm commented 4 months ago

Current discussion is looking at:

kltm commented 4 months ago

Tagging @pgaudet

kltm commented 4 months ago

Considering doing a full "raw data" release, including Zenodo and a CloudFront endpoint. This may actually be easiest, as it mirrors what we already do: basically the first stage of snapshot, plus the second stage's "publish" step. I'll want to check the size and whether Zenodo can digest it; we want this fully automated and on smooth rails. Maybe skip Zenodo, as we will still have the full release there.
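
The shape described (the first stage of snapshot plus the publish step) might look roughly like the declarative skeleton below. This is a sketch under assumptions: the stage names, script path, and S3 target are placeholders, not the real Jenkinsfile's contents.

```groovy
// Minimal sketch of the shape described above: mirror the upstreams (as the
// first stage of the snapshot pipeline does), then run only the "publish"
// step. Script names and the S3 target are illustrative placeholders.
pipeline {
    agent any
    stages {
        stage('Fetch raw upstreams') {
            steps {
                // Pull the raw upstream source files into the workspace.
                sh './scripts/fetch-upstreams.sh --target ./raw-data'
            }
        }
        stage('Publish') {
            steps {
                // Push the raw products straight to the public endpoint; no
                // ontology build, annotation runs, or Solr/Blazegraph loads.
                sh 'aws s3 sync ./raw-data s3://example-go-raw-data/ --delete'
            }
        }
    }
}
```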

kltm commented 3 months ago

Basing this around "raw-data".

kltm commented 3 months ago

I'm doing some exploring with a partial run. Looking at what I have, I expect all raw upstreams and first-order products (excluding Blazegraph and Solr) to run about 10G. This puts us well under typical limits for Zenodo and our usual publications (which clock in at nearly 50G). If working weekly, this would allow us to use a monthly buffer (possibly managed with an S3 lifecycle policy) or Zenodo as transport without incurring too much overhead or cost.
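
Since the point of the size check is staying under Zenodo's per-record ceiling (50G by default), a guard stage could fail the run before publishing if the products ever grow past it. A minimal sketch, assuming a hypothetical stage name, local path, and threshold:

```groovy
// Hypothetical guard: abort before publishing if the assembled products
// exceed what Zenodo will accept (50G per record by default). The local
// path and the threshold are assumptions for illustration.
stage('Check product size') {
    steps {
        sh '''
            # Total size of the raw products, in whole gigabytes (GNU du).
            size_gb=$(du -s --block-size=1G ./raw-data | cut -f1)
            echo "Raw products total: ${size_gb}G"
            if [ "$size_gb" -gt 50 ]; then
                echo "Products exceed the 50G Zenodo limit; aborting." >&2
                exit 1
            fi
        '''
    }
}
```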

kltm commented 3 months ago

From talking to @pgaudet, I think I'll move raw-data.geneontology.org a little closer to where we want it to be by removing "annotations/" and "blazegraph/".
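
Assuming the endpoint is fronted by an S3 bucket and populated via `aws s3 sync` (the bucket name and local path below are placeholders), the removal could be a one-time delete plus a sync-time exclusion to keep those directories out of future publishes:

```groovy
// Illustrative only: drop "annotations/" and "blazegraph/" from the bucket
// behind raw-data.geneontology.org. Bucket name and local path are
// placeholders, not the real endpoint's configuration.
stage('Publish raw-data') {
    steps {
        sh '''
            # One-time removal of the existing directories from the endpoint.
            aws s3 rm --recursive s3://example-raw-data-bucket/annotations/
            aws s3 rm --recursive s3://example-raw-data-bucket/blazegraph/
            # Keep them out of future publishes by excluding them at sync time.
            aws s3 sync ./raw-data s3://example-raw-data-bucket/ --exclude "annotations/*" --exclude "blazegraph/*"
        '''
    }
}
```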

kltm commented 3 months ago

TBD, after talking to Alex: the best way to package and communicate our data for remote processing and re-ingest.