kltm opened 4 months ago
Current discussion is looking at:
Tagging @pgaudet
Considering doing a full "raw data" release, including Zenodo and a CF endpoint. This may actually be easiest, as it mirrors what we already do--basically the first stage of snapshot, plus the second stage's "publish" step. I'll want to check the size and whether Zenodo can digest it; we want this fully automated and on smooth rails. Maybe skip Zenodo, as we will still have the full release there.
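For concreteness, here's a rough sketch (Python/boto3) of what that "publish" step could look like: mirror the local products directory into an S3 bucket that a CF endpoint fronts. The bucket name, prefix, and directory layout here are placeholders for illustration, not what the pipeline actually uses.

```python
# Sketch of a minimal "publish" step: mirror a local products directory into an
# S3 bucket that a CloudFront (CF) endpoint could front. Bucket, prefix, and
# directory layout are illustrative assumptions, not the pipeline's real config.
from pathlib import Path

import boto3

BUCKET = "go-raw-data-example"     # hypothetical bucket behind the CF endpoint
PREFIX = "raw-data/weekly"         # hypothetical key prefix for a weekly run
LOCAL_PRODUCTS = Path("products")  # raw upstreams + first-order products

s3 = boto3.client("s3")

def publish(local_dir: Path, bucket: str, prefix: str) -> None:
    """Upload every file under local_dir to s3://bucket/prefix/, preserving paths."""
    for path in sorted(local_dir.rglob("*")):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(local_dir).as_posix()}"
            s3.upload_file(str(path), bucket, key)
            print(f"uploaded s3://{bucket}/{key}")

if __name__ == "__main__":
    publish(LOCAL_PRODUCTS, BUCKET, PREFIX)
```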
Basing this around "raw-data".
I'm doing some exploring with a partial run. Looking at what I have, I expect all raw upstreams and first-order products (excluding blazegraph and solr) to run about 10G. This puts us well under typical limits for Zenodo and under our usual publications (which clock in at nearly 50G). If working weekly, this would allow us to use a monthly buffer (or/with an S3 lifecycle rule) or Zenodo as transport without incurring too much overhead or cost.
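As a sketch of the monthly-buffer idea, an S3 lifecycle rule could expire the weekly raw-data uploads after ~35 days, so roughly a month of runs sits in the bucket without manual cleanup. The bucket name, prefix, and retention window below are just assumptions for illustration.

```python
# Sketch of the "monthly buffer" via an S3 lifecycle rule: weekly raw-data
# uploads expire after ~35 days, keeping roughly a month of runs around.
# Bucket name, prefix, and retention window are assumptions.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="go-raw-data-example",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-weekly-raw-data-after-a-month",
                "Filter": {"Prefix": "raw-data/weekly/"},  # hypothetical prefix
                "Status": "Enabled",
                "Expiration": {"Days": 35},  # ~one month of weekly runs
            }
        ]
    },
)
```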
From talking to @pgaudet, I think I'll move raw-data.geneontology.org a little closer to where we want it to be by removing "annotations/" and "blazegraph/".
TBD (after talking to Alex): the best way to package and communicate our data for remote processing and re-ingest.
In order to support the joint pipeline with GOA, we want to create a high-frequency high-success pipeline that produces all of the data products that GOA needs to complete their parts of the pipeline.
We want to produce:
In such a way as to enable easy pickup and signalling for GOA.
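One possible convention for the pickup/signalling piece (just a sketch, not an agreed format): after the payload is uploaded, write a manifest with per-file sizes and checksums, then drop a small completion marker as the last object of the run. GOA could poll for the marker, read the manifest, and verify the payload before re-ingest. Bucket, prefix, and file names below are hypothetical.

```python
# Sketch of a pickup/signalling convention (an assumption, not an agreed format):
# write a manifest with per-file checksums, then a completion marker as the last
# object, so a downstream consumer (e.g. GOA) can detect a finished run and
# verify the payload before re-ingest. All names are hypothetical.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

import boto3

BUCKET = "go-raw-data-example"
PREFIX = "raw-data/weekly"
LOCAL_PRODUCTS = Path("products")

s3 = boto3.client("s3")

def sha256sum(path: Path) -> str:
    """Checksum a file in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest_and_marker(local_dir: Path) -> None:
    files = [p for p in sorted(local_dir.rglob("*")) if p.is_file()]
    manifest = {
        "generated": datetime.now(timezone.utc).isoformat(),
        "files": [
            {
                "key": f"{PREFIX}/{p.relative_to(local_dir).as_posix()}",
                "bytes": p.stat().st_size,
                "sha256": sha256sum(p),
            }
            for p in files
        ],
    }
    # Manifest first, completion marker last, so the marker's presence implies
    # that everything it points to has already been uploaded.
    s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}/manifest.json",
                  Body=json.dumps(manifest, indent=2).encode())
    s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}/DONE",
                  Body=manifest["generated"].encode())

if __name__ == "__main__":
    write_manifest_and_marker(LOCAL_PRODUCTS)
```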