clingen-data-model / genegraph

Presents an RDF triplestore of gene information using GraphQL APIs

Integrate rocksdb snapshot generation into snapshot startup mode #759

Closed by theferrit32 1 year ago

theferrit32 commented 1 year ago

Running genegraph with snapshot as the main function argument should read the streams, populate the snapshot databases, and produce snapshot files.

Ideally these snapshot files should be written into cloud storage in a directory structure tied to the code version and date of the snapshot run.

The snapshot directory should also contain some metadata, such as the offset each stream had been read up to when the snapshot was produced.
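As a rough illustration of the metadata idea, here is a minimal Python sketch of a metadata record written next to the snapshot files. The field names (code_version, produced_at, stream_offsets), the timestamp format, and both function names are assumptions made for this example, not an agreed schema or existing genegraph code.

```python
import json
from datetime import datetime, timezone


def build_snapshot_metadata(code_version, stream_offsets, produced_at=None):
    """Return a dict describing one snapshot run (hypothetical schema)."""
    if produced_at is None:
        produced_at = datetime.now(timezone.utc)
    return {
        "code_version": code_version,
        # UTC timestamp of when the snapshot was produced.
        "produced_at": produced_at.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        ),
        # Last offset read from each stream when the snapshot was produced.
        "stream_offsets": dict(stream_offsets),
    }


def write_snapshot_metadata(path, metadata):
    """Write the metadata dict as pretty-printed JSON to a local path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2, sort_keys=True)
```

Keeping the offsets keyed by stream name would let a later run tell exactly where each stream stood when the snapshot was taken.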

theferrit32 commented 1 year ago

For directory format, I'm thinking:

variations: <bucket>/snapshots/<code-version>/<timestamp>/<variationdescriptorclass>/*.json

<variationdescriptorclass> will be CanonicalVariationDescriptor for all of ClinVar.

clinical_assertion: <bucket>/snapshots/<code-version>/<timestamp>/<statementclass>/*.json

<statementclass> will be a class such as VariationGermlinePathogenicityStatement.
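The two layouts above share the same shape, so one helper could build both. This is only a sketch: the function name snapshot_dir, the compact UTC timestamp format, and the example bucket and version values are assumptions, not part of the proposal.

```python
from datetime import datetime, timezone


def snapshot_dir(bucket, code_version, timestamp, class_name):
    """Build <bucket>/snapshots/<code-version>/<timestamp>/<class> (sketch)."""
    # Assumed timestamp format: fixed-width UTC, e.g. 20230102T030405Z.
    ts = timestamp.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return "/".join([bucket, "snapshots", code_version, ts, class_name])


run_time = datetime(2023, 1, 2, 3, 4, 5, tzinfo=timezone.utc)

# Variations: the descriptor class is the last path segment.
variation_dir = snapshot_dir(
    "example-bucket", "v1.2.3", run_time, "CanonicalVariationDescriptor"
)

# Clinical assertions: the statement class is the last path segment.
assertion_dir = snapshot_dir(
    "example-bucket", "v1.2.3", run_time, "VariationGermlinePathogenicityStatement"
)
```

The *.json files for each class would then be written under the returned directory.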

We could also combine the code version and timestamp into one path segment, like <timestamp>-<code-version>, which would make it easier for a human to filter and sort by most recent, but this might create a lot of clutter.
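To illustrate the sorting benefit of the combined segment: if the timestamp is a fixed-width UTC string (an assumption for this sketch, as is the helper name combined_segment), plain string sorting orders the segments by time regardless of code version.

```python
from datetime import datetime, timezone


def combined_segment(timestamp, code_version):
    """Build a <timestamp>-<code-version> path segment (sketch)."""
    # Fixed-width UTC timestamps sort chronologically as plain strings.
    ts = timestamp.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{ts}-{code_version}"


# Made-up runs, listed out of order and across different code versions.
segments = [
    combined_segment(datetime(2023, 1, 2, 3, 4, 5, tzinfo=timezone.utc), "v1.2.3"),
    combined_segment(datetime(2023, 3, 5, 0, 0, 0, tzinfo=timezone.utc), "v1.2.2"),
    combined_segment(datetime(2022, 12, 31, 23, 59, 59, tzinfo=timezone.utc), "v1.3.0"),
]

# Reverse lexicographic order puts the most recent snapshot first.
most_recent_first = sorted(segments, reverse=True)
```

The trade-off is the one noted above: every run of every version lands in the same directory listing.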

theferrit32 commented 1 year ago

Blocked by https://github.com/clingen-data-model/architecture/issues/548