Integrate rocksdb snapshot generation into snapshot startup mode

theferrit32 commented 1 year ago

Running genegraph with snapshot as the main function argument should read the streams, populate the snapshot dbs, and produce snapshot files.

Ideally these snapshot files should be written into cloud storage in a directory structure tied to the code version and date of the snapshot run.

The snapshot directory should also contain some metadata, such as the offset each stream had been read up to when the snapshot was produced.
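
A minimal sketch of what that metadata could look like, written in Python purely for illustration. The field names (code_version, created_at, stream_offsets), the example topic name, and the idea of storing it as a JSON file are all assumptions, not an agreed format:

```python
import json
from datetime import datetime, timezone


def build_snapshot_metadata(code_version, stream_offsets, created_at=None):
    """Return a dict describing one snapshot run.

    All field names here are hypothetical placeholders.
    stream_offsets maps a stream (topic) name to the last offset
    that was read before the snapshot was produced.
    """
    if created_at is None:
        created_at = datetime.now(timezone.utc)
    return {
        "code_version": code_version,
        "created_at": created_at.isoformat(),
        "stream_offsets": dict(stream_offsets),
    }


def write_snapshot_metadata(path, metadata):
    """Write the metadata dict as pretty-printed JSON to a local path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2, sort_keys=True)


if __name__ == "__main__":
    # "clinvar-raw" is a made-up topic name used only for this example.
    meta = build_snapshot_metadata("abc123", {"clinvar-raw": 42})
    print(json.dumps(meta, indent=2, sort_keys=True))
```

Recording the offsets alongside the files would let a later run tell exactly which stream state a given snapshot reflects.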

theferrit32 commented 1 year ago

For directory format, I'm thinking:

variations: <bucket>/snapshots/<code-version>/<timestamp>/<variationdescriptorclass>/*.json

<variationdescriptorclass> will be CanonicalVariationDescriptor for all of ClinVar.

clinical_assertion: <bucket>/snapshots/<code-version>/<timestamp>/<statementclass>/*.json

The <statementclass> will be values like VariationGermlinePathogenicityStatement, etc.

Could also combine the code version and timestamp into one path segment, like <timestamp>-<code-version>, for easier human filtering and sorting by recency, but this might create a lot of clutter.
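
To make the two layouts concrete, here is a rough sketch in Python (illustration only). The function name snapshot_dir, the combined flag, and the timestamp format are assumptions; the timestamp is expected to already be in UTC:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def snapshot_dir(bucket, code_version, timestamp, class_name, combined=False):
    """Build the directory path for one class of snapshot files.

    combined=False: <bucket>/snapshots/<code-version>/<timestamp>/<class>
    combined=True:  <bucket>/snapshots/<timestamp>-<code-version>/<class>
    """
    # Assumed timestamp format; chosen so names sort chronologically.
    ts = timestamp.strftime("%Y-%m-%dT%H%M%SZ")
    root = PurePosixPath(bucket) / "snapshots"
    if combined:
        return root / f"{ts}-{code_version}" / class_name
    return root / code_version / ts / class_name


if __name__ == "__main__":
    # Bucket name and version string are made up for the example.
    run_time = datetime(2023, 5, 1, 12, 30, 0, tzinfo=timezone.utc)
    print(snapshot_dir("my-bucket", "v1.2.3", run_time,
                       "CanonicalVariationDescriptor"))
    print(snapshot_dir("my-bucket", "v1.2.3", run_time,
                       "VariationGermlinePathogenicityStatement",
                       combined=True))
```

With the combined form, a plain lexical listing of the snapshots directory is already ordered by run time, which is the sorting benefit mentioned above; the cost is one top-level entry per run.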

theferrit32 commented 1 year ago

Blocked by https://github.com/clingen-data-model/architecture/issues/548

clingen-data-model / genegraph

Integrate rocksdb snapshot generation into snapshot startup mode #759