Closed theferrit32 closed 1 year ago
For the directory format, I'm thinking:
variations:
<bucket>/snapshots/<code-version>/<timestamp>/<variationdescriptorclass>/*.json
<variationdescriptorclass> will be CanonicalVariationDescriptor for all of ClinVar.
clinical_assertion:
<bucket>/snapshots/<code-version>/<timestamp>/<statementclass>/*.json
<statementclass> will be things like VariationGermlinePathogenicityStatement, etc.
Could also combine the code-version and timestamp into one path segment, like <timestamp>-<code-version>, for easier human filtering and sorting by most recent, but this might create a lot of clutter.
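To compare the two layouts concretely, here is a minimal Python sketch that builds the object prefix both ways. The function name `snapshot_prefix`, the `combined` flag, and the timestamp format are assumptions made for illustration; nothing here is taken from the genegraph code.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def snapshot_prefix(bucket, code_version, timestamp, class_name, combined=False):
    """Build the prefix under which a snapshot's *.json files would live.

    Hypothetical helper. `timestamp` is expected to be a timezone-aware
    datetime. The compact UTC format is an assumption, chosen so that
    lexical order matches chronological order.
    """
    ts = timestamp.astimezone(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
    root = PurePosixPath(bucket) / "snapshots"
    if combined:
        # Alternative layout: one segment, <timestamp>-<code-version>
        run_dir = root / f"{ts}-{code_version}"
    else:
        # Proposed layout: <code-version>/<timestamp>
        run_dir = root / code_version / ts
    return str(run_dir / class_name)
```

With the combined layout, all runs sit side by side under snapshots/ and sort by time regardless of code version. With the nested layout, runs are grouped by code version first.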
Running genegraph with the main function argument snapshot should read the streams, populate snapshot dbs, and produce snapshot files. Ideally these snapshot files should be written into cloud storage in a directory structure tied to the code version and date of the snapshot run.
The snapshot directory should also have some metadata, such as the offset of the stream(s) it had read up to when the snapshot was produced.
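As a rough illustration of what that metadata could look like, the sketch below builds a small JSON document recording the code version, the snapshot time, and the last offset read per stream. The field names, the `build_snapshot_metadata` helper, and the example stream name are all hypothetical; the actual streams and offset semantics depend on how genegraph consumes them.

```python
import json
from datetime import datetime, timezone


def build_snapshot_metadata(code_version, timestamp, stream_offsets):
    """Return a JSON-serializable description of a snapshot run.

    Hypothetical structure. `stream_offsets` maps a stream name to the
    last offset read before the snapshot files were produced.
    """
    return {
        "code_version": code_version,
        "snapshot_timestamp": timestamp.astimezone(timezone.utc).isoformat(),
        "stream_offsets": dict(sorted(stream_offsets.items())),
    }


# Example values are made up for illustration.
metadata = build_snapshot_metadata(
    "v1.2.3",
    datetime(2023, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
    {"example-stream": 12345},
)
metadata_json = json.dumps(metadata, indent=2)
```

Writing this as something like a metadata.json file next to the per-class directories would let a later run (or a human) tell which stream position a snapshot corresponds to.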