theferrit32 closed 1 year ago
Could you say a little more about why this code exists and how it's intended to work? If I've been following the thread so far, it has to do with repairing the borked sequence of releases from DSP, but some detail would be helpful.
@tnavatar Yes, it is related to updating the diff notification stream when releases are inserted into positions in the past, or when old runs of releases are replaced with newer runs. In one case a release was missed in the original sequence, and in another a release was accidentally re-run, causing an out-of-order diff sequence.
This file describes each change to the sequence of release messages: https://github.com/clingen-data-model/clinvar-streams/blob/diff-stream-repair/stream-repair/broad-dsp-clinvar_release_mappings_FIXED_NOTES.txt
This also gives us a mapping from each release to the bucket directory that holds its files (though this file won't be auto-updated when future releases are received), and a way to regenerate the notification Kafka messages using only this mapping.
The `validate_` functions and code in the root of the module are pretty bespoke for particular things we wanted to check for. But other functions are generalized and useful for validation and monitoring tasks.
This also includes 3 release mapping files:
Given a release mapping file and a gcloud library authorized to read from the specified bucket, the whole history of release notifications can be re-generated and uploaded to a Kafka topic, which is the input to the `clinvar-raw` stream producer.
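As a rough illustration of that regeneration flow, here is a minimal Python sketch. It is not the code in this PR. The mapping file layout (one tab-separated `release_date`, `directory` pair per line), the notification fields, and the names `parse_release_mapping`, `build_notifications`, and `publish` are all assumptions made up for the example. The `send` callable stands in for whatever Kafka producer is actually used, and the bucket read that would confirm each directory exists is left out.

```python
import json
from typing import Callable, Iterable, Iterator, List, Tuple


def parse_release_mapping(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse hypothetical 'release_date<TAB>directory' lines into sorted pairs."""
    mapping = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        release_date, directory = line.split("\t", 1)
        mapping.append((release_date, directory))
    # ISO-8601 dates sort chronologically as plain strings, so sorting here
    # restores release order even if the file lists releases out of order.
    return sorted(mapping)


def build_notifications(
    mapping: Iterable[Tuple[str, str]], bucket: str
) -> Iterator[dict]:
    """Yield one notification per release (field names are assumptions)."""
    for release_date, directory in mapping:
        yield {
            "release_date": release_date,
            "bucket": bucket,
            "release_directory": directory,
        }


def publish(
    notifications: Iterable[dict], send: Callable[[bytes, bytes], None]
) -> int:
    """Send each notification as (key, value) bytes; return the count sent."""
    count = 0
    for notification in notifications:
        key = notification["release_date"].encode("utf-8")
        value = json.dumps(notification).encode("utf-8")
        send(key, value)
        count += 1
    return count


if __name__ == "__main__":
    # Placeholder data; a real run would read one of the mapping files.
    example_lines = ["2020-02-01\tdir_b", "2020-01-01\tdir_a"]
    collected = []
    publish(
        build_notifications(parse_release_mapping(example_lines), "example-bucket"),
        lambda key, value: collected.append((key, value)),
    )
    for key, value in collected:
        print(key, value)
```

Passing `send` in as a callable keeps the sketch independent of any particular Kafka client, and it lets the regenerated sequence be inspected in memory before anything is written to the real topic.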