clingen-data-model / clinvar-streams


Add original and fixed release sequence, and python code for validation and re-generation of notifications #73

Closed theferrit32 closed 1 year ago

theferrit32 commented 1 year ago

The validate_ functions and the code at the root of the module are fairly bespoke, written for particular things we wanted to check. The other functions are more general and are useful for validation and monitoring tasks.

This also includes 3 release mapping files:

  1. The original sequence of release -> dir mappings
  2. The fixed sequence incorporating replaced and inserted releases
  3. The fixed sequence with text notes about what was changed

Given a release mapping file and a gcloud library authorized to read from the specified bucket, the whole history of release notifications can be re-generated and uploaded to a Kafka topic, which is the input to the clinvar-raw stream producer.
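As a rough illustration of that regeneration step, here is a minimal sketch. It is not the code in this PR. It assumes a hypothetical mapping file layout of one `release_date directory` pair per line, and a hypothetical JSON message shape; the function names (`parse_mappings`, `build_notification`, `regenerate`) and the message fields are made up for the example. Reading the bucket and producing to Kafka are left out, so the sketch only builds the message payloads in release order.

```python
import json
from typing import Iterable, Iterator, List, NamedTuple


class ReleaseMapping(NamedTuple):
    release_date: str  # assumed format, e.g. "2020-01-15"
    directory: str     # assumed: directory of the release files in the bucket


def parse_mappings(lines: Iterable[str]) -> Iterator[ReleaseMapping]:
    """Parse 'release_date<whitespace>directory' lines (assumed layout).

    Blank lines and lines starting with '#' are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        release_date, directory = line.split(None, 1)
        yield ReleaseMapping(release_date, directory.strip())


def build_notification(mapping: ReleaseMapping, bucket: str) -> str:
    """Serialize one notification as JSON (hypothetical message shape)."""
    return json.dumps(
        {
            "release_date": mapping.release_date,
            "bucket": bucket,
            "release_directory": mapping.directory,
        },
        sort_keys=True,
    )


def regenerate(lines: Iterable[str], bucket: str) -> List[str]:
    """Build one notification payload per mapping, in file order."""
    return [build_notification(m, bucket) for m in parse_mappings(lines)]
```

Because the output depends only on the mapping file and the bucket name, re-running it over the fixed mapping file would yield the same ordered payloads each time, which is what makes replaying the whole history into a fresh topic practical.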

tnavatar commented 1 year ago

Could you say a little more about why this code exists and how it's intended to work? If I've been following the thread so far, it has to do with repairing the borked sequence of releases from DSP, but some detail would be helpful.

theferrit32 commented 1 year ago

@tnavatar Yes, it is related to repairing the diff notification stream, either by inserting releases at positions in the past or by replacing old runs of releases with newer runs. In one case a release was missed in the original sequence, and in another a release was accidentally re-run, causing an out-of-order diff sequence.
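To make the two failure modes concrete, here is a small sketch of the kind of check that could flag them. It is not one of the validate_ functions from this PR; `find_sequence_problems` is a hypothetical name, and it assumes each release is identified by an ISO-formatted date string. It flags repeated releases and releases that appear after a later-dated one. A missed release cannot be detected from the sequence alone; that would need a comparison against the bucket contents.

```python
from datetime import date
from typing import List, Tuple


def find_sequence_problems(release_dates: List[str]) -> List[Tuple[int, str]]:
    """Return (index, description) for duplicate or out-of-order releases.

    Assumes release_dates are ISO date strings in notification order.
    """
    problems: List[Tuple[int, str]] = []
    seen = set()
    latest = None  # latest release date encountered so far
    for index, text in enumerate(release_dates):
        current = date.fromisoformat(text)
        if current in seen:
            problems.append((index, f"duplicate release {text}"))
        elif latest is not None and current < latest:
            problems.append((index, f"release {text} is out of order"))
        seen.add(current)
        latest = current if latest is None else max(latest, current)
    return problems
```

Tracking the latest date seen, rather than only the previous entry, means a single misplaced release is reported once instead of also flagging the entry that follows it.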

This file describes each change to the sequence of release messages: https://github.com/clingen-data-model/clinvar-streams/blob/diff-stream-repair/stream-repair/broad-dsp-clinvar_release_mappings_FIXED_NOTES.txt

This also gives us a mapping from each release to the directory in the bucket that holds its files (though this file won't be auto-updated when future releases are received), and a way to regenerate the notification Kafka messages using only this mapping.