Closed theferrit32 closed 7 months ago
--- original posting before the main description was updated ---
We are using the variation-oriented files under clinvar_variation on the ClinVar FTP site.
http: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/
ftp: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/
The release_date DSP sets in their ClinVar datasets is based on the file name. We may also want to capture the precise timestamp of when each file was actually created.
The set of existing release files we want from the FTP site is the union of:
archive/*/ClinVarVariationRelease_\d{4}-\d{2}.xml.gz (excluding any with dates earlier than 2019-07-01)
ClinVarVariationRelease_\d{4}-\d{2}.xml.gz (top level of clinvar_variation)
weekly_release/ClinVarVariationRelease_\d{4}-\d{2}\d{2}.xml.gz
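The three patterns above can be applied to a directory listing roughly as follows. This is a minimal sketch, not the DSP implementation: the function name `release_date`, the choice of paths relative to clinvar_variation/, and the convention of mapping a yyyy-mm monthly file to the first day of its month are all assumptions made for illustration.

```python
import re
from datetime import date

# Patterns taken from the list above; paths are relative to clinvar_variation/.
# The glob "*" in "archive/*/" is assumed to mean exactly one directory level.
ARCHIVE = re.compile(r"archive/[^/]+/ClinVarVariationRelease_(\d{4})-(\d{2})\.xml\.gz")
MONTHLY = re.compile(r"ClinVarVariationRelease_(\d{4})-(\d{2})\.xml\.gz")
WEEKLY = re.compile(r"weekly_release/ClinVarVariationRelease_(\d{4})-(\d{2})(\d{2})\.xml\.gz")
CUTOFF = date(2019, 7, 1)  # archive files dated earlier than this are excluded


def _to_date(year, month, day=1):
    """Build a date, returning None for impossible values such as month 13."""
    try:
        return date(int(year), int(month), int(day))
    except ValueError:
        return None


def release_date(path):
    """Return the date encoded in a wanted release path, or None if unwanted."""
    m = WEEKLY.fullmatch(path)
    if m:
        return _to_date(*m.groups())
    m = MONTHLY.fullmatch(path)
    if m:
        # Assumption: a monthly yyyy-mm file is represented by day 1.
        return _to_date(*m.groups())
    m = ARCHIVE.fullmatch(path)
    if m:
        d = _to_date(*m.groups())
        return d if d is not None and d >= CUTOFF else None
    return None
```

Using `fullmatch` rather than `search` keeps unrelated files (checksums, READMEs) from matching by accident.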
The weekly_release directory is the one that receives new files. The other files are only used to initialize the sequence of past releases. When a never-before-seen file appears in the weekly releases, it is the latest release, and gets ingested by the DSP ingest pipeline.
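Detecting a never-before-seen file amounts to a set difference between the current directory listing and the names already recorded. The sketch below assumes the listing and the record of seen names are available as plain collections of file names; `new_releases` is a hypothetical helper, and persisting the seen set is left out.

```python
def new_releases(listing, seen):
    """Return file names present in `listing` but not in `seen`, oldest first.

    Names of the form ClinVarVariationRelease_YYYY-MMDD.xml.gz sort
    lexicographically in chronological order, so a plain sort is enough.
    """
    return sorted(set(listing) - set(seen))
```

Because only additions are reported, the result is unaffected when older files disappear from the directory.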
The weekly releases also contain only the weeks produced in the current month. When the month ends, the last weekly yyyy-mm-dd release in the month is set as the yyyy-mm monthly release for that month in the clinvar_variation directory (and maybe eventually moved to archive). Then the weekly_release directory is cleared and starts receiving weekly releases for the next month. This is why it is important that we capture and durably store the weekly releases. So far DSP has not missed transferring any weekly release XML files to Terra storage since they began the cron job.
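The monthly promotion rule described here (the last weekly release of a month becomes that month's yyyy-mm release) can be expressed as a small grouping step. This is an illustrative sketch only; `monthly_from_weeklies` is a hypothetical name, and the input is assumed to be the weekly release dates already parsed from file names.

```python
from datetime import date


def monthly_from_weeklies(weekly_dates):
    """Map each 'YYYY-MM' label to the last weekly release date in that month."""
    last = {}
    for d in weekly_dates:
        key = f"{d.year:04d}-{d.month:02d}"
        if key not in last or d > last[key]:
            last[key] = d
    return last
```

A mapping like this could help avoid counting the same release twice when both a monthly file and its originating weekly file have been stored.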
@theferrit32 says: FTP and HTTP are our only means of access to https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/weekly_release/
I found this too: https://github.com/DataBiosphere/clinvar-ingest/blob/20f103a35d45cdbed88e7fdd0948553e7570b085/orchestration/templates/ingest-xml-archive.yaml#L10-L17
Closing due to refactoring around clinvar-ingest V2.
Overview
Our goal is to deliver a highly reliable, low-maintenance, near-real-time process that produces a notification message whenever ClinVar posts a new ClinVarVariationRelease_YYYY-MMDD.xml.gz file in its weekly public release directory.
The "ClinVar Release Notification" message should contain the following data
ClinVar's weekly public release directory is located at https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/weekly_release/ and contains files that look like this...
Details
Release frequency requirement
A ClinVar release file name includes only the date. As a consequence, a second ClinVar release within a single day would require overwriting a release file. (Kyle is not sure that has ever happened.) What may occur, however, is that ClinVar processes back-to-back releases on consecutive days but publishes the files on the same day. We should handle that scenario.
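One way to cover both cases (two differently named files appearing on the same day, and a file being overwritten under the same name) is to key each release on its name plus its last-modified timestamp rather than on the name alone. The sketch below is an assumption about how that could look, not an agreed design; `Listed` and `changed_entries` are hypothetical names, and the `modified` value is assumed to be whatever timestamp string the directory listing exposes.

```python
from typing import NamedTuple


class Listed(NamedTuple):
    name: str      # e.g. ClinVarVariationRelease_2022-0403.xml.gz
    modified: str  # last-modified value as reported by the directory listing


def changed_entries(listing, seen):
    """Return entries that are new, or whose timestamp differs from `seen`.

    `seen` maps file name -> last-modified value recorded on an earlier poll.
    Every qualifying entry is returned, so several files posted on the same
    day each produce their own notification.
    """
    return [e for e in listing if seen.get(e.name) != e.modified]
```

An overwritten file keeps its name but changes its timestamp, so it is reported again instead of being silently treated as already seen.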
Why not rely solely on DSP's ClinVar ingest?
DSP's ClinVar ingest service polls for and ingests these same files, but that process runs daily and is out of our control. While we could work with DSP to harden the reliability of their service, we would always be blind to potential issues until after they occur. This service gives us a reasonably low-maintenance way to verify the order and completeness of the datasets identified and processed by the DSP ClinVar service, with the side benefit of push notifications when ClinVar releases new updates to its public dataset.
Pre-production issues have already occurred
The initial loading and automated DSP processing have already produced 3 cases of missed or misordered files. Examples are the 2022-03-30 -> 2022-04-03 -> 2022-03-30 -> 2022-04-13 sequence, where data was diffed based on an incorrect sequence of release files, and the file that was skipped on 2022-06-20. While we can potentially halt processing if we notice misordered dates, we cannot undo the fact that we have already processed the future date before its predecessor. Additionally, we have no way of identifying whether a file was missed completely.
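Both checks can be sketched in a few lines once an independent list of published releases exists. The code below is illustrative only: `find_misordered` and `find_missing` are hypothetical helpers, `processed` is assumed to be the release dates in the order DSP handled them, and `published` is assumed to come from the notification stream proposed in this issue.

```python
from datetime import date


def find_misordered(processed):
    """Return (latest_so_far, offending) pairs for dates processed out of order."""
    problems = []
    latest = None
    for d in processed:
        if latest is not None and d <= latest:
            problems.append((latest, d))
        else:
            latest = d
    return problems


def find_missing(published, processed):
    """Return published release dates that never show up in `processed`."""
    return sorted(set(published) - set(processed))
```

Applied to the sequence quoted above, `find_misordered` flags the second 2022-03-30 as arriving after 2022-04-03. `find_missing` only works with a trusted `published` list, which is the gap the notification stream is meant to fill.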
Initial loading
To keep this service simple, it only needs to monitor the weekly public ClinVar release directory. However, it would be prudent to load the notification stream with the historical releases that have already been processed, so that the message stream is a comprehensive record of the ClinVar release data that ClinGen has processed. This can be done with a manually curated set of messages (see attached list of historical clinvar release data in clinvar_releases_pre_20221027.txt).
Other ideas for use of this service
In addition to the clinvar-raw producer in Kubernetes, a notification message should be sent to a Slack channel such as clingen-alerts saying that a new ClinVar release has appeared. This can be done either by sending it to the Slack API programmatically (that pod would also need to be provided Slack credentials), or by logging a structured message and using Google Cloud Logging to pick it up and send an alert to Slack.
related: https://github.com/clingen-data-model/clinvar-streams/issues/1
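For the structured-logging option, a minimal sketch is to write one JSON object per line to stdout, which the logging agent on GKE typically parses into a structured log entry. The `event` and `file_name` field names and the function name `log_release_alert` are assumptions for illustration; the log-based alert that forwards matching entries to Slack would be configured separately in Google Cloud and is not shown.

```python
import json
import sys


def log_release_alert(file_name, stream=sys.stdout):
    """Write a single-line JSON log entry announcing a new release file."""
    entry = {
        "severity": "NOTICE",
        "message": f"New ClinVar release detected: {file_name}",
        "event": "clinvar_release_detected",  # hypothetical label to filter on
        "file_name": file_name,
    }
    line = json.dumps(entry)
    print(line, file=stream, flush=True)
    return line
```

This route keeps Slack credentials out of the pod, at the cost of depending on alerting configuration that lives outside the code.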