Monitor ClinVar FTP site and provide release notifications

theferrit32 commented 1 year ago

Overview

Our goal is to deliver a highly reliable, low-maintenance, near real-time process that produces a notification message whenever ClinVar posts a new ClinVarVariationRelease_YYYY-MMDD.xml.gz file in its weekly public release directory.

The "ClinVar Release Notification" message should contain the following data:

The filename
The ClinVar directory (should always be /pub/clinvar/xml/clinvar_variation/weekly_release). This is needed for auditability and to distinguish these releases from initially loaded releases that came from other directories.
The ClinVar release date in YYYY-MM-DD format (parsed from the YYYY-MMDD portion of the filename)
The file size
The file's released datetime in YYYY-MM-DD HH:MM:SS format
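
As a rough sketch of the message described above, the five fields could be assembled like this. The class name `ReleaseNotification`, its field names, and the helper `build_notification` are hypothetical; only the directory, the filename pattern, and the date formats come from the description.

```python
import re
from dataclasses import asdict, dataclass
from datetime import date

# Directory named in the requirements above.
WEEKLY_DIR = "/pub/clinvar/xml/clinvar_variation/weekly_release"

# Weekly files look like ClinVarVariationRelease_2022-1009.xml.gz (YYYY-MMDD).
WEEKLY_RE = re.compile(r"^ClinVarVariationRelease_(\d{4})-(\d{2})(\d{2})\.xml\.gz$")


def parse_release_date(filename: str) -> str:
    """Return the release date as YYYY-MM-DD, or raise ValueError."""
    match = WEEKLY_RE.match(filename)
    if match is None:
        raise ValueError(f"not a weekly release file: {filename!r}")
    year, month, day = (int(part) for part in match.groups())
    # date() rejects impossible values such as month 13.
    return date(year, month, day).isoformat()


@dataclass(frozen=True)
class ReleaseNotification:  # hypothetical name and field names
    filename: str
    directory: str
    release_date: str  # YYYY-MM-DD, parsed from the filename
    file_size: int  # bytes
    released: str  # YYYY-MM-DD HH:MM:SS


def build_notification(filename: str, file_size: int, released: str) -> dict:
    """Assemble the five fields into a plain dict ready for serialization."""
    return asdict(
        ReleaseNotification(
            filename=filename,
            directory=WEEKLY_DIR,
            release_date=parse_release_date(filename),
            file_size=file_size,
            released=released,
        )
    )
```

Files such as ClinVarVariationRelease_00-latest_weekly.xml.gz do not match the pattern, so they raise ValueError rather than producing a notification.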

ClinVar's weekly public release directory is located at https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/weekly_release/ and contains files that look like this...

filename                                                      size  released             last modified
ClinVarVariationRelease_00-latest_weekly.xml.gz      2,100,131,203  2022-10-24 17:46:13  2022-10-24 17:46:13
ClinVarVariationRelease_00-latest_weekly.xml.gz.md5            139  2022-10-24 17:46:38  2022-10-24 17:46:38
ClinVarVariationRelease_2022-1009.xml.gz             2,095,778,213  2022-10-11 11:51:22  2022-10-11 11:51:22
ClinVarVariationRelease_2022-1009.xml.gz.md5                   132  2022-10-11 11:51:48  2022-10-11 11:51:48
ClinVarVariationRelease_2022-1015.xml.gz             2,098,790,558  2022-10-16 12:29:20  2022-10-16 12:29:20
ClinVarVariationRelease_2022-1015.xml.gz.md5                   132  2022-10-16 12:29:43  2022-10-16 12:29:43
ClinVarVariationRelease_2022-1022.xml.gz             2,100,131,203  2022-10-24 17:46:13  2022-10-24 17:46:13
ClinVarVariationRelease_2022-1022.xml.gz.md5                   132  2022-10-24 17:46:33  2022-10-24 17:46:33
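
A minimal sketch of turning one row of the listing above into structured values, assuming rows arrive as whitespace-separated text exactly as shown (the real HTTP or FTP listing may be formatted differently and would need its own parser). The function names are hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_listing_row(row: str) -> dict:
    """Parse 'name size date time date time' into typed fields."""
    parts = row.split()
    if len(parts) != 6:
        raise ValueError(f"unexpected listing row: {row!r}")
    name, size, rel_date, rel_time, mod_date, mod_time = parts
    return {
        "filename": name,
        # Sizes are shown with thousands separators, e.g. 2,095,778,213.
        "size": int(size.replace(",", "")),
        "released": datetime.strptime(f"{rel_date} {rel_time}", TIMESTAMP_FORMAT),
        "last_modified": datetime.strptime(f"{mod_date} {mod_time}", TIMESTAMP_FORMAT),
    }


def is_dated_release(filename: str) -> bool:
    """Skip .md5 checksum files and the 00-latest_weekly alias."""
    return filename.endswith(".xml.gz") and "_00-latest" not in filename
```

Note that in this listing the 00-latest_weekly file has the same size and timestamp as the 2022-1022 file, which is why the sketch treats it as an alias and skips it.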

Details

The purpose of this service is to provide a source of truth for when ClinVar releases new weekly files.
These files are the input to the ClinGen ClinVar ingest service created by DSP.
That ingest service is the source for all ClinVar representations managed by ClinGen.
ClinGen aims to offer an accurate and up-to-date representation of ClinVar, along with additional annotations and curations that improve the utility of the ClinVar data.
ClinGen's management of ClinVar data cannot risk getting out of sync because a weekly release was missed or processed out of order.

Release frequency requirement

A ClinVar release file includes only the date in its name. As a consequence, a second ClinVar release within a single day would require overwriting a release file. (Kyle is not sure that has ever happened.) What may occur, however, is that ClinVar processes back-to-back releases on consecutive days but publishes both files on the same day. We should handle that scenario.
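
One way to cover both scenarios, sketched under the assumption that the monitor keeps a set of (filename, released) pairs it has already notified on. The function name and the dict keys are hypothetical.

```python
def find_new_releases(seen: set, listing: list) -> list:
    """Return listing entries not yet notified, oldest release first.

    seen holds (filename, released) pairs that were already notified.
    Keying on both values means a file overwritten under the same name
    with a new timestamp is reported again instead of being ignored.
    """
    fresh = [e for e in listing if (e["filename"], e["released"]) not in seen]
    # Filenames embed YYYY-MMDD, so sorting by name sorts by release date.
    return sorted(fresh, key=lambda e: (e["filename"], e["released"]))
```

If two files appear between polls, both are returned in release-date order, so downstream consumers see the older release first even when the files were published on the same day.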

Why not rely solely on DSP's ClinVar ingest?

DSP's ClinVar ingest service polls and ingests these same files, but that process runs daily and is out of our control. While we could work with DSP to harden the reliability of their service, we would always be blind to potential issues until after they occur. This service gives us a reasonably low-maintenance way to verify the order and completeness of the datasets identified and processed by the DSP ClinVar service, with the side benefit of push notifications when ClinVar releases updates to their public dataset.

Pre-production issues have already occurred

The initial loading and automated DSP processing have already produced 3 cases of missed or misordered files. Examples include data diffed against an incorrect sequence of release files (2022-03-30 -> 2022-04-03 -> 2022-03-30 -> 2022-04-13) and a file that was skipped on 2022-06-20. While we can potentially halt processing if we notice misordered dates, we cannot undo the fact that we have already processed the later date before its predecessor. Additionally, we have no way of identifying whether a file was missed completely.
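
With a notification stream in place, both kinds of problem could be checked by comparing the notified release dates against the order in which releases were actually processed. This is only a sketch; the function name and its inputs (lists of YYYY-MM-DD strings) are assumptions.

```python
def audit_processing(notified: list, processed: list) -> tuple:
    """Compare notified release dates with the processing history.

    notified:  release dates (YYYY-MM-DD) in the order they were announced.
    processed: release dates in the order the ingest processed them.
    Returns (missed, out_of_order).
    """
    processed_set = set(processed)
    # Announced but never processed.
    missed = [d for d in notified if d not in processed_set]
    # ISO dates compare correctly as strings; flag any step that does
    # not move strictly forward in time (including repeats).
    out_of_order = [
        (prev, cur) for prev, cur in zip(processed, processed[1:]) if cur <= prev
    ]
    return missed, out_of_order
```

Applied to the sequence quoted above, the step from 2022-04-03 back to 2022-03-30 is the pair that would be flagged as out of order.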

Initial loading

To keep this service simple, we only need it to monitor the weekly public ClinVar release directory. However, it would be prudent to load the notification stream with the historical releases that have already been processed, so that the message stream is a comprehensive record of the ClinVar releases ClinGen has processed. This can be done with a manually curated set of messages (see the attached list of historical ClinVar release data in clinvar_releases_pre_20221027.txt).

Other ideas for use of this service

In addition to the clinvar-raw producer in Kubernetes, a notification message should be sent to a Slack channel such as clingen-alerts saying that a new ClinVar release has appeared. This can be done either by sending it to the Slack API programmatically (that pod would also need to be provided Slack credentials), or by logging a structured message and using Google Cloud Logging to pick it up and send an alert to Slack.
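
For the logging route, a minimal sketch could write one JSON object per line to stdout. It assumes the cluster's logging agent parses single-line JSON into structured fields and treats `severity` and `message` specially; the `event` value and the shape of the `release` payload are hypothetical, and the alert-to-Slack wiring is not shown.

```python
import json
import sys


def format_log_line(notification: dict) -> str:
    """Serialize a release notification as a single-line JSON log entry."""
    entry = {
        "severity": "NOTICE",
        "message": f"New ClinVar release: {notification['filename']}",
        "event": "clinvar_release_notification",  # hypothetical marker to filter on
        "release": notification,
    }
    return json.dumps(entry, sort_keys=True)


def log_release(notification: dict) -> None:
    """Emit the entry on stdout, where a log agent can collect it."""
    print(format_log_line(notification), file=sys.stdout, flush=True)
```

A log-based alert could then filter on the `event` field, which avoids giving the pod Slack credentials.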

theferrit32 commented 1 year ago

NOTE: I added all the key points from below to the main description above. Please use that information as the source of truth.

--- original posting before the main description was updated ---

We are using the variation-oriented files under clinvar_variation on the ClinVar FTP site.
http: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/
ftp: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/

The release_date DSP sets in their ClinVar datasets is based on the name of the file. But we may also want to capture the precise timestamp of when the file was actually created.

The set of existing release files we want from the FTP site is the union of:

archive/*/ClinVarVariationRelease_\d{4}-\d{2}.xml.gz (excluding any with dates earlier than 2019-07-01)
ClinVarVariationRelease_\d{4}-\d{2}.xml.gz (top level of clinvar_variation)
weekly_release/ClinVarVariationRelease_\d{4}-\d{2}\d{2}.xml.gz
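
The union above could be expressed as a small classifier over paths relative to clinvar_variation. This is a sketch: the literal dots are escaped here, the archive subdirectory is assumed to be exactly one level deep (from the `archive/*/` glob), and the function name and labels are hypothetical.

```python
import re

ARCHIVE_RE = re.compile(r"^archive/[^/]+/ClinVarVariationRelease_(\d{4})-(\d{2})\.xml\.gz$")
MONTHLY_RE = re.compile(r"^ClinVarVariationRelease_(\d{4})-(\d{2})\.xml\.gz$")
WEEKLY_RE = re.compile(r"^weekly_release/ClinVarVariationRelease_(\d{4})-(\d{2})(\d{2})\.xml\.gz$")

# Archive files dated before 2019-07 are excluded.
EARLIEST_ARCHIVE = (2019, 7)


def classify(path: str):
    """Return 'weekly', 'monthly', 'archive', or None if the path is not wanted."""
    if WEEKLY_RE.match(path):
        return "weekly"
    if MONTHLY_RE.match(path):
        return "monthly"
    match = ARCHIVE_RE.match(path)
    if match and (int(match.group(1)), int(match.group(2))) >= EARLIEST_ARCHIVE:
        return "archive"
    return None
```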

The weekly_release directory is the one that receives new files. The other files are only used to initialize the sequence of past releases. When a never-before-seen file appears in weekly_release, it is the latest release and gets ingested by the DSP ingest pipeline.

The weekly_release directory also only contains the releases produced in the current month. When the month ends, the last weekly yyyy-mm-dd release of the month becomes the yyyy-mm monthly release for that month in the clinvar_variation directory (and may eventually be moved to archive). Then the weekly_release directory is cleared and starts receiving weekly releases for the next month. This is why it is important that we capture and durably store the weekly releases. So far we (DSP) have not missed transferring any weekly release XML files to Terra storage since the cron job began.

tbl3rd commented 1 year ago

@theferrit32 says: FTP and HTTP are our only access to https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/weekly_release/ I found this too. https://github.com/DataBiosphere/clinvar-ingest/blob/20f103a35d45cdbed88e7fdd0948553e7570b085/orchestration/templates/ingest-xml-archive.yaml#L10-L17

toneillbroad commented 7 months ago

Closing due to refactoring around clinvar-ingest V2.

clingen-data-model / clinvar-streams

Monitor ClinVar FTP site and provide release notifications #70
