clingen-data-model / clinvar-ingest

Apache License 2.0

Write a function in clinvar-ingest to find the proper BQ dataset to write tables to #196

Closed toneillbroad closed 3 weeks ago

toneillbroad commented 1 month ago

The ClinVar VCV and RCV ingests will be invoked by their respective FTP watcher processes, with the release date derived from each file's name.

For example:

ClinVarVariationRelease_2024-0611.xml.gz -> Release Date: "2024-06-11"
ClinVarRCVRelease_2024-0610.xml.gz -> Release Date: "2024-06-10"
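The filename-to-date mapping above could be sketched as follows. The function name `release_date_from_filename` is hypothetical, and the regular expression is only inferred from the two example filenames, so it may need adjusting if other naming variants exist on the FTP site.

```python
import re
from datetime import date

# Assumption: pattern inferred from the two example filenames in this issue
# (YYYY-MMDD after the underscore, VCV files named "Variation", RCV files "RCV").
_RELEASE_RE = re.compile(
    r"^ClinVar(?:Variation|RCV)Release_(\d{4})-(\d{2})(\d{2})\.xml\.gz$"
)


def release_date_from_filename(filename: str) -> date:
    """Return the release date encoded in a ClinVar release filename."""
    match = _RELEASE_RE.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized ClinVar release filename: {filename!r}")
    year, month, day = (int(part) for part in match.groups())
    # date() raises ValueError for impossible dates such as month 13.
    return date(year, month, day)
```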

The implication is that the parsed files, which are the basis of the external BQ tables, are written to a GCP bucket path that includes the release date:

ClinVarVariationRelease_2024-0611.xml.gz -> gs://clinvar-ingest/executions/clinvar_2024_06_11_v1_0_0_alpha

Because the VCV and RCV files above have different release dates, we need to ensure that the parsed files are imported into the proper BQ dataset, since the dataset name contains the date:

ClinVarVariationRelease_2024-0611.xml.gz -> Dataset ID: clingen-dev.clinvar_2024_06_11_v1_0_0_alpha
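A minimal sketch of how both names might be derived from a release date. The helper names are hypothetical, and the `v1_0_0_alpha` suffix and `clingen-dev` project are simply copied from the examples in this issue; how the real code chooses them is not stated here.

```python
from datetime import date

# Assumptions: suffix, bucket prefix, and project are taken from the examples above.
DEFAULT_SUFFIX = "v1_0_0_alpha"
EXECUTIONS_PREFIX = "gs://clinvar-ingest/executions"


def dataset_name(release_date: date, suffix: str = DEFAULT_SUFFIX) -> str:
    """Build a name such as clinvar_2024_06_11_v1_0_0_alpha."""
    return f"clinvar_{release_date:%Y_%m_%d}_{suffix}"


def execution_path(release_date: date, suffix: str = DEFAULT_SUFFIX) -> str:
    """Bucket path where the parsed files for this release are written."""
    return f"{EXECUTIONS_PREFIX}/{dataset_name(release_date, suffix)}"


def dataset_id(release_date: date, project: str = "clingen-dev",
               suffix: str = DEFAULT_SUFFIX) -> str:
    """Fully qualified BQ dataset ID in project.dataset form."""
    return f"{project}.{dataset_name(release_date, suffix)}"
```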

A fundamental tenet of the RCV and VCV ingests is that neither depends on one file always arriving first and the other second; rather, the files should be processed in the order in which they appear on the ClinVar FTP site.

Therefore, the strategy for determining the proper BQ dataset to ingest the parsed files into is as follows:

  1. When a new ingest file (either RCV or VCV) is detected on the ClinVar FTP site, it is parsed as normal into the gs://clinvar-ingest bucket.
  2. When attempting to import the parsed file(s) into a BQ dataset, both ingest processes (RCV and VCV) need to check whether a BQ dataset already exists with a date within one day before or after the release date of the file being processed.
  3. If no such dataset exists, the current process should create one and continue processing in that dataset.
  4. When the second ingest process comes along, it should perform the same one-day forward/back search for a dataset based on the release date of its file. It should find the BQ dataset of the previous ingest and use that as the dataset to process in.
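The steps above can be sketched as a pure function over a list of existing dataset names, which keeps the lookup testable without a BigQuery connection. The function names, the `window_days` parameter, and the dataset-name pattern are assumptions for illustration; in practice the list of names would presumably come from the BigQuery client (for example the `dataset_id` of each item returned by `Client.list_datasets()` in `google-cloud-bigquery`).

```python
import re
from datetime import date
from typing import Iterable, Optional, Tuple

# Assumption: dataset names look like clinvar_YYYY_MM_DD_<suffix>.
_DATASET_RE = re.compile(r"^clinvar_(\d{4})_(\d{2})_(\d{2})_(.+)$")


def find_nearby_dataset(release_date: date, existing: Iterable[str],
                        suffix: str = "v1_0_0_alpha",
                        window_days: int = 1) -> Optional[str]:
    """Return an existing dataset dated within window_days of release_date."""
    candidates = []
    for name in existing:
        match = _DATASET_RE.match(name)
        if match is None or match.group(4) != suffix:
            continue  # not a ClinVar dataset for this version
        try:
            ds_date = date(*(int(g) for g in match.groups()[:3]))
        except ValueError:
            continue  # digits that do not form a real date
        distance = abs((ds_date - release_date).days)
        if distance <= window_days:
            candidates.append((distance, name))
    # Prefer the closest date; ties are broken by name so the result is stable.
    return min(candidates)[1] if candidates else None


def resolve_dataset(release_date: date, existing: Iterable[str],
                    suffix: str = "v1_0_0_alpha") -> Tuple[str, bool]:
    """Return (dataset_name, needs_creation) following steps 2-4."""
    found = find_nearby_dataset(release_date, existing, suffix)
    if found is not None:
        return found, False  # second ingest reuses the first one's dataset
    return f"clinvar_{release_date:%Y_%m_%d}_{suffix}", True
```

With these assumptions, an RCV file dated 2024-06-10 arriving after the 2024-06-11 VCV dataset exists would resolve to `clinvar_2024_06_11_v1_0_0_alpha` with `needs_creation` false. The sketch does not address two ingests racing to create a dataset at the same moment.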

Perhaps we should consider appending "_vcv" or "_rcv" to the parsed data paths under gs://clinvar-ingest/executions:

gs://clinvar-ingest/executions/clinvar_2024_06_24_v1_0_0_alpha -> gs://clinvar-ingest/executions/clinvar_2024_06_24_v1_0_0_alpha_vcv

theferrit32 commented 1 month ago

We need to think a little bit about what to do when a "fix" release is published by ClinVar. For example, sometimes they publish a release again the next day to censor PHI or fix data published the day before. (I think I've also seen them publish a fix release within the same day.)

I'm not sure if the FTP watcher will detect the fix file as being a "new" file to be processed.

I think the "fix" release, even if published the next day, will have the same ReleaseDate and file name as the release it is fixing, but the modified date in the file metadata will carry the new timestamp.
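If that assumption holds, a watcher could tell a fix apart from an already-processed file by comparing the modified timestamp against what it recorded last time. This is only a sketch: the `classify_file` name and the `processed` mapping (filename to last processed modified time) are hypothetical, and nothing here reflects how the existing FTP watcher actually tracks files.

```python
from datetime import datetime
from typing import Dict


def classify_file(filename: str, modified: datetime,
                  processed: Dict[str, datetime]) -> str:
    """Classify an FTP listing entry as 'new', 'fix', or 'seen'.

    processed maps filenames to the modified timestamp that was last ingested.
    """
    previous = processed.get(filename)
    if previous is None:
        return "new"   # never ingested under this name
    if modified > previous:
        return "fix"   # same name, newer timestamp: likely a re-published release
    return "seen"      # unchanged since the last ingest
```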

toneillbroad commented 3 weeks ago

Closing because this has been re-issued as this epic: https://app.zenhub.com/workspaces/genegraphdxclinvar-60340fb9898dae001107e94e/issues/gh/clingen-data-model/clinvar-curation-input-tool/85