clingen-data-model / clinvar-ingest

Apache License 2.0

Add final dataset name to the processing_history table after bq-ingest runs #233

Closed theferrit32 closed 8 hours ago

theferrit32 commented 1 month ago

Add a column to processing_history to store the name of the final BQ dataset a set of pipeline output files was loaded into.

For example, ingesting clinvar_vcv_2024_10_10_kyle_dev and clinvar_rcv_2024_10_10_kyle_dev, each with xml_release_date=2024-10-10, creates a dataset called clinvar_2024_10_10_kyle_dev. This name is included in the Slack message but is not persisted in the processing_history table. It can be inferred with the same logic bq-ingest uses to choose it, but persisting it in a column makes it easy to look up from queries.
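
As a rough illustration of the inference mentioned above, here is a minimal sketch. It assumes the final dataset name is always `clinvar_<YYYY_MM_DD>_<pipeline_version>`, which is only a guess from the single example in this issue; the real logic lives in bq-ingest and may differ. The helper name `final_dataset_name` is hypothetical.

```python
from datetime import date


def final_dataset_name(xml_release_date: date, pipeline_version: str) -> str:
    """Hypothetical helper, not the bq-ingest implementation.

    Assumes the naming pattern clinvar_<YYYY_MM_DD>_<pipeline_version>,
    inferred from the example in this issue.
    """
    day = xml_release_date.strftime("%Y_%m_%d")
    return f"clinvar_{day}_{pipeline_version}"


print(final_dataset_name(date(2024, 10, 10), "kyle_dev"))
# prints: clinvar_2024_10_10_kyle_dev
```

Having to repeat this derivation in every query is what a persisted column would avoid.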

Store it in both the vcv and the rcv rows.

This field can be set during the same UPDATE that runs to set the release_date field at the end of bq-ingest.

    UPDATE clingen-dev.clinvar_kyle.processing_history
    SET release_date = '2024-10-10',
        final_dataset = 'clinvar_2024_10_10_kyle_dev'
    WHERE file_type = 'rcv'
    AND pipeline_version = 'kyle_dev'
    AND xml_release_date = '2024-10-10'
    AND bucket_dir = 'clinvar_rcv_2024_10_10_kyle_dev'
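
Since the value should land in both the vcv and the rcv row, the same statement would run once per file type. The sketch below builds one parameterized UPDATE per file type. It is only an illustration: `build_updates` is a hypothetical helper, the `bucket_dir` pattern is assumed from the example names in this issue, and passing dates as ISO strings assumes the columns accept them. The `@name` placeholders follow BigQuery's named query parameter syntax; executing the statements is left to whatever client bq-ingest already uses.

```python
from datetime import date

# Table name taken from the example above.
TABLE = "clingen-dev.clinvar_kyle.processing_history"

UPDATE_SQL = f"""
UPDATE `{TABLE}`
SET release_date = @release_date,
    final_dataset = @final_dataset
WHERE file_type = @file_type
  AND pipeline_version = @pipeline_version
  AND xml_release_date = @xml_release_date
  AND bucket_dir = @bucket_dir
"""


def build_updates(release_date: date, pipeline_version: str, final_dataset: str):
    """Hypothetical helper: return (sql, params) pairs for the vcv and rcv rows."""
    day = release_date.strftime("%Y_%m_%d")
    updates = []
    for file_type in ("vcv", "rcv"):
        params = {
            "release_date": release_date.isoformat(),
            "xml_release_date": release_date.isoformat(),
            "final_dataset": final_dataset,
            "file_type": file_type,
            "pipeline_version": pipeline_version,
            # Assumed pattern: clinvar_<file_type>_<YYYY_MM_DD>_<pipeline_version>
            "bucket_dir": f"clinvar_{file_type}_{day}_{pipeline_version}",
        }
        updates.append((UPDATE_SQL, params))
    return updates


updates = build_updates(date(2024, 10, 10), "kyle_dev", "clinvar_2024_10_10_kyle_dev")
for _sql, params in updates:
    print(params["file_type"], params["bucket_dir"])
```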