clingen-data-model / clinvar-ingest

Apache License 2.0

Validate record counts between DSP ClinVar ingest and ClinGen Ingest #139

Open toneillbroad opened 1 month ago

toneillbroad commented 1 month ago

We need to ensure that the new ClinGen ClinVar ingest is producing the same volume of data as the current DSP ClinVar ingest. This should be done by verifying the record counts in the BQ tables:

toneillbroad commented 1 month ago

SELECT table_id, row_count FROM clingen-dev.clinvar_2024_05_19_v1_0_0_alpha.__TABLES__

clinical_assertion: 4391326
clinical_assertion_observation: 4446560
clinical_assertion_trait: 5069115
clinical_assertion_trait_set: 4475924
gene: 5265624
gene_association: 5265624
rcv_accession: 3899541
submission: 4391326
submitter: 4391326
trait: 4246359
trait_mapping: 5059749
trait_set: 3899535
variation: 2969292
variation_archive: 2969292

=====

SELECT table_id, row_count FROM clingen-stage.clinvar_2024_05_19_v1_6_62.__TABLES__

clinical_assertion: 4391326
clinical_assertion_observation: 4446560
clinical_assertion_trait: 5069115
clinical_assertion_trait_set: 4475924
clinical_assertion_variation: 4403506
datarepo_row_ids: 43105776
gene: 92252
gene_association: 5265624
processing_history: 1
rcv_accession: 3899541
submission: 17335
submitter: 2867
trait: 20727
trait_mapping: 5059749
trait_set: 23393
variation: 2969292
variation_archive: 2968563
xml_archive: 1

The following tables are discrepant: clinical_assertion_variation, gene, submission, submitter, trait, trait_set, variation_archive.
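To make this comparison repeatable, the two `__TABLES__` results can be diffed with a few lines of Python. This is only a sketch: the function name `compare_row_counts` and the dictionary names are hypothetical, and the counts are copied by hand from the two query results above rather than fetched from BigQuery.

```python
# Hypothetical helper: diff two {table_id: row_count} mappings.
def compare_row_counts(first, second):
    """Return (mismatched, only_in_first, only_in_second)."""
    shared = first.keys() & second.keys()
    # Tables present in both datasets whose row counts differ.
    mismatched = {
        table: (first[table], second[table])
        for table in sorted(shared)
        if first[table] != second[table]
    }
    only_in_first = sorted(first.keys() - second.keys())
    only_in_second = sorted(second.keys() - first.keys())
    return mismatched, only_in_first, only_in_second


# Counts from clingen-dev.clinvar_2024_05_19_v1_0_0_alpha.__TABLES__
alpha_counts = {
    "clinical_assertion": 4391326,
    "clinical_assertion_observation": 4446560,
    "clinical_assertion_trait": 5069115,
    "clinical_assertion_trait_set": 4475924,
    "gene": 5265624,
    "gene_association": 5265624,
    "rcv_accession": 3899541,
    "submission": 4391326,
    "submitter": 4391326,
    "trait": 4246359,
    "trait_mapping": 5059749,
    "trait_set": 3899535,
    "variation": 2969292,
    "variation_archive": 2969292,
}

# Counts from clingen-stage.clinvar_2024_05_19_v1_6_62.__TABLES__
v1_6_62_counts = {
    "clinical_assertion": 4391326,
    "clinical_assertion_observation": 4446560,
    "clinical_assertion_trait": 5069115,
    "clinical_assertion_trait_set": 4475924,
    "clinical_assertion_variation": 4403506,
    "datarepo_row_ids": 43105776,
    "gene": 92252,
    "gene_association": 5265624,
    "processing_history": 1,
    "rcv_accession": 3899541,
    "submission": 17335,
    "submitter": 2867,
    "trait": 20727,
    "trait_mapping": 5059749,
    "trait_set": 23393,
    "variation": 2969292,
    "variation_archive": 2968563,
    "xml_archive": 1,
}

mismatched, only_in_alpha, only_in_v1_6_62 = compare_row_counts(
    alpha_counts, v1_6_62_counts
)

for table, (alpha_rows, v1_6_62_rows) in mismatched.items():
    print(f"{table}: {alpha_rows} vs {v1_6_62_rows}")
print("only in v1_6_62:", ", ".join(only_in_v1_6_62))
```

Tables that appear in only one dataset are reported separately from count mismatches, so bookkeeping tables such as datarepo_row_ids and processing_history do not get mixed in with genuine differences.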

toneillbroad commented 3 weeks ago

clinical_assertion_variation table - The ClinGen ingest is missing 12 records compared to DSP. This is due to the defect described in the Genotype processing ticket. This needs to be fixed.

variation_archive table - ClinGen has 728 more records than DSP. All 728 are variants that are part of a Haplotype or Genotype, which Larry told DSP not to process. He is OK with us including the records, so this will not be fixed.