Closed by melissacline 7 years ago
And in the same package, the Germline_or_Somatic_BIC field:
Germline_or_Somatic_BIC variant chr17:43106472:TC>T major change: G -
Germline_or_Somatic_BIC variant chr17:43106488:CT>C major change: G -
Germline_or_Somatic_BIC variant chr17:43106523:GC>G major change: G -
Germline_or_Somatic_BIC variant chr17:43115729:CA>C major change: G -
Germline_or_Somatic_BIC variant chr17:43115746:CTT>C major change: G -
Germline_or_Somatic_BIC variant chr17:43115775:CCA>C major change: G -
Germline_or_Somatic_BIC variant chr17:43115789:TA>T major change: G -
I do see these changes in the built.tsv and aggregated.tsv files you sent to me, but after creating those same files myself, I see no changes for Germline_or_Somatic_BIC or Clinical_classification_BIC. I made some changes to variant-merging.py and brca_pseudonym_generator.py that may have resolved those issues.
Please disregard my previous comment: I see that aggregated.tsv is from the old dataset and built.tsv is from the new dataset. Interestingly, running releaseDiff.py against the new aggregated.tsv and built.tsv found several ClinVar classification changes, e.g.:
Clinical_Significance_ClinVar variant chr17:43051061:ACCT>AATGTTG major change: - Pathogenic
Will explore further.
Some updates:
Regarding Germline_or_Somatic_BIC
After tracking chr17:43106472:TC>T all the way back to the BIC VCF file, it looks like two variants at position 43106472 get merged into a single variant at position 43106471 and then make their way into built.tsv as chr17:43106470:A>AT with the expected Germline_or_Somatic_BIC value of G. Meanwhile, chr17:43106472:TC>T looks like it was derived from ClinVar and does not have a Germline_or_Somatic_BIC value anywhere from the original VCF through the merging process. As a result, the new built.tsv file contains two separate variants, one from BIC and one from ClinVar.
In the old data, chr17:43106470:A>AT does not exist, but chr17:43106472:TC>T is a single variant derived from both BIC and ClinVar.
It seems that every variant listed above is an example of a variant that was being merged in the old data but is not merged in the new data. I don't have enough information to know if any of these variants should be merged or not.
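One way to get more information on whether two records should merge is to apply each variant to the reference sequence and compare the resulting haplotypes: two records describe the same edit only if they produce the same sequence. The sketch below is illustrative only. It uses a made-up toy reference (`TOY_REF`, `TOY_START`) and hypothetical helper names, not the logic in variant-merging.py, and it assumes the chrom:pos:ref>alt string form shown in the releaseDiff.py output with 1-based positions.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int   # 1-based genomic position of the first ref base (assumption)
    ref: str
    alt: str


def parse_variant(text):
    """Parse a 'chrom:pos:ref>alt' string into a Variant."""
    chrom, pos, alleles = text.split(":")
    ref, alt = alleles.split(">")
    return Variant(chrom, int(pos), ref, alt)


def apply_variant(variant, ref_seq, seq_start):
    """Return ref_seq with the variant applied.

    seq_start is the genomic position of ref_seq[0].
    """
    idx = variant.pos - seq_start
    if idx < 0 or ref_seq[idx:idx + len(variant.ref)] != variant.ref:
        raise ValueError(f"ref allele of {variant} does not match the reference")
    return ref_seq[:idx] + variant.alt + ref_seq[idx + len(variant.ref):]


def same_edit(a, b, ref_seq, seq_start):
    """True if both variants yield the same sequence on the same chromosome."""
    if a.chrom != b.chrom:
        return False
    return apply_variant(a, ref_seq, seq_start) == apply_variant(b, ref_seq, seq_start)


# Toy data, not real BRCA1 sequence: a run of three T's starting at position 102.
TOY_REF = "GATTTC"
TOY_START = 100

del_left = parse_variant("chrT:101:AT>A")    # delete one T, written at the left
del_right = parse_variant("chrT:103:TT>T")   # delete one T, written further right
ins_t = parse_variant("chrT:100:G>GT")       # insert a T instead

print(same_edit(del_left, del_right, TOY_REF, TOY_START))
print(same_edit(del_left, ins_t, TOY_REF, TOY_START))
```

Running a check like this against the real reference for each pair listed above would show which pairs are two spellings of one edit and which are genuinely different variants.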
Regarding Clinical_classification_BIC
It looks like this is the same issue as with Germline_or_Somatic_BIC. Essentially, variants that were merged in the old data are no longer merged.
Proposed Action
Review when merges should and shouldn't happen and make any necessary adjustments.
Here are some examples from releaseDiff.py:
Clinical_classification_BIC variant chr17:43074347:CAAGT>C major change: Class 5 -
Clinical_classification_BIC variant chr17:43074427:C>T major change: - Pending
Clinical_classification_BIC variant chr17:43074489:TC>T major change: Class 5 -
Clinical_classification_BIC variant chr17:43076488:CTT>C major change: Class 5 -
Clinical_classification_BIC variant chr17:43076578:ATAG>AAA major change: Class 5 -
Clinical_classification_BIC variant chr17:43082460:C>CT major change: - Class 5
Clinical_classification_BIC variant chr17:43082508:AAC>A major change: Class 5 -
Clinical_classification_BIC variant chr17:43082564:GGT>G major change: Class 5 -
Clinical_classification_BIC variant chr17:43090999:CTT>C major change: Class 5 -
Clinical_classification_BIC variant chr17:43091005:TCA>T major change: Class 5 -
Clinical_classification_BIC variant chr17:43091007:ACT>A major change: Class 5 -
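To see at a glance how many variants lost a value versus gained one, lines like these can be tallied with a short script. This is a rough sketch, not part of releaseDiff.py. It assumes each line has the form "FIELD variant VARIANT major change: OLD NEW", that "-" stands for an empty value, and that exactly one side is empty, which holds for every example in this thread. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the examples in this thread.
LINE_RE = re.compile(
    r"^(?P<field>\S+) variant (?P<variant>\S+) major change: (?P<rest>.+)$"
)


def parse_diff_line(line):
    """Split one diff line into field, variant, old and new values.

    Returns None for lines that do not match the assumed shape.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    rest = match.group("rest")
    # Values may contain spaces ("Class 5"), so locate the "-" placeholder.
    if rest.startswith("- "):
        old, new = "-", rest[2:]
    elif rest.endswith(" -"):
        old, new = rest[:-2], "-"
    else:
        return None  # both sides populated: ambiguous under this assumption
    return {
        "field": match.group("field"),
        "variant": match.group("variant"),
        "old": old,
        "new": new,
    }


def summarize(lines):
    """Count, per field, how many variants lost or gained a value."""
    counts = Counter()
    for line in lines:
        record = parse_diff_line(line)
        if record is None:
            continue
        direction = "lost" if record["new"] == "-" else "gained"
        counts[(record["field"], direction)] += 1
    return counts


sample = [
    "Clinical_classification_BIC variant chr17:43074347:CAAGT>C major change: Class 5 -",
    "Clinical_classification_BIC variant chr17:43074427:C>T major change: - Pending",
    "Clinical_classification_BIC variant chr17:43082460:C>CT major change: - Class 5",
]
for key, count in sorted(summarize(sample).items()):
    print(key, count)
```

A "lost" entry next to a "gained" entry at a nearby position is a hint that a previously merged variant has been split into two records.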