Closed torbjorgen closed 1 year ago
Hi @jemten and @henrikstranneheim,
This is my take on the rank score normalization. Please have a look and see if this is what you expected.
I also need some input on how to test this, since the test coverage in GENMOD does not cover all scoring configs available in MIP pipeline. Should I add new tests to GENMOD to cover the MIP scoring functionality (a bit awkward) or should we run the tests in MIP instead?
Pasting an example output (after completing compound scoring):
##fileformat=VCFv4.1
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=Annotation,Number=.,Type=String,Description="Annotates what feature(s) this variant belongs to.">
##INFO=<ID=1000GAF,Number=1,Type=Float,Description="Frequency in the 1000G database.">
##INFO=<ID=CADD,Number=1,Type=Integer,Description="The CADD relative score for this alternative.">
##INFO=<ID=GeneticModels,Number=.,Type=String,Description="':'-separated list of genetic models for this variant.">
##INFO=<ID=ModelScore,Number=.,Type=String,Description="PHRED score for genotype models.">
##INFO=<ID=Compounds,Number=.,Type=String,Description="List of compound pairs for this variant.The list is splitted on ',' family id is separated with compoundswith ':'. Compounds are separated with '|'.">
##INFO=<ID=RankScore,Number=.,Type=String,Description="The rank score for this variant in this family. family_id:rank_score.">
##INFO=<ID=RankScoreNormalized,Number=.,Type=String,Description="The normalized rank score in range(0, 1) for this variant in this family. family_id:rank_score.">
##INFO=<ID=RankScoreMinMax,Number=.,Type=String,Description="The rank score MIN-MAX bounds. family_id:min:max.">
##contig=<ID=1,length=249250621,assembly=b37>
##reference=file:///humgen/gsa-hpprojects/GATK/bundle/current/b37/human_g1k_v37.fasta
##Software=<ID=genmod,Version=-1.-1.-1,Date="2023-05-04 07:42",CommandLineOptions="context=<click.core.Context object at 0x7fa61b579a50> variant_file=result-annotate.vcf family_file=<_io.TextIOWrapper name='tests/fixtures/recessive_trio.ped' mode='r' encoding='UTF-8'> family_type=ped keyword=Annotation processes=3">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT father mother proband father_2 mother_2 proband_2
1 879537 . T C 100 PASS MQ=1;Annotation=SAMD11,NOC2L;Annotation=SAMD11,NOC2L;1000GAF=0.000199681;CADD=1.248;GeneticModels=1:AR_hom;ModelScore=1:55;RankScore=1:6;RankScoreNormalized=1:0.8823529411764706;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/1:10,10:60 0/1:10,10:60 1/1:10,10:60 0/0:10,10:60 0/1:10,10:60 1/1:10,10:60
1 879541 . G A 100 PASS MQ=1;Annotation=SAMD11,NOC2L;Annotation=SAMD11,NOC2L;1000GAF=0.000599042;CADD=4.003;GeneticModels=1:AR_hom|AR_hom_dn;ModelScore=1:57;RankScore=1:6;RankScoreNormalized=1:0.8823529411764706;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ ./. 0/1:10,10:60 1/1:10,10:60 ./. 0/1:10,10:60 0/1:10,10:60
1 879595 . C T 100 PASS MQ=1;Annotation=SAMD11,NOC2L;Annotation=SAMD11,NOC2L;1000GAF=0.000399361;CADD=8.271;GeneticModels=1:AR_hom_dn;ModelScore=1:55;RankScore=1:6;RankScoreNormalized=1:0.8823529411764706;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/1:10,10:60 0/0:10,10:60 1/1:10,10:60 0/1:10,10:60 0/0:10,10:60 0/1:10,10:60
1 879676 . G A 100 PASS MQ=1;Annotation=SAMD11,NOC2L;Annotation=SAMD11,NOC2L;1000GAF=0.885982;CADD=7.019;RankScore=1:-23;RankScoreNormalized=1:0.029411764705882353;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/1:10,10:60 1/1:10,10:60 1/1:10,10:60 0/1:10,10:60 0/1:10,10:60 0/1:10,10:60
1 879911 . G A 100 PASS MQ=1;Annotation=SAMD11,NOC2L;Annotation=SAMD11,NOC2L;1000GAF=0.00998403;CADD=4.408;Compounds=1:1_880199_G_A>0.9117647058823529|1_880012_A_G>0.8823529411764706|1_880086_T_C>0.8823529411764706|1_880217_T_G>0.9117647058823529;GeneticModels=1:AR_comp|AR_comp_dn;ModelScore=1:55;RankScore=1:6;RankScoreNormalized=1:0.0;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/1:10,10:60 0/0:10,10:60 0/1:10,10:60 0/1:10,10:60 0/0:10,10:60 0/1:10,10:60
1 880012 . A G 100 PASS MQ=1;Annotation=NOC2L;Annotation=NOC2L;1000GAF=0.00119808;CADD=3.326;Compounds=1:1_880199_G_A>0.9117647058823529|1_879911_G_A>0.8823529411764706|1_880086_T_C>0.8823529411764706|1_880217_T_G>0.9117647058823529;GeneticModels=1:AR_comp|AR_comp_dn;ModelScore=1:55;RankScore=1:6;RankScoreNormalized=1:0.0;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/0:10,10:60 0/1:10,10:60 0/1:10,10:60 0/0:10,10:60 0/1:10,10:60 0/1:10,10:60
1 880086 . T C 100 PASS MQ=1;Annotation=NOC2L;Annotation=NOC2L;1000GAF=0.000399361;CADD=0.091;Compounds=1:1_880199_G_A>1.7941176470588234|1_879911_G_A>1.7647058823529411|1_880012_A_G>1.7647058823529411|1_880217_T_G>1.7941176470588234;GeneticModels=1:AD_dn|AR_comp_dn;ModelScore=1:55;RankScore=1:6;RankScoreNormalized=1:0.8823529411764706;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60
1 880199 . G A 100 PASS MQ=1;Annotation=NOC2L;Annotation=NOC2L;CADD=3.450;Compounds=1:1_879911_G_A>1.7941176470588234|1_880012_A_G>1.7941176470588234|1_880086_T_C>1.7941176470588234|1_880217_T_G>1.8235294117647058;GeneticModels=1:AD_dn|AR_comp_dn;ModelScore=1:55;RankScore=1:7;RankScoreNormalized=1:0.9117647058823529;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60
1 880217 . T G 100 PASS MQ=1;Annotation=NOC2L;Annotation=NOC2L;CADD=4.208;Compounds=1:1_880199_G_A>1.8235294117647058|1_879911_G_A>1.7941176470588234|1_880012_A_G>1.7941176470588234|1_880086_T_C>1.7941176470588234;GeneticModels=1:AD_dn|AR_comp_dn;ModelScore=1:55;RankScore=1:7;RankScoreNormalized=1:0.9117647058823529;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60
10 76154051 . A G 100 PASS MQ=1;Annotation=ADK;Annotation=ADK;1000GAF=0.000199681;CADD=9.261;Compounds=1:10_76154073_T_G>0.9705882352941176|10_76154076_G_C>0.9411764705882353;GeneticModels=1:AR_comp_dn;ModelScore=1:55;RankScore=1:6;RankScoreNormalized=1:0.0;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/0:10,10:60 0/1:10,10:60 0/1:10,10:60 0/0:10,10:60 0/1:10,10:60 0/1:10,10:60
10 76154073 . T G 100 PASS MQ=1;Annotation=ADK;Annotation=ADK;CADD=22.4;Compounds=1:10_76154074_C_G>1.4411764705882353|10_76154051_A_G>1.8529411764705883|10_76154076_G_C>1.9117647058823528;GeneticModels=1:AD_dn|AR_comp_dn;ModelScore=1:55;RankScore=1:9;RankScoreNormalized=1:0.9705882352941176;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60
10 76154074 . C G 100 PASS MQ=1;Annotation=ADK;Annotation=ADK;CADD=8.374;RankScore=1:-8;RankScoreNormalized=1:0.47058823529411764;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ ./. 0/1:10,10:60 0/1:10,10:60 0/1:10,10:60 0/1:10,10:60 0/1:10,10:60
10 76154076 . G C 100 PASS MQ=1;Annotation=ADK;Annotation=ADK;CADD=11.65;GeneticModels=1:AD|AD_dn;ModelScore=1:57;RankScore=1:8;RankScoreNormalized=1:0.9411764705882353;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ ./. 0/0:10,10:60 0/1:10,10:60 ./. 0/0:10,10:60 0/1:10,10:60
X 302253 . CCCTCCTGCCCCT C 100 PASS MQ=1;Annotation=PPP2R3B;Annotation=PPP2R3B;1000GAF=0.04;GeneticModels=1:XR|XD;ModelScore=1:55;RankScore=1:4;RankScoreNormalized=1:0.8235294117647058;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/0:10,10:60 0/1:10,10:60 1/1:10,10:60 0/0:10,10:60 1/1:10,10:60 1/1:10,10:60
MT 302253 . CCCTCCTGCCCCT C 100 PASS MQ=1;GeneticModels=1:AR_hom_dn;ModelScore=1:55;RankScore=1:6;RankScoreNormalized=1:0.8823529411764706;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/0:10,10:60 0/1:10,10:60 1/1:10,10:60 0/0:10,10:60 1/1:10,10:60 1/1:10,10:60
Thanks for feedback @henrikstranneheim !
I did rework the compound scoring in genmod/score_variants/compound_scorer.py
to preserve the legacy rank score. In order to better review that code snippet, I suggest you remove whitespace changes (since most of it is just indentation change).
Here's the new output format after compound analysis:
##fileformat=VCFv4.1
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=Annotation,Number=.,Type=String,Description="Annotates what feature(s) this variant belongs to.">
##INFO=<ID=1000GAF,Number=1,Type=Float,Description="Frequency in the 1000G database.">
##INFO=<ID=CADD,Number=1,Type=Integer,Description="The CADD relative score for this alternative.">
##INFO=<ID=GeneticModels,Number=.,Type=String,Description="':'-separated list of genetic models for this variant.">
##INFO=<ID=ModelScore,Number=.,Type=String,Description="PHRED score for genotype models.">
##INFO=<ID=Compounds,Number=.,Type=String,Description="List of compound pairs for this variant.The list is splitted on ',' family id is separated with compoundswith ':'. Compounds are separated with '|'.">
##INFO=<ID=RankScore,Number=.,Type=String,Description="The rank score for this variant in this family. family_id:rank_score.">
##INFO=<ID=RankScoreNormalized,Number=.,Type=String,Description="The normalized rank score in range(0, 1) for this variant in this family. family_id:rank_score.">
##INFO=<ID=RankScoreMinMax,Number=.,Type=String,Description="The rank score MIN-MAX bounds. family_id:min:max.">
##INFO=<ID=CompoundsNormalized,Number=.,Type=String,Description="Rank score as provided by compound analysis, based on RankScoreNormalized. family_id:rank_score">
##contig=<ID=1,length=249250621,assembly=b37>
##reference=file:///humgen/gsa-hpprojects/GATK/bundle/current/b37/human_g1k_v37.fasta
##Software=<ID=genmod,Version=-1.-1.-1,Date="2023-05-08 14:59",CommandLineOptions="context=<click.core.Context object at 0x7efedaea2d10> variant_file=result-annotate.vcf family_file=<_io.TextIOWrapper name='tests/fixtures/recessive_trio.ped' mode='r' encoding='UTF-8'> family_type=ped keyword=Annotation processes=3">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT father mother proband father_2 mother_2 proband_2
1 879537 . T C 100 PASS MQ=1;Annotation=SAMD11,NOC2L;Annotation=SAMD11,NOC2L;1000GAF=0.000199681;CADD=1.248;GeneticModels=1:AR_hom;ModelScore=1:55;RankScore=1:6;RankScoreNormalized=1:0.8823529411764706;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/1:10,10:60 0/1:10,10:60 1/1:10,10:60 0/0:10,10:60 0/1:10,10:60 1/1:10,10:60
1 879541 . G A 100 PASS MQ=1;Annotation=SAMD11,NOC2L;Annotation=SAMD11,NOC2L;1000GAF=0.000599042;CADD=4.003;GeneticModels=1:AR_hom|AR_hom_dn;ModelScore=1:57;RankScore=1:6;RankScoreNormalized=1:0.8823529411764706;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ ./. 0/1:10,10:60 1/1:10,10:60 ./. 0/1:10,10:60 0/1:10,10:60
1 879595 . C T 100 PASS MQ=1;Annotation=SAMD11,NOC2L;Annotation=SAMD11,NOC2L;1000GAF=0.000399361;CADD=8.271;GeneticModels=1:AR_hom_dn;ModelScore=1:55;RankScore=1:6;RankScoreNormalized=1:0.8823529411764706;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/1:10,10:60 0/0:10,10:60 1/1:10,10:60 0/1:10,10:60 0/0:10,10:60 0/1:10,10:60
1 879676 . G A 100 PASS MQ=1;Annotation=SAMD11,NOC2L;Annotation=SAMD11,NOC2L;1000GAF=0.885982;CADD=7.019;RankScore=1:-23;RankScoreNormalized=1:0.029411764705882353;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/1:10,10:60 1/1:10,10:60 1/1:10,10:60 0/1:10,10:60 0/1:10,10:60 0/1:10,10:60
1 879911 . G A 100 PASS MQ=1;Annotation=SAMD11,NOC2L;Annotation=SAMD11,NOC2L;1000GAF=0.00998403;CADD=4.408;Compounds=1:1_880086_T_C>6.0|1_880012_A_G>6.0|1_880199_G_A>7.0|1_880217_T_G>7.0;GeneticModels=1:AR_comp|AR_comp_dn;ModelScore=1:55;RankScore=1:0.0;RankScoreNormalized=1:0.0;RankScoreMinMax=1:-24.0:10.0;CompoundsNormalized=1:1_880086_T_C>0.8823529411764706|1_880012_A_G>0.8823529411764706|1_880199_G_A>0.9117647058823529|1_880217_T_G>0.9117647058823529 GT:AD:GQ 0/1:10,10:60 0/0:10,10:60 0/1:10,10:60 0/1:10,10:60 0/0:10,10:60 0/1:10,10:60
1 880012 . A G 100 PASS MQ=1;Annotation=NOC2L;Annotation=NOC2L;1000GAF=0.00119808;CADD=3.326;Compounds=1:1_880086_T_C>6.0|1_879911_G_A>6.0|1_880199_G_A>7.0|1_880217_T_G>7.0;GeneticModels=1:AR_comp|AR_comp_dn;ModelScore=1:55;RankScore=1:0.0;RankScoreNormalized=1:0.0;RankScoreMinMax=1:-24.0:10.0;CompoundsNormalized=1:1_880086_T_C>0.8823529411764706|1_879911_G_A>0.8823529411764706|1_880199_G_A>0.9117647058823529|1_880217_T_G>0.9117647058823529 GT:AD:GQ 0/0:10,10:60 0/1:10,10:60 0/1:10,10:60 0/0:10,10:60 0/1:10,10:60 0/1:10,10:60
1 880086 . T C 100 PASS MQ=1;Annotation=NOC2L;Annotation=NOC2L;1000GAF=0.000399361;CADD=0.091;Compounds=1:1_879911_G_A>12.0|1_880012_A_G>12.0|1_880199_G_A>13.0|1_880217_T_G>13.0;GeneticModels=1:AD_dn|AR_comp_dn;ModelScore=1:55;RankScore=1:6.0;RankScoreNormalized=1:0.8823529411764706;RankScoreMinMax=1:-24.0:10.0;CompoundsNormalized=1:1_879911_G_A>1.7647058823529411|1_880012_A_G>1.7647058823529411|1_880199_G_A>1.7941176470588234|1_880217_T_G>1.7941176470588234 GT:AD:GQ 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60
1 880199 . G A 100 PASS MQ=1;Annotation=NOC2L;Annotation=NOC2L;CADD=3.450;Compounds=1:1_880086_T_C>13.0|1_879911_G_A>13.0|1_880012_A_G>13.0|1_880217_T_G>14.0;GeneticModels=1:AD_dn|AR_comp_dn;ModelScore=1:55;RankScore=1:7.0;RankScoreNormalized=1:0.9117647058823529;RankScoreMinMax=1:-24.0:10.0;CompoundsNormalized=1:1_880086_T_C>1.7941176470588234|1_879911_G_A>1.7941176470588234|1_880012_A_G>1.7941176470588234|1_880217_T_G>1.8235294117647058 GT:AD:GQ 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60
1 880217 . T G 100 PASS MQ=1;Annotation=NOC2L;Annotation=NOC2L;CADD=4.208;Compounds=1:1_880086_T_C>13.0|1_879911_G_A>13.0|1_880012_A_G>13.0|1_880199_G_A>14.0;GeneticModels=1:AD_dn|AR_comp_dn;ModelScore=1:55;RankScore=1:7.0;RankScoreNormalized=1:0.9117647058823529;RankScoreMinMax=1:-24.0:10.0;CompoundsNormalized=1:1_880086_T_C>1.7941176470588234|1_879911_G_A>1.7941176470588234|1_880012_A_G>1.7941176470588234|1_880199_G_A>1.8235294117647058 GT:AD:GQ 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60
10 76154051 . A G 100 PASS MQ=1;Annotation=ADK;Annotation=ADK;1000GAF=0.000199681;CADD=9.261;Compounds=1:10_76154073_T_G>9.0|10_76154076_G_C>8.0;GeneticModels=1:AR_comp_dn;ModelScore=1:55;RankScore=1:0.0;RankScoreNormalized=1:0.0;RankScoreMinMax=1:-24.0:10.0;CompoundsNormalized=1:10_76154073_T_G>0.9705882352941176|10_76154076_G_C>0.9411764705882353 GT:AD:GQ 0/0:10,10:60 0/1:10,10:60 0/1:10,10:60 0/0:10,10:60 0/1:10,10:60 0/1:10,10:60
10 76154073 . T G 100 PASS MQ=1;Annotation=ADK;Annotation=ADK;CADD=22.4;Compounds=1:10_76154051_A_G>15.0|10_76154076_G_C>17.0|10_76154074_C_G>1.0;GeneticModels=1:AD_dn|AR_comp_dn;ModelScore=1:55;RankScore=1:9.0;RankScoreNormalized=1:0.9705882352941176;RankScoreMinMax=1:-24.0:10.0;CompoundsNormalized=1:10_76154051_A_G>1.8529411764705883|10_76154076_G_C>1.9117647058823528|10_76154074_C_G>1.4411764705882353 GT:AD:GQ 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60 0/0:10,10:60 0/0:10,10:60 0/1:10,10:60
10 76154074 . C G 100 PASS MQ=1;Annotation=ADK;Annotation=ADK;CADD=8.374;RankScore=1:-8;RankScoreNormalized=1:0.47058823529411764;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ ./. 0/1:10,10:60 0/1:10,10:60 0/1:10,10:60 0/1:10,10:60 0/1:10,10:60
10 76154076 . G C 100 PASS MQ=1;Annotation=ADK;Annotation=ADK;CADD=11.65;GeneticModels=1:AD|AD_dn;ModelScore=1:57;RankScore=1:8;RankScoreNormalized=1:0.9411764705882353;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ ./. 0/0:10,10:60 0/1:10,10:60 ./. 0/0:10,10:60 0/1:10,10:60
X 302253 . CCCTCCTGCCCCT C 100 PASS MQ=1;Annotation=PPP2R3B;Annotation=PPP2R3B;1000GAF=0.04;GeneticModels=1:XR|XD;ModelScore=1:55;RankScore=1:4;RankScoreNormalized=1:0.8235294117647058;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/0:10,10:60 0/1:10,10:60 1/1:10,10:60 0/0:10,10:60 1/1:10,10:60 1/1:10,10:60
MT 302253 . CCCTCCTGCCCCT C 100 PASS MQ=1;GeneticModels=1:AR_hom_dn;ModelScore=1:55;RankScore=1:6;RankScoreNormalized=1:0.8823529411764706;RankScoreMinMax=1:-24.0:10.0 GT:AD:GQ 0/0:10,10:60 0/1:10,10:60 1/1:10,10:60 0/0:10,10:60 1/1:10,10:60 1/1:10,10:60
genmod/score_variants/compound_scorer.py::get_rank_score_as_magnitude()
@dnil
I'll think of some tests for rule bound logic and add them this afternoon.
tests/score_variants/test_category_score.py
) I think this looks great as a first step +1. Could you post some tests from Hasta stage using real data to show a real life test. And then let's merge!
Sure thing will do @henrikstranneheim
I find this very interesting. I haven't had a look yet to the conversation here but is there any documentation somewhere?
Documentation related to this PR is now merged and available in https://github.com/Clinical-Genomics/rdds/blob/v1.0.1/src/rdds/exploration_rankscore/README-genmod-v0.0.0.md.
I suggest the next step is to merge this PR, draft a new release of Genmod-MIP and run MIP validation suite? @henrikstranneheim @jemten Do you agree?
I agree 😄, and massive work Tor
👍 Well done!
Rank Score Normalization
Additions to scored VCF file:
TODOs
Current tests in this repo does not cover all plugins, categories used by MIP pipeline. Determine a strategy for testing this PR.
How to prepare for test:
ssh
to ...bash servers/resources/SERVER.scilifelab.se/update-[THIS_TOOL]-stage.sh [THIS-BRANCH-NAME]
How to test:
make test
but this does not cover CompoundScore.Expected outcome:
Review:
This version is a: