Closed lester-pharmgkb closed 3 years ago
This is still a good idea: dynamic test VCF file generation that uses the allele definitions along with specified mutations at given positions. I probably won't get to it for the initial release but I want to keep it around for the next release.
I totally forgot we had this issue from 5 years ago. This is currently being worked on by @atfrase so I'm going to assign this to him. I'm adding it to the v1.0 milestone so we know it needs to be closed out before we release v1.0.
An initial draft script for this was committed in https://github.com/PharmGKB/PharmCAT/commit/848ecd074f40a79b9db5ecdcce85e94a6d8c0d26.
Some quick usage notes:
test_gen.py takes two positional arguments: a gene's definition JSON file, and a directory to write test VCFs into. An example run looks like:
$ ./test_gen.py ../main/resources/org/pharmgkb/pharmcat/definition/alleles/CYP3A5_translation.json ./tests/CYP3A5/
Loading '../main/resources/org/pharmgkb/pharmcat/definition/alleles/CYP3A5_translation.json' ...
done: 8 variants, 9 named alleles
Scanning named alleles ...
done
Checking nucleic code notations ...
done: 14 possible unknown alleles
Generating test cases ...
done: 20 tests
Writing files...
Done: 50 files
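For readers who want to see the shape of that command line, here is a minimal sketch of a two-positional-argument interface using argparse. It is not the committed script: the function names `parse_args` and `main` and the argument names `definition_json` and `output_dir` are assumptions made for illustration, and only the order and meaning of the two arguments come from the usage notes above.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the two positional arguments: definition JSON, then output directory."""
    parser = argparse.ArgumentParser(
        description="Generate test VCFs from a gene definition file (sketch)."
    )
    # Hypothetical argument names; the real script may call them something else.
    parser.add_argument("definition_json", type=Path,
                        help="path to a gene's definition JSON file")
    parser.add_argument("output_dir", type=Path,
                        help="directory to write test VCFs into")
    return parser.parse_args(argv)


def main(argv=None):
    """Validate the inputs and make sure the output directory exists."""
    args = parse_args(argv)
    if not args.definition_json.is_file():
        raise SystemExit(f"definition file not found: {args.definition_json}")
    # Create the target directory (and any missing parents) before writing.
    args.output_dir.mkdir(parents=True, exist_ok=True)
    print(f"Loading '{args.definition_json}' ...")
    return args
```

Calling `main(["CYP3A5_translation.json", "./tests/CYP3A5/"])` would mirror the example run, up to the point where the definition file is actually read.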
Test cases are grouped by their expected haplotype call and written to one VCF file per call, with the ##PharmCATnamedAlleles meta header containing the expected call. If multiple tests yield the same expected call, they appear as additional samples in the VCF. In this example:
$ ls ./tests/CYP3A5/
CYP3A5_s1_noCall1.2.vcf CYP3A5_s1_s9.vcf CYP3A5_s3_s7.vcf CYP3A5_s5_s9.vcf
CYP3A5_s1_noCall2.3.vcf CYP3A5_s2_s2.vcf CYP3A5_s3_s8.vcf CYP3A5_s6_s6.vcf
CYP3A5_s1_noCall2.4.vcf CYP3A5_s2_s3.vcf CYP3A5_s3_s9.vcf CYP3A5_s6_s7.vcf
CYP3A5_s1_noCall2.8.vcf CYP3A5_s2_s4.vcf CYP3A5_s4_s4.vcf CYP3A5_s6_s8.vcf
CYP3A5_s1_noCall.vcf CYP3A5_s2_s5.vcf CYP3A5_s4_s5.vcf CYP3A5_s6_s9.vcf
CYP3A5_s1_s1.vcf CYP3A5_s2_s6.vcf CYP3A5_s4_s6.vcf CYP3A5_s7_s7.vcf
CYP3A5_s1_s2.vcf CYP3A5_s2_s7.vcf CYP3A5_s4_s7.vcf CYP3A5_s7_s8.vcf
CYP3A5_s1_s3.vcf CYP3A5_s2_s8.vcf CYP3A5_s4_s8.vcf CYP3A5_s7_s9.vcf
CYP3A5_s1_s4.vcf CYP3A5_s2_s9.vcf CYP3A5_s4_s9.vcf CYP3A5_s8_s8.vcf
CYP3A5_s1_s5.vcf CYP3A5_s3_s3.vcf CYP3A5_s5_s5.vcf CYP3A5_s8_s9.vcf
CYP3A5_s1_s6.vcf CYP3A5_s3_s4.vcf CYP3A5_s5_s6.vcf CYP3A5_s9_s9.vcf
CYP3A5_s1_s7.vcf CYP3A5_s3_s5.vcf CYP3A5_s5_s7.vcf
CYP3A5_s1_s8.vcf CYP3A5_s3_s6.vcf CYP3A5_s5_s8.vcf
$ cat ./tests/CYP3A5/CYP3A5_s1_s3.vcf
##fileformat=VCFv4.3
##fileDate=20210308
##reference=hg38
##PharmCATnamedAlleles=*1/*3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TEST1 TEST2
chr7 99652613 rs28365083 G . . PASS . GT 0/0 0/0
chr7 99652770 rs41303343 . . . PASS . GT ./. ./.
chr7 99660516 rs28383479 C . . PASS . GT 0/0 0/0
chr7 99665212 rs10264272 C . . PASS . GT 0/0 0/.
chr7 99665237 rs56411402 T C . PASS . GT 0/1 0/1
chr7 99666950 rs55965422 A . . PASS . GT 0/0 0/0
chr7 99672916 rs776746 T C . PASS . GT 0/1 0/1
chr7 99676198 rs55817950 G . . PASS . GT 0/0 0/.
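The grouping described above (one VCF per expected call, with further tests for the same call added as extra sample columns) can be sketched as follows. This is an illustration under assumptions, not the code in test_gen.py: the function names `group_by_expected_call` and `render_vcf`, the tuple and dict shapes used for variants and genotypes, and the reduced set of header lines are all hypothetical.

```python
from collections import defaultdict


def group_by_expected_call(test_cases):
    """Group (expected_call, genotypes) pairs by expected call, keeping first-seen order.

    genotypes maps a variant position to a GT string such as "0/1".
    """
    groups = defaultdict(list)
    for expected_call, genotypes in test_cases:
        groups[expected_call].append(genotypes)
    return dict(groups)


def render_vcf(expected_call, variants, samples):
    """Render one VCF text with one sample column per test case.

    variants: list of (chrom, pos, rsid, ref, alt) tuples.
    samples: list of dicts mapping pos -> GT; positions not listed become "./.".
    """
    names = [f"TEST{i}" for i in range(1, len(samples) + 1)]
    lines = [
        "##fileformat=VCFv4.3",
        f"##PharmCATnamedAlleles={expected_call}",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                   "INFO", "FORMAT"] + names),
    ]
    for chrom, pos, rsid, ref, alt in variants:
        gts = [sample.get(pos, "./.") for sample in samples]
        lines.append("\t".join([chrom, str(pos), rsid, ref, alt, ".", "PASS",
                                ".", "GT"] + gts))
    return "\n".join(lines) + "\n"


# Usage: two tests with the same expected call end up as TEST1 and TEST2.
variants = [
    ("chr7", 99665237, "rs56411402", "T", "C"),
    ("chr7", 99672916, "rs776746", "T", "C"),
]
cases = [
    ("*1/*3", {99665237: "0/1", 99672916: "0/1"}),
    ("*1/*3", {99665237: "0/1", 99672916: "0/1"}),
]
for call, samples in group_by_expected_call(cases).items():
    print(render_vcf(call, variants, samples))
```

Keeping the expected call in a meta header means a test harness can read the expectation straight from the file it is about to feed to the caller, without a separate answer key.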
The categories of test cases currently generated are:
I'm leaving this issue to @scdudek and @markwoon to manage. Please close when you, Katrin, and Michelle are satisfied with the testing.
I'm closing this since we've released v1.0 and the test generation is working. I know there's still more to improve but, at this point, those should be in more specific issues.
Currently I am making test case VCF files by hand to test the haplotype caller. It would be useful if the TSV allele processor step was able to generate a large volume of test VCFs from the data it is already processing. At the very minimum it could autogenerate the *1/*1 allele VCFs. It would be better if it could use an input file with a gene and diplotype list such as:
CYP2C19 *1/*1
CYP2C19 *1/*4
CYP2C9 *1/*1
CYP2C9 *2/*3
etc.
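A gene and diplotype list of the kind proposed here could be read with something like the sketch below. The exact file format is an assumption (one whitespace-separated gene and diplotype per line, blank lines and `#` comments ignored), and `parse_diplotype_list` is a hypothetical helper, not part of PharmCAT.

```python
def parse_diplotype_list(text):
    """Parse lines of '<gene> <allele1>/<allele2>' into (gene, allele1, allele2) tuples."""
    requests = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Skip blank lines and comments (an assumed convenience, not a stated rule).
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 2 or parts[1].count("/") != 1:
            raise ValueError(
                f"line {line_number}: expected '<gene> <a1>/<a2>', got {raw!r}"
            )
        gene, diplotype = parts
        allele1, allele2 = diplotype.split("/")
        if not allele1 or not allele2:
            raise ValueError(f"line {line_number}: empty allele in {raw!r}")
        requests.append((gene, allele1, allele2))
    return requests


# Usage with the diplotypes from the example list.
example = """\
CYP2C19 *1/*1
CYP2C19 *1/*4
CYP2C9 *1/*1
CYP2C9 *2/*3
"""
for gene, a1, a2 in parse_diplotype_list(example):
    print(gene, a1, a2)
```

Each parsed tuple would then be handed to whatever step looks up the two named alleles in the gene's definition and emits the matching genotypes.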