PharmGKB / PharmCAT

The Pharmacogenomic Clinical Annotation Tool
Mozilla Public License 2.0

Generation of test vcfs by allele processor #1

Closed lester-pharmgkb closed 3 years ago

lester-pharmgkb commented 8 years ago

Currently I am making test case VCF files by hand to test the haplotype caller. It would be useful if the TSV allele processor step could generate a large volume of test VCFs from the data it is already processing. At the very minimum, it could autogenerate the *1/*1 allele VCFs. It would be better if it could use an input file with a gene and diplotype list such as:

CYP2C19 *1/*1
CYP2C19 *1/*4
CYP2C9 *1/*1
CYP2C9 *2/*3

etc.
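A minimal sketch of how such an input list might be read, assuming one whitespace-separated gene and diplotype per line in star-allele notation (for example `CYP2C19 *1/*4`). The function name `parse_diplotype_list` and the file layout are hypothetical and not part of PharmCAT.

```python
from typing import List, Tuple


def parse_diplotype_list(text: str) -> List[Tuple[str, str, str]]:
    """Parse lines like 'CYP2C19 *1/*4' into (gene, allele1, allele2) tuples.

    Hypothetical helper: the 'GENE A/B' layout is an assumption based on the
    example list in this issue, not an existing PharmCAT input format.
    """
    requests = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) != 2 or parts[1].count("/") != 1:
            raise ValueError(f"line {lineno}: expected 'GENE A/B', got {raw!r}")
        gene, diplotype = parts
        allele1, allele2 = diplotype.split("/")
        requests.append((gene, allele1, allele2))
    return requests


example = """\
CYP2C19 *1/*1
CYP2C19 *1/*4
CYP2C9 *1/*1
CYP2C9 *2/*3
"""

for gene, allele1, allele2 in parse_diplotype_list(example):
    print(gene, allele1, allele2)
```

Rejecting malformed lines early would keep a typo in the list from silently producing a test VCF for the wrong diplotype.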

whaleyr commented 6 years ago

This is still a good idea: dynamic test VCF file generation that uses the allele definitions along with specified mutations at given positions. I probably won't get to it for the initial release, but I want to keep it around for the next release.

whaleyr commented 3 years ago

I totally forgot we had this issue from 5 years ago. This is currently being worked on by @atfrase, so I'm going to assign this to him. I'm adding it to the v1.0 milestone so we know it needs to be closed out before we release v1.0.

atfrase commented 3 years ago

An initial draft script for this was committed in https://github.com/PharmGKB/PharmCAT/commit/848ecd074f40a79b9db5ecdcce85e94a6d8c0d26.

Some quick usage notes:

test_gen.py takes two positional arguments: a gene's definition JSON file, and a directory to write test VCFs into. An example run looks like:

$ ./test_gen.py ../main/resources/org/pharmgkb/pharmcat/definition/alleles/CYP3A5_translation.json ./tests/CYP3A5/
Loading '../main/resources/org/pharmgkb/pharmcat/definition/alleles/CYP3A5_translation.json' ... 
done: 8 variants, 9 named alleles

Scanning named alleles ... 
done

Checking nucleic code notations ... 
done: 14 possible unknown alleles

Generating test cases ... 
done: 20 tests

Writing files...
Done: 50 files
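Since the script handles one gene per run, covering several genes means repeating the call. The following is a rough sketch of building those command lines, assuming every definition file follows the `<GENE>_translation.json` naming seen in the example and that each gene gets its own output directory. The helper `build_test_gen_commands` is hypothetical and not part of the committed script.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed naming pattern, taken from the CYP3A5 example in this comment.
SUFFIX = "_translation.json"


def build_test_gen_commands(definition_files: Iterable[Path],
                            out_root: Path,
                            script: str = "./test_gen.py") -> List[List[str]]:
    """Return one argv list per definition file: [script, definition, output_dir]."""
    commands = []
    for path in sorted(definition_files):
        if not path.name.endswith(SUFFIX):
            continue  # skip files that do not follow the assumed naming pattern
        gene = path.name[: -len(SUFFIX)]
        commands.append([script, str(path), str(out_root / gene)])
    return commands


definitions = [
    Path("alleles/CYP3A5_translation.json"),
    Path("alleles/CYP2C9_translation.json"),
]
for argv in build_test_gen_commands(definitions, Path("tests")):
    print(" ".join(argv))
    # To actually run a command: subprocess.run(argv, check=True)
```

Returning argv lists rather than running anything keeps the sketch side-effect free; each list can be passed to `subprocess.run` once the paths are confirmed.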

Test cases are grouped by their expected haplotype call and written to one VCF file per call, with the ##PharmCATnamedAlleles meta header containing the expected call. If multiple tests yield the same expected call, they appear as additional samples in the VCF. In this example:

$ ls ./tests/CYP3A5/
CYP3A5_s1_noCall1.2.vcf  CYP3A5_s1_s9.vcf  CYP3A5_s3_s7.vcf  CYP3A5_s5_s9.vcf
CYP3A5_s1_noCall2.3.vcf  CYP3A5_s2_s2.vcf  CYP3A5_s3_s8.vcf  CYP3A5_s6_s6.vcf
CYP3A5_s1_noCall2.4.vcf  CYP3A5_s2_s3.vcf  CYP3A5_s3_s9.vcf  CYP3A5_s6_s7.vcf
CYP3A5_s1_noCall2.8.vcf  CYP3A5_s2_s4.vcf  CYP3A5_s4_s4.vcf  CYP3A5_s6_s8.vcf
CYP3A5_s1_noCall.vcf     CYP3A5_s2_s5.vcf  CYP3A5_s4_s5.vcf  CYP3A5_s6_s9.vcf
CYP3A5_s1_s1.vcf         CYP3A5_s2_s6.vcf  CYP3A5_s4_s6.vcf  CYP3A5_s7_s7.vcf
CYP3A5_s1_s2.vcf         CYP3A5_s2_s7.vcf  CYP3A5_s4_s7.vcf  CYP3A5_s7_s8.vcf
CYP3A5_s1_s3.vcf         CYP3A5_s2_s8.vcf  CYP3A5_s4_s8.vcf  CYP3A5_s7_s9.vcf
CYP3A5_s1_s4.vcf         CYP3A5_s2_s9.vcf  CYP3A5_s4_s9.vcf  CYP3A5_s8_s8.vcf
CYP3A5_s1_s5.vcf         CYP3A5_s3_s3.vcf  CYP3A5_s5_s5.vcf  CYP3A5_s8_s9.vcf
CYP3A5_s1_s6.vcf         CYP3A5_s3_s4.vcf  CYP3A5_s5_s6.vcf  CYP3A5_s9_s9.vcf
CYP3A5_s1_s7.vcf         CYP3A5_s3_s5.vcf  CYP3A5_s5_s7.vcf
CYP3A5_s1_s8.vcf         CYP3A5_s3_s6.vcf  CYP3A5_s5_s8.vcf
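As a sanity check on the listing above: 9 named alleles give 45 unordered pairs (homozygous pairs included), which together with the 5 noCall files accounts for the 50 files reported. The sketch below reproduces the 45 pair filenames, assuming the alleles are numbered *1 through *9 and that `*` is written as `s` in filenames. Both assumptions are inferred from the listing, not taken from the script, and the function name is hypothetical.

```python
from itertools import combinations_with_replacement
from typing import Iterable, List


def expected_call_filenames(gene: str, allele_numbers: Iterable[int]) -> List[str]:
    """List one filename per unordered allele pair, e.g. CYP3A5_s1_s3.vcf.

    Hypothetical helper: the 's<N>' naming is inferred from the directory
    listing in this comment, not from test_gen.py itself.
    """
    pairs = combinations_with_replacement(sorted(allele_numbers), 2)
    return [f"{gene}_s{a}_s{b}.vcf" for a, b in pairs]


files = expected_call_filenames("CYP3A5", range(1, 10))
print(len(files))   # 45
print(files[:3])    # ['CYP3A5_s1_s1.vcf', 'CYP3A5_s1_s2.vcf', 'CYP3A5_s1_s3.vcf']
```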

$ cat ./tests/CYP3A5/CYP3A5_s1_s3.vcf
##fileformat=VCFv4.3
##fileDate=20210308
##reference=hg38
##PharmCATnamedAlleles=*1/*3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TEST1   TEST2
chr7    99652613    rs28365083  G   .   .   PASS    .   GT  0/0 0/0
chr7    99652770    rs41303343  .   .   .   PASS    .   GT  ./. ./.
chr7    99660516    rs28383479  C   .   .   PASS    .   GT  0/0 0/0
chr7    99665212    rs10264272  C   .   .   PASS    .   GT  0/0 0/.
chr7    99665237    rs56411402  T   C   .   PASS    .   GT  0/1 0/1
chr7    99666950    rs55965422  A   .   .   PASS    .   GT  0/0 0/0
chr7    99672916    rs776746    T   C   .   PASS    .   GT  0/1 0/1
chr7    99676198    rs55817950  G   .   .   PASS    .   GT  0/0 0/.
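To use these files in an automated check, a test harness needs the expected call from the meta header and the genotypes of each sample. Here is a minimal sketch of reading both, assuming whitespace-separated columns with no spaces inside fields and the nine fixed VCF columns before the sample columns. The function name `read_test_vcf` is hypothetical; this is not how PharmCAT itself reads the files.

```python
from typing import Dict, List, Optional, Tuple


def read_test_vcf(text: str) -> Tuple[Optional[str], Dict[str, List[str]]]:
    """Return (expected call, {sample name: [genotype per variant row]})."""
    expected_call: Optional[str] = None
    samples: List[str] = []
    genotypes: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if line.startswith("##PharmCATnamedAlleles="):
            expected_call = line.split("=", 1)[1].strip()
        elif line.startswith("#CHROM"):
            # Columns 0-8 are CHROM..FORMAT; everything after is a sample.
            samples = line.split()[9:]
            genotypes = {name: [] for name in samples}
        elif line and not line.startswith("#"):
            fields = line.split()
            for name, gt in zip(samples, fields[9:]):
                genotypes[name].append(gt)
    return expected_call, genotypes


# Shortened excerpt of the CYP3A5_s1_s3.vcf example above.
vcf_text = """\
##fileformat=VCFv4.3
##PharmCATnamedAlleles=*1/*3
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tTEST1\tTEST2
chr7\t99665212\trs10264272\tC\t.\t.\tPASS\t.\tGT\t0/0\t0/.
chr7\t99665237\trs56411402\tT\tC\t.\tPASS\t.\tGT\t0/1\t0/1
"""

call, by_sample = read_test_vcf(vcf_text)
print(call)       # *1/*3
print(by_sample)  # {'TEST1': ['0/0', '0/1'], 'TEST2': ['0/.', '0/1']}
```

Each sample (TEST1, TEST2) can then be run through the haplotype caller separately and compared against the single expected call.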

The categories of test cases currently generated are:

whaleyr commented 3 years ago

I'm leaving this issue to @scdudek and @markwoon to manage. Please close when you, Katrin, and Michelle are satisfied with the testing.

whaleyr commented 3 years ago

I'm closing this since we've released v1.0 and the test generation is working. I know there's still more to improve but, at this point, those should be in more specific issues.