etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
547 stars 166 forks source link

LOH VCF file #10

Closed pascalg closed 9 years ago

pascalg commented 9 years ago

The LOH command is not working for me, it doesn't seem to like my VarScan VCF file. Any ideas?

Thanks

cnvkit.py scatter Sample.cnr -s Sample.cns -v VarScan.snp.Germline.vcf
Traceback (most recent call last):
  File "/usr/local/bin/cnvkit.py", line 4, in <module>
    __import__('pkg_resources').run_script('CNVkit==0.3.5-dev', 'cnvkit.py')
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 534, in run_script
    if req in processed:
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1445, in run_script

  File "/usr/local/lib/python2.7/dist-packages/CNVkit-0.3.5_dev-py2.7.egg/EGG-INFO/scripts/cnvkit.py", line 9, in <module>

  File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 714, in _cmd_scatter

  File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 734, in do_scatter

  File "build/bdist.linux-x86_64/egg/cnvlib/ngfrills/vcf.py", line 13, in load_vcf
    # 'format': list of keys (strings)
  File "build/bdist.linux-x86_64/egg/cnvlib/ngfrills/vcf.py", line 33, in filter_vcf_lines
    # When instantiated, a reference can be passed to the VCF class.  This may be any class that supports a
KeyError: 'AF'
etal commented 9 years ago

It looks like the "AF" (allele frequency, e.g. 0.5 or 1) INFO field is not emitted by VarScan, so I'll have to modify this part of CNVkit to handle it.

Could you post a couple of lines from your VarScan VCF file here? I don't need the whole thing, just a few lines and ideally the header too, if you can.

pascalg commented 9 years ago

Thanks, here it is:

##fileformat=VCFv4.1
##source=VarScan2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth of quality bases">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Indicates if record is a somatic mutation">
##INFO=<ID=SS,Number=1,Type=String,Description="Somatic status of variant (0=Reference,1=Germline,2=Somatic,3=LOH, or 5=Unknown)">
##INFO=<ID=SSC,Number=1,Type=String,Description="Somatic score in Phred scale (0-255) derived from somatic p-value">
##INFO=<ID=GPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor+normal versus no variant for Germline calls">
##INFO=<ID=SPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor versus normal for Somatic/LOH calls">
##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=<ID=DP4,Number=1,Type=String,Description="Strand read counts: ref/fwd, ref/rev, var/fwd, var/rev">
CHROM,FROM,REF,ALT,-,-,-,FILTER,-
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR
1   69511   .   A   G   .   PASS    DP=50;SS=1;SSC=0;GPV=9.9117E-30;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:20:0:20:100%:0,0,15,5 1/1:.:30:0:30:100%:0,0,23,7
1   139213  .   A   G   .   PASS    DP=19;SS=1;SSC=5;GPV=5.9414E-9;SPV=2.6316E-1    GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:9:0:9:100%:0,0,3,6    1/1:.:10:2:8:80%:1,1,1,7
1   762273  .   G   A   .   PASS    DP=109;SS=1;SSC=0;GPV=4.3979E-65;SPV=1E0    GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:54:0:54:100%:0,0,8,46 1/1:.:55:0:55:100%:0,0,9,46
etal commented 9 years ago

I've just merged a pull request (#11) that should fix this. Can you try it now?

pascalg commented 9 years ago

Thanks, it is working now. However, I am getting different chromosomes with significant LOH shift when I use "scatter" or "loh" with the same VCF file. Shouldn't the result be the same? From the plot I think "scatter" is wrong as I cannot see why it marks the chrosomome significant (all VAFs <0.8).

etal commented 9 years ago

I agree with your assessment of it. I'll take a closer look.

etal commented 9 years ago

The loh command had an option --min-depth which filtered out SNVs with depth < 20, while scatter was missing this option and did no filtering. Fixed in: bbe9af2c16410693dd71d946806388044e7801d6

The deeper problem was that the test statistic I used was garbage, so I disabled it here: b339ca1d80ac0a07424dbbd3191f1d70a957afa9

I will at some point implement a better test for LOH and then re-enable reporting and colorization in both plots.