etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
545 stars 165 forks source link

Missing points in VAF plot #290

Open micknudsen opened 6 years ago

micknudsen commented 6 years ago

Maybe I am getting this all wrong; this could very well be a feature and not a bug. I have created a plot using the scatter command:

cnvkit.py scatter test.cnr -s test.cns -v test.vcf.gz -c chr20:19569338-21565309 -o plot.pdf

plot

There appears to be a loss (and hence LOH) in part of the RALGAPA2 gene. In fact, there is a homozygote somatic variant in the area not segmented in the VAF plot supporting the hypothesis:

$ bedtools intersect -a test.vcf.gz -b <(echo -e "chr20\t20500000\t20750000")
chr20   20572016    rs55967804  A   C   33954.93    PASS    AC=2;AF=1.00;AN=2;BaseQRankSum=2.00;ClippingRankSum=0.107;DB;DP=25;ExcessHet=0.2098;FS=1.244;InbreedingCoeff=0.0795;MQ=54.25;MQRankSum=-2.300e-02;POSITIVE_TRAIN_SITE;QD=12.85;ReadPosRankSum=0.578;SOR=0.796;VQSLOD=3.90;culprit=DP    GT:AD:DP:GQ:PL  1/1:1,24:25:38:898,38,0

However, there is no corresponding dot in the VAF plot. I have tried fiddling around with the code and extended the y-axis, but there is still no dot. Is that the expected behavior?

I have also tried --min-variant-depth 1, but there is still no dot.

etal commented 6 years ago

This is expected behavior, where only variants that appear to be germline heterozygous SNPs are kept:

I also agree there is some room for improvement. Let me know if either of these appeal to you:

micknudsen commented 6 years ago

Thanks for the detailed information. Looks like the scatter function is intended to do something different than I inspected.

As part of a larger setup, I am using CNVkit to call somatic copy number variants in paired tumor-normal samples, and in areas with variations, like the RALGAPA2 gene in the plot above, I would like to know if there is LOH.

To do that, I take the tumor sample and call all variants against the reference (not the paired normal). The resulting VCF thus contains both somatic and germline variants, and that is the one I'm using as the -v input.

Summing up, I believed that the default scatter behavior was more like the second of your suggested features (except the different colors):

Homozygous variants from a tumor-only VCF could be plotted in a different color, e.g. small black points (and still excluded from the segment overlay).

The code base is rather big, and I am afraid that I am going to break something. Could you point me to the place where the filtering of homozygote variants is performed? I would like to have a go at implementing an --include-homozygote-snvs option and see how that turns out.

etal commented 6 years ago

Sure. Your VCF is fine for use with CNVkit, it's just that CNVkit is missing a feature that would be useful here.

To get the plot you need, I recommend just commenting out this line in cnvlib.cmdutil.load_het_snps: https://github.com/etal/cnvkit/blob/master/cnvlib/cmdutil.py#L42

And regenerate your plot. The segment lines in the lower panel (VAF) might be less useful, but you should be able to see the homozygote variants that were lost before.

etal commented 6 years ago

Here's guidance on T/N variant calling to get the most from CNVkit:

If you do paired calling with e.g. FreeBayes or VarDict, keeping both samples including SNPs while also marking somatic variants, you should be able to see the RALGAPA2 variant of interest while also minimizing other noise in the VAF profile.

micknudsen commented 6 years ago

Thanks for your assistance. As part of my workflow, I perform paired T/N calling using MuTect2, but since that only spits out somatic variants (and some dubious ones tagged germline_risk, which could be germline), I am missing out on a lot of informative points in the VAF plot. I am interested in spotting LOH and therefore want to use all variants. That is why I separately run HaplotypeCaller on the tumor sample alone (against the reference genome) and use the resulting VCF file for VAF plotting.

I tried using the unfiltered MuTect2 VCF file, but it just gives me an empty VAF plot. Even after adding SOMATIC in the INFO field for samples that PASS filters, the plot is still empty.

zcat mutect2.vcf.gz | sed 's/PASS\t/PASS\tSOMATIC;/g' > mutect2somatic.vcf

$ grep ^chr20 mutect2somatic.vcf | head
chr20   147556  .   A   G   .   t_lod_fstar ECNT=1;HCNT=6;MAX_ED=.;MIN_ED=.;NLOD=5.67;TLOD=5.18 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1  0/0:19,0:0.00:0:0:.:639,0:9:10  0/1:31,2:0.061:1:1:0.500:1038,75:16:15
chr20   156165  rs148702419 C   G   .   clustered_events;germline_risk  DB;ECNT=2;HCNT=1;MAX_ED=8;MIN_ED=8;NLOD=0.602;TLOD=11.69    GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:PGT:PID:QSS:REF_F1R2:REF_F2R1  0/0:4,0:0.00:0:0:.:0|1:156165_C_G:145,0:2:2 0/1:3,5:0.500:3:2:0.400:0|1:156165_C_G:83,174:3:0
chr20   156173  rs6141010   C   T   .   clustered_events;germline_risk  DB;ECNT=2;HCNT=1;MAX_ED=8;MIN_ED=8;NLOD=0.602;TLOD=11.69    GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:PGT:PID:QSS:REF_F1R2:REF_F2R1  0/0:4,0:0.00:0:0:.:0|1:156165_C_G:132,0:2:2 0/1:4,4:0.500:3:1:0.250:0|1:156165_C_G:130,140:4:0
chr20   464725  .   A   C   .   multi_event_alt_allele_in_normal    ECNT=2;HCNT=20;MAX_ED=1;MIN_ED=1;NLOD=7.57;TLOD=16.10   GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:PGT:PID:QSS:REF_F1R2:REF_F2R1  0/0:71,2:0.036:2:0:0.00:0|1:464725_A_C:2423,69:28:43    0/1:81,8:0.097:7:1:0.125:0|1:464725_A_C:2748,274:39:42
chr20   464726  .   A   ATCACATT    .   multi_event_alt_allele_in_normal    ECNT=2;HCNT=20;MAX_ED=1;MIN_ED=1;NLOD=7.56;TLOD=18.58   GT:AD:AF:ALT_F1R2:ALT_F2R1:PGT:PID:QSS:REF_F1R2:REF_F2R1    0/0:70,2:0.036:2:0:0|1:464725_A_C:2440,64:28:42 0/1:80,5:0.109:4:1:0|1:464725_A_C:2822,150:39:41
chr20   489174  .   T   A   .   PASS    SOMATIC;ECNT=1;HCNT=6;MAX_ED=.;MIN_ED=.;NLOD=20.68;TLOD=6.54    GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1  0/0:96,0:0.00:0:0:.:3277,0:50:46    0/1:115,4:0.034:2:2:0.500:3967,143:55:60
chr20   832703  rs146792691 ATG A   .   germline_risk;str_contraction   DB;ECNT=1;HCNT=29;MAX_ED=.;MIN_ED=.;NLOD=0.502;RPA=8,7;RU=TG;STR;TLOD=11.25 GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1    0/0:36,3:0.100:1:2:1134,99:14:22    0/1:90,8:0.091:2:6:2728,246:57:33
chr20   2410766 rs2076650   T   C   .   germline_risk   DB;ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=1.20;TLOD=12.23 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1  0/0:4,0:0.00:0:0:.:140,0:2:2    0/1:0,3:1.00:1:2:0.333:0,108:0:0
chr20   2844651 .   G   T   .   t_lod_fstar ECNT=1;HCNT=8;MAX_ED=.;MIN_ED=.;NLOD=14.14;TLOD=4.19    GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1  0/0:61,0:0.00:0:0:.:2100,0:30:31    0/1:66,3:0.035:3:0:1.00:2307,100:28:38
chr20   2945386 .   G   A   .   t_lod_fstar ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=6.23;TLOD=4.46 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1  0/0:22,0:0.00:0:0:.:725,0:11:11 0/1:28,2:0.067:1:1:0.500:974,67:12:16

$ cnvkit.py scatter test.cnr -s test.cns -v mutect2somatic.vcf -c chr20 -o plot.pdf

plot

Everything works great when I comment out the suggested line. I will go ahead an maintain a private fork for time being. I am using CNVkit in a clinical setting, and the molecular biologists analyzing the output are screaming for more more dots in the scatter plot 😃

etal commented 6 years ago

Yep, Mutect2 is completely useless for this type of analysis and there's no way to get it to emit the useful SNP sites. HaplotypeCaller or another caller designed for germline calling (in addition to tumor calling) is the way to go.

I'll leave this ticket open while working on the two enhancements I mentioned above, plotting somatic or tumor-homozygous variants in another color.