etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
540 stars 165 forks source link

negative BAF values estimating LOH #601

Open john-alexander opened 3 years ago

john-alexander commented 3 years ago

I'm trying to estimate LOH in cnvkit using purity, ploidy estimates from sequenza.

I use cnvkit call option with its default thresholds,

conda activate **cnvkit0.9.7**
cnvkit.py call $tumour_id.recalibrated.cns -y -v $mutect_tumour_dir/$tumour_id/$tumour_id.passed.vcf.gz --purity $purity --ploidy $ploidy -o $tumour_id.call.cns

The output gives me -ve BAF values which im not sure how to interpret

  chromosome    start      end        gene                                              log2       baf  cn  cn1 cn2   depth probes  weight
chr16 66669013 69943499 CMTM4,DYNC1LI2,DYNC1LI2,LOC106699570,LOC106699570,LOC106699570,TERB1  -0.871623 -0.211176  1   1   0 71.7805   2818 2418.18

To confirm are LOH regions where cn1!=cn2 and BAF=0. Also what about estimating copy number netural LOH.

Thank you.

tskir commented 3 years ago

Welp, that doesn't seem good.

To debug this further, I'm afraid I'm going to need a minimal reproducible example with the files & the exact parameters you're using. If these are private data, feel free to only share the relevant chunks and/or to randomise some values, as long as the bug (the negative BAFs) can still be reproduced. Do you think you could provide something like that?

john-alexander commented 3 years ago

I'm sending private files via zendto (ktsukanov [at] ebi.ac.uk)

Command used:

conda activate cnvkit0.9.7
cnvkit.py call tumour_AB.recalibrated.cns -y -m threshold -t=-1,-0.3,0.3,1 -v tumour_AB.passed.vcf.gz --purity 0.37 --ploidy 2 -o tumour_AB.call.cns

Thank you very much.

tskir commented 3 years ago

@john-alexander Thank you, files received! I'll now look into that more closely

john-alexander commented 3 years ago

@tskir thank you for looking into this -wondering if there's a fix/update on this soon. Best, J

tskir commented 3 years ago

Hi @john-alexander, thank you for waiting. I was now able to take a look into this. As far as I can see, the problem has two sides to it.

First, the cnvkit.py call command you are using does not specify the tumour and normal sample IDs (--sample-id, --normal-id). Because of this, only the first sample is used and the whole VCF is treated as unpaired tumour sample with no normal.

Secondly, and much more importantly, the VCF file which you provided appears to only contain somatic variants: 0/1 in tumour and 0/0 in normal.

In contrast, what CNVkit requires for allele-specific copy number estimation are heterozygous, non-somatic variants. Unfortunately, this is only mentioned in the documentation in passing: see here, the fourth paragraph.

Because of these two factors, each variant from the first sample is assumed to be germline and is expected to appear in the normal sample as well; and math for adjusting BAFs consequently breaks down.

Could you try re-running cnvkit.py call with a different callset of only non-somatic variation (if you have it), and also specifying the sample IDs, and see if things are better?

etal commented 3 years ago

Further info on how to prepare a VCF for BAF calculation and LOH detection:

As @tskir notes in #616 , the documentation could stand to be more clear on this topic.

Nickier0510 commented 3 years ago

@etal Hi,I saw PureCN's suggestion to add these two parameters --genotype-germline-sites true --genotype-pon-sites true when running Mutect2 to obtain germline snp. So, do you mean that the vcf here can also be obtained by this method?

etal commented 3 years ago

I haven't tried those options myself but the names look like they would do what CNVkit needs, i.e. keep the germline SNP sites in the output VCF.

bwubb commented 9 months ago

Hello,

Im having a similar issue; negative BAF values and everything in cn2 is 0.

I did not appear to have any issues generating the cnvkit calls. I ran PureCN. I could not get any of the Mutect or Mutect2 vcfs to work, even with the genotype-germline arguments, but I got results with VarScan2 output.

I can confirm I have both somatic and germline/loh calls from varscan2 in my vcf input.

(python3.7) (base) [bwubb@cybertron cnvkit]$ cnvkit.py call results/2005121651-2.cns -i 2005121651-2 -n normal -x female -m clonal -v ../data/work/S31285117/2005121651-2/varscan2/2005121651-2.fpfilter.norm.clean.vcf.gz --ploidy 2 --purity 0.29 -o 2005121651-2.call.cns
Selected test sample 2005121651-2 and control sample normal
Loaded 36888 records; skipped: 2594 somatic, 738 depth
Skipping 305 additional somatic records based on T/N genotypes
Kept 21905 heterozygous of 36583 VCF records
Treating sample 2005121651-2 as female
Rescaling sample with purity 0.29, ploidy 2
Wrote 2005121651-2.call.cns with 416 regions
(python3.7) (base) [bwubb@cybertron cnvkit]$ awk ' {print $6,$9,$10,$11 }' 2005121651-2.call.cns | head
baf cn cn1 cn2
-1.22414 2 2 0
-1.22414 -1 -1 0
-1.22414 3 3 0
-1.22414 7 7 0
-1.22414 2 2 0
0.384641 223.63 39 37.0472
-1.22414 3 3 0
-1.22414 1 1 0
-1.22414 0 0 0

The results definitely look suspicious to say the least. Has anyone had any success regarding this issue? Im not sure what more I can test or look for. Still trying to figure out what cnvkit doesnt like about my vcf file.

bwubb commented 1 month ago

I tested with a vcf from a different caller, VarDict, and I was able to properly generate baf values. Im not certain why it worked; I do not know what cnvKit is looking for in its vcf. I though it was an AF value in FORMAT (which VarScan2 output lacks), but if I manually calculate and add in a AF value for each variant (yes I properly formatted the header and everything) it still produces baf of 0 from VarScan2 output.

Im sure Im missing something...