etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
540 stars 165 forks source link

BAF doesn't seem to be correct #407

Open weishwu opened 5 years ago

weishwu commented 5 years ago

I concatenated the germline and LOH variants called by VarScan and used that in cnvkit call. A 43Mb region was called as a deletion with BAF=0. However, by looking at the SNPs that are 0/1 in normal sample in this region, 60% changed to 0/0 or 1/1, while 40% remained as 0/1, based on the genotype calling from VarScan. How can the average BAF become 0 in this region?

etal commented 5 years ago

I agree that doesn't seem correct. Could you share the relevant portion of this .cnr file?

One possibility is that the SNPs in this segment were rejected by CNVkit for some reason, based on the contents of the VCF records there.

weishwu commented 5 years ago

Thanks for replying quickly.

The .cnr, .cns and the vcf files can be downloaded here: https://usegalaxy.org/u/mapped%20reads%20in%20illumina%2011/h/cnvkitdata

The vcf contains the variants called by GATK HaplotypeCaller from tumor and normal samples together. Mutect2 was used to call somatic variants and I compared these two and removed somatic variants.

My command-line is:

# batch processing with all samples
cnvkit batch *_Tumor.bam --normal *_Normal.bam --method wgs -p 1 --fasta ${genome_fasta} --annotate ${gene_anno} --output-reference C57BL6NJ_reference.cnn --scatter --diagram --output-dir cnvkit_call/
# redo segmentation with sample 6665, by using a vcf for LOH detection
cnvkit call Sample_6665_recal_Tumor.cns -y -v Rhim_CU44_HC_SnpIndel_PASS_DP8GQ20_sample_6665_nonSomatic.vcf -i sample_6665 -n C57BL_6NJ -z 0.25 -o Sample_6665_recal_Tumor_withHCnonSomatic_recall.cns

The output file is "Sample_6665_recal_Tumor_withHCnonSomatic_recall_fixed.cns", in which chr1 is mostly called as cn=0 with baf=0.

bwubb commented 1 month ago

Hello,

I am also getting baf values of 0 when using varscan2 germline/loh calls. It happens for every tumor/normal pair in every segment. I do not get any errors or warnings

>cnvkit.py segment T448-03.cnr -v varscan2.snps.clean.vcf -i T448-03 -n 10111 -o T448-03.cns
Selected test sample T448-03 and control sample 10111
Loaded 27024 records; skipped: 0 somatic, 5839 depth
Skipping 10 additional somatic records based on T/N genotypes
Kept 15987 heterozygous of 27014 VCF records
Segmenting with method 'cbs', significance threshold 0.0001, in 1 processes
Re-segmenting on variant allele frequency
Done, now finalizing
Re-segmenting on variant allele frequency
Done, now finalizing
Re-segmenting on variant allele frequency
...

Seems like it is working, but the results do not carry to loh calls.

chromosome      start   end     log2    baf     depth   probes  weight
1       12145   121884666       -0.0538616      0       74.6717 17653   13987.6
1       143279634       145608570       -0.0667195      0       71.8267 369     252.614
1       145609070       146065793       0.164872        0       89.5717 185     149.673
1       146066293       148586512       -0.0188062      0       76.2323 549     387.678
1       148587956       248945922       0.114365        0       75.6463 14322   11366.8
2       10500   92110323        0.000133892     0       53.6863 9954    7901.54

Happy to share vcf information too, but let me just say that they are all PASS snps with lots and lots of 0/1 positions in the normal. And all of the tumor and normal names match exactly.

Is there a specific FORMAT column that is required? GT is present, but there is not a AF column.

Thank you.

bwubb commented 1 month ago

Oh wow, it is the missing AF column I think. I tried a different caller, VarDict, which has an AF column. The stdout changed to reflect breakpoints and the baf column is populated.