etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

Clarification on parameters for calling #584

Open vymao opened 3 years ago

vymao commented 3 years ago

Hi,

I am trying to call allelic CN values. I have a Mutect2 VCF, but I also have a curated list of germline SNVs that I got by filtering the Mutect2 VCF. I am wondering how to properly use either in the calling.

For example, I see that cnvkit.py call has these parameters:

  -i SAMPLE_ID, --sample-id SAMPLE_ID
                        Name of the sample in the VCF (-v/--vcf) to use for
                        b-allele frequency extraction.
  -n NORMAL_ID, --normal-id NORMAL_ID
                        Corresponding normal sample ID in the input VCF
                        (-v/--vcf). This sample is used to select only
                        germline SNVs to calculate b-allele frequencies.

Are we meant to use both flags? I am confused because it seems that using --sample-id will use all the SNVs in a file, whereas --normal-id will selectively use some SNVs, but I am not sure how. Some clarification would be very helpful.

tskir commented 3 years ago

Hi @vymao! To be honest I've never used CNVkit for tumour CNV calling in practice. The following advice is theoretical and is based on investigating the source code, so perhaps take it with a grain of salt.

When both SAMPLE_ID and NORMAL_ID are provided, all VCF records are used for both of them. CNVkit looks at certain FORMAT fields (DP, AD, GT) and deduces whether a given variant is present in tumour and/or normal samples from those fields.
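To make the idea concrete, here is a rough Python sketch of that kind of check. It is not CNVkit's actual code: the function names, the "heterozygous in the normal means germline" rule, and the example record are all my own assumptions for illustration. It also assumes AD holds comma-separated integer read counts for the ref and alt alleles.

```python
# Hypothetical sketch, NOT CNVkit's implementation: decide from the FORMAT
# fields whether a record looks like a germline heterozygous SNV, and compute
# the tumour alt-allele fraction for it.

def parse_sample(format_field, sample_field):
    """Map FORMAT keys (e.g. GT:AD:DP) to one sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

def is_het(gt):
    """True for a called genotype with two different alleles, e.g. 0/1 or 1|0."""
    alleles = gt.replace("|", "/").split("/")
    return "." not in alleles and len(set(alleles)) > 1

def alt_fraction(sample):
    """Alt-allele fraction from AD (assumed 'ref,alt' counts); None if no reads."""
    counts = [int(x) for x in sample["AD"].split(",")]
    total = sum(counts)
    return counts[1] / total if total else None

# Made-up record: tumour sample column first, normal sample column second.
record = "chr1\t1000\t.\tA\tG\t.\tPASS\t.\tGT:AD:DP\t0/1:12,28:40\t0/1:15,14:29"
fields = record.split("\t")
tumour = parse_sample(fields[8], fields[9])
normal = parse_sample(fields[8], fields[10])

# Assumed rule: keep the record only if the normal sample is heterozygous.
if is_het(normal["GT"]):
    print("germline het SNV, tumour alt fraction:", alt_fraction(tumour))
```

The point is only that both sample columns of the same record are read together, so the normal sample acts as a filter rather than as a separate input file.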

The way I understand it, the intended use for those parameters is to have two separate samples sequenced (tumour and normal, respectively, from the same patient) and to supply CNVkit with a joint callset.

If you are certain that your filtered Mutect2 callset is a reasonable substitute for an actual germline sample (but don't rely on me here, as I'm not an expert in tumour genetics), then you should combine your data into one VCF with two samples. For variants which are only present in the somatic callset, fill in a GT of 0/0 and an AD of 0 for the germline sample.
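As a minimal sketch of what I mean by combining, assuming each callset has already been reduced to a mapping from (chrom, pos, ref, alt) to a "GT:AD" string: the helper name `merge_calls`, the sample names TUMOUR and NORMAL, and the placeholder `0/0:0,0` (my reading of "AD of 0" as zero reads for both ref and alt) are all hypothetical. A real VCF would also need the `##fileformat` and other meta header lines, which I've left out.

```python
# Hypothetical sketch: build the body of a two-sample VCF from two callsets.

HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
ABSENT = "0/0:0,0"  # assumption: GT 0/0 and zero ref/alt reads for a missing call

def merge_calls(tumour, normal, tumour_id="TUMOUR", normal_id="NORMAL"):
    """Yield the column header and records of a two-sample VCF.

    tumour, normal: dicts mapping (chrom, pos, ref, alt) -> 'GT:AD' string.
    A variant missing from one callset gets the ABSENT placeholder there.
    """
    yield f"{HEADER}\t{tumour_id}\t{normal_id}"
    for key in sorted(set(tumour) | set(normal)):
        chrom, pos, ref, alt = key
        t = tumour.get(key, ABSENT)
        n = normal.get(key, ABSENT)
        yield "\t".join(
            [chrom, str(pos), ".", ref, alt, ".", "PASS", ".", "GT:AD", t, n]
        )

# Made-up example: the second variant is only in the somatic callset.
tumour_calls = {
    ("chr1", 100, "A", "G"): "0/1:10,12",
    ("chr1", 200, "C", "T"): "0/1:20,5",
}
normal_calls = {
    ("chr1", 100, "A", "G"): "0/1:9,11",
}

lines = list(merge_calls(tumour_calls, normal_calls))
print("\n".join(lines))
```

The two sample column names in the header are then what you would pass as --sample-id and --normal-id, together with -v/--vcf pointing at the combined file.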

Please let me know if you have any more questions and I'll try to help!

vymao commented 3 years ago

Thanks, though I am still a bit confused. What happens if I provide only one ID, or none? How does the filtering differ from when I provide both?

Also, when you say "you should combine your data into one VCF with two samples", what does this mean if I have a callset that I am confident represents germline variants?