etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

Query regarding identifying gene level CNVs, autobin and bintest #734

Open LavanyaRanganathan95 opened 2 years ago

LavanyaRanganathan95 commented 2 years ago

Hello,

I am using CNVkit 0.9.9 to detect copy number variants across a tumor-normal cohort. The tumor and matched normal libraries were prepared using hybrid capture. I am currently running CNVkit on 8 tumor-normal pairs and would like to scale up to ~200 tumor-normal pairs. My overall goal is to obtain a list of copy number variants across different genes.

I have used the batch command as suggested in the documentation and have a few questions based on what I have observed so far.

I used the following commands to generate the antitarget BED file, and then ran the batch command on all of the tumor and normal BAM files:

    cnvkit.py antitarget my_target.bed -g data/access-5kb-mappable.hg19.bed -o my_antitargets.bed
    cnvkit.py batch *tumor.bam -n *normal.bam -t my_target.bed -a my_antitargets.bed -f hg19.fa -m hybrid --segment-method cbs -d output_dir

As per the CNVkit docs, cnvkit.py call generates absolute copy numbers for the segments obtained during segmentation. To identify gene-level CNVs, would you suggest that I compare the breakpoints predicted by cnvkit.py breaks with the call.cns file, or that I use the bintest.cns file? To my understanding, bintest.cns also reports log2 copy ratios, but at the level of individual bins rather than segments. Is my understanding correct? Please correct me if I am wrong. On the same note, could you please elaborate on the purpose of "bintest.cns"?

I have another question about the autobin command. To my understanding, autobin helps to identify suitable bin sizes based on the target panel (my_target.bed) used for sequencing. cnvkit.py autobin produced output suggesting a target bin size of 45 bp. Do you know why the suggested bin size is so low? Note that the panel we used for sequencing has 793 baits.

    Detected file format: bed
    Detected file format: bed
    Estimated read length 151.0
    Wrote /tmp/tmpyo1joxc8.bed with 100 regions
    Limiting est. bin size 885876 to given max. 500000
    Splitting large targets
    Wrote output_dir/bait.target.bed with 4285 regions
    Wrote output_dir/bait.antitarget.bed with 6157 regions
                 Depth     Bin size
    Target:      2225.647  45
    Antitarget:  0.113     500000

I compared this against the target.bed generated by cnvkit.py batch. That BED file had bin sizes similar to my_target.bed, except for targets larger than 360 bp. I also compared it with the bin sizes in bintest.cns: there, the bin sizes were ~260 bp, and where the tiles in the panel were larger than 400 bp, they were broken down into smaller bins of ~260 bp.

    chromosome  start      end        gene      depth    log2       weight    p_bintest
    chr1        2491217    2491455    TNFRSF14  875.437  -0.681199  0.977831  6.49944e-05
    chr1        2492049    2492169    TNFRSF14  856.3    -0.836096  0.966052  7.53761e-05
    chr1        2493065    2493299    TNFRSF14  1028.92  -0.527241  0.977826  0.00260868
    chr1        2494246    2494365    TNFRSF14  1381.59  -1.18558   0.967556  3.7711e-09
    chr1        27101401   27101613   ARID1A    2683.78  0.542985   0.978304  0.00166497
    chr1        150551303  150551543  MCL1      3119.3   0.833722   0.97338   6.58256e-06

Hence, could you please explain how I should interpret the results from autobin, and why they are so different from the other BED files that were generated?

Thanks in advance.

Regards,
Lavanya

tetedange13 commented 2 years ago

Hi @LavanyaRanganathan95,

I will try to answer some of your questions:

Regarding CNVkit results

Your understanding of bintest is correct.
=> I cannot really elaborate more on it, except that it uses Z-test p-values (B-H corrected) to extract bins whose copy number is significantly altered.
=> Also, when you provide it a .cns in addition to the mandatory .cnr (which is the case through batch), it will treat the .cns as a list of known alterations and will try to find others (see this issue).
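To make "Z-test p-values (B-H corrected)" more concrete, here is a minimal generic sketch of that kind of test. It is not CNVkit's implementation: the function names, the per-bin standard errors and the `alpha` threshold are all hypothetical and only illustrate the technique.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a z-score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_bins(log2_ratios, std_errors, alpha):
    """Indices of bins whose log2 ratio differs from 0 after B-H correction.

    std_errors are hypothetical per-bin standard errors; how CNVkit derives
    its own is not covered here.
    """
    pvals = [two_sided_p(l / s) for l, s in zip(log2_ratios, std_errors)]
    adjusted = benjamini_hochberg(pvals)
    return [i for i, p in enumerate(adjusted) if p < alpha]
```

For example, `significant_bins([-1.2, 0.05, 0.9], [0.2, 0.2, 0.2], alpha=0.01)` keeps the first and third bins and drops the middle one, whose ratio is close to zero.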

I am not sure I understand your goal: "to obtain a list of copy number variants across different genes".
=> What do you mean by "copy number variants"? For example, for the BRCA2 gene, would one variant be "exon 11 deletion" and another "exon 12 duplication"?
=> I guess you could join all your *tumor*.cns files (after choosing your favorite among (raw) .cns, call.cns and bintest.cns).
=> I have never used the breaks subcommand, but it may be useful too.
=> heatmap produces a useful representation of several samples, but with 200 it will not be very readable.
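If a per-gene list is what you are after, one way to get there from bin-level rows is to group them by the gene column. The sketch below uses the bintest.cns rows quoted in the question; the function name and the choice of summary statistics (bin count, weight-averaged log2, smallest p-value) are my own assumptions, not something CNVkit produces. Real .cns files are tab-separated, so splitting on tabs would be safer there than the whitespace split used here.

```python
from collections import defaultdict

# Rows copied from the bintest.cns excerpt in the question.
BINTEST_ROWS = """\
chromosome start end gene depth log2 weight p_bintest
chr1 2491217 2491455 TNFRSF14 875.437 -0.681199 0.977831 6.49944e-05
chr1 2492049 2492169 TNFRSF14 856.3 -0.836096 0.966052 7.53761e-05
chr1 2493065 2493299 TNFRSF14 1028.92 -0.527241 0.977826 0.00260868
chr1 2494246 2494365 TNFRSF14 1381.59 -1.18558 0.967556 3.7711e-09
chr1 27101401 27101613 ARID1A 2683.78 0.542985 0.978304 0.00166497
chr1 150551303 150551543 MCL1 3119.3 0.833722 0.97338 6.58256e-06
"""


def summarize_by_gene(table_text):
    """Group bin-level rows by gene and compute simple per-gene summaries."""
    lines = table_text.strip().splitlines()
    header = lines[0].split()
    per_gene = defaultdict(list)
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        per_gene[row["gene"]].append(
            (float(row["log2"]), float(row["weight"]), float(row["p_bintest"]))
        )

    summary = {}
    for gene, bins in per_gene.items():
        total_weight = sum(w for _, w, _ in bins)
        summary[gene] = {
            "n_bins": len(bins),
            # Weight-averaged log2 ratio over the gene's bins.
            "mean_log2": sum(l * w for l, w, _ in bins) / total_weight,
            "min_p": min(p for _, _, p in bins),
        }
    return summary


if __name__ == "__main__":
    for gene, stats in summarize_by_gene(BINTEST_ROWS).items():
        print(gene, stats)
```

Running the same function on each tumor sample and collecting the results per gene would give a cohort-level table that stays readable even with 200 samples.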

Regarding autobin

I refer you to the CNVkit documentation, which is very complete.
=> To sum up, this command uses one or several BAM files to quickly calculate coverage and then estimate the target (and antitarget) bin size (and write the corresponding BED files).
=> I do not know why it outputs small bin sizes (I have experienced it myself), but the documentation explains how bin size impacts CNV calling.
=> Personally, I do not use autobin and go with the default bin size for the target subcommand (run by batch), which is 267.
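For what it is worth, the numbers in the autobin log are consistent with a simple rule: pick the bin size so that each bin collects a fixed number of sequenced bases. The sketch below assumes that rule and assumes 100000 bases per bin (my reading of the documented default of autobin's --bp-per-bin option; please check the docs for your version). The function name is hypothetical, and the size limits are placeholders, except the 500000 antitarget maximum, which appears in the log.

```python
def estimate_bin_size(depth, bp_per_bin=100_000, min_size=20, max_size=20_000):
    """Bin size such that one bin holds about bp_per_bin sequenced bases.

    bp_per_bin, min_size and max_size are assumed values for illustration.
    """
    size = int(round(bp_per_bin / depth))
    # Clamp to the allowed range, as the "Limiting est. bin size" log line suggests.
    return max(min_size, min(size, max_size))


# Depths taken from the autobin output quoted in the question.
target_bin = estimate_bin_size(2225.647)
antitarget_bin = estimate_bin_size(0.113, min_size=500, max_size=500_000)

if __name__ == "__main__":
    print("target bin size:", target_bin)
    print("antitarget bin size:", antitarget_bin)
```

Under these assumptions, a target depth above 2000x is what drives the suggested bin size down to a few tens of bp: the deeper the coverage, the fewer bases are needed to reach the same number of reads per bin.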

Misc

IMO it is not very relevant to compare the bin sizes in .cns files with those in the "target" BED files.
=> Except for bintest.cns, these files contain segments, in other words "gathered bins".

Looking at your full batch command, be aware that you do not have to run antitarget beforehand.
=> Without the -a parameter, batch will generate the antitarget file automatically (but do not forget to pass the access file to batch too, with the -g / --access parameter, to get consistent results).
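Since you plan to scale to ~200 pairs, you may end up scripting the call. Here is a minimal sketch that only builds and prints the batch command without -a and with -g instead. The helper name and the BAM file names are hypothetical; the other paths and options are the ones from your own command.

```python
import shlex


def build_batch_command(
    tumor_bams,
    normal_bams,
    targets="my_target.bed",
    access="data/access-5kb-mappable.hg19.bed",
    fasta="hg19.fa",
    output_dir="output_dir",
):
    """Build a cnvkit.py batch command that lets batch derive the antitargets."""
    return [
        "cnvkit.py", "batch", *tumor_bams,
        "-n", *normal_bams,
        "-t", targets,
        "-g", access,  # access file passed to batch instead of a prebuilt -a BED
        "-f", fasta,
        "-m", "hybrid",
        "--segment-method", "cbs",
        "-d", output_dir,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_batch_command(
        ["pair1_tumor.bam", "pair2_tumor.bam"],
        ["pair1_normal.bam", "pair2_normal.bam"],
    )
    print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run` once you are happy with it.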

Hope this helps! Have a nice day,
Felix.