etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

Incremental reference Generation #787

Open justin-greenblatt opened 1 year ago

justin-greenblatt commented 1 year ago

I have a use case where my database grows with time.

I start off with 15 BAM files but expect to receive another 100 BAM files over the course of this year, and I expect to call CNVs for all samples. I start the process by feeding the 15 initial BAM files to the "autobin" command, generating the "target.bed" and "antitarget.bed" bins. I then calculate coverages (.cnn files) with the "coverage" command for all 15 initial samples and use them to build the "reference.cnn". Later I use this "reference.cnn" for CNV calling with the "fix", "segment" and "export" commands.
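For context, the initial pipeline I describe above looks roughly like this (file names are illustrative, not my actual paths; command names and flags are from the CNVkit docs):

```shell
# Generate bins from the initial cohort; autobin picks the
# median-size BAM and writes *.target.bed / *.antitarget.bed
cnvkit.py autobin sample_*.bam -t baits.bed -g access.hg38.bed

# Per-sample coverage on targets and antitargets
for bam in sample_*.bam; do
    name=$(basename "$bam" .bam)
    cnvkit.py coverage "$bam" baits.target.bed -o "${name}.targetcoverage.cnn"
    cnvkit.py coverage "$bam" baits.antitarget.bed -o "${name}.antitargetcoverage.cnn"
done

# Pool all .cnn files into one copy-number reference
cnvkit.py reference *coverage.cnn -f hg38.fa -o reference.cnn

# Downstream calling for one sample
cnvkit.py fix sample_01.targetcoverage.cnn sample_01.antitargetcoverage.cnn reference.cnn -o sample_01.cnr
cnvkit.py segment sample_01.cnr -o sample_01.cns
cnvkit.py export bed sample_01.cns -o sample_01.bed
```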

Let's say I receive another 10 samples for my project. Can I use the same "target.bed" and "antitarget.bed" generated previously with "autobin" to calculate their coverages? That way I would only calculate coverage for the new samples and re-run the "reference" command. I would still have to run the downstream CNV-calling commands again, but I would skip the costly "coverage" calculation for my old samples.
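Concretely, the incremental update I have in mind would be (a sketch, assuming the original bin files are kept and the old .cnn files are still on disk; file names are hypothetical):

```shell
# Coverage only for the 10 NEW samples, reusing the OLD bins
for bam in new_sample_*.bam; do
    name=$(basename "$bam" .bam)
    cnvkit.py coverage "$bam" baits.target.bed -o "${name}.targetcoverage.cnn"
    cnvkit.py coverage "$bam" baits.antitarget.bed -o "${name}.antitargetcoverage.cnn"
done

# Rebuild the pooled reference from old + new .cnn files
cnvkit.py reference *coverage.cnn -f hg38.fa -o reference.cnn
```

Only the "fix"/"segment"/"export" steps would then be re-run per sample against the updated reference.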

The other, less practical option is to run "autobin" again with all 15 + 10 samples, generating new bins and new coverage files for all of the initial samples as well, before re-running the downstream steps.

My question can also be seen as:

- How important is the group of BAM files given to "autobin"? From what I have read, the command takes the median-sized BAM file and uses it to generate the bins.

- If only a small subset of the final group of BAM files is given to "autobin", will that negatively affect my whole CNV-calling process with CNVkit?

Thanks to anyone who has read this question. I believe others have had this issue too.

Best Regards :) Justin

tetedange13 commented 1 year ago

Hi @justin-greenblatt,

1°) Regarding binning

What is your sequencing method? Hybridization capture? Amplicon? WGS? => According to the documentation, CNVkit uses a default (approximate) bin size of 267 bp for hybrid capture (maybe for amplicon too, but I'm not sure) => In my experience this value works fine in most cases, so you can probably start with it and skip the autobin step

If you want to stick with autobin anyway, you read right: the CNVkit documentation says "If multiple BAMs are given, use the BAM with median file size." => The chosen set of BAMs for autobin will indeed affect your whole CNV-calling process, but I could not say whether positively or negatively, nor by how much

2°) Regarding reference building

I guess all your samples are considered "tumor" (i.e. you can expect them to present a CNV in one of your genes of interest)? Or do you have a "normal" for each "tumor" sample (i.e. paired)? => CNVkit can create a reference with NO normal sample (a "flat reference") => This could be useful in your case, as you would create this flat reference once (with cnvkit.py reference, or else cnvkit.py batch --normal <NO_BAM> --output-reference my_flatRef.cnn) => Then feed it to all your fix commands, e.g.: cnvkit.py fix sample1.targetcoverage.cnn sample1.antitargetcoverage.cnn my_flatRef.cnn

The batch command above will build the flat reference with the default bin size of 267 bp => Unless you also pass --targets and --antitargets with the files produced by autobin (I think so; I have never used it myself) => See also cnvkit.py batch --help
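Under that assumption, building a flat reference on your existing autobin-produced bins might look like this (untested sketch; file names illustrative, flag names per cnvkit.py batch --help):

```shell
# --normal with no BAMs listed => flat reference (no normal samples)
cnvkit.py batch --normal \
    --targets baits.target.bed --antitargets baits.antitarget.bed \
    --fasta hg38.fa \
    --output-reference my_flatRef.cnn
```

A flat reference built this way would never need rebuilding as new samples arrive, which sidesteps your incremental-reference problem entirely (at the cost of losing the bin-level bias correction a pooled reference provides).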
Hope this helps! Best regards, Felix