etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
540 stars 164 forks source link

is autobin need for tumor sample in tumor only mode #640

Closed worker000000 closed 2 years ago

worker000000 commented 3 years ago

Thanks a lot for updating such a powerful tool continously. For cnv calling, bin size is a very important part.

in the paired-mode, we can use the batch mode directly, which involves autobin directly.

but for tumor only mode, no matter with baseline or not, i will address one by one

P1: With baseline, I do as the following

cnvkit.py target my_baits.bed --annotate refFlat.txt --split -o my_targets.bed cnvkit.py antitarget my_targets.bed -g data/access-5kb-mappable.hg19.bed -o my_antitargets.bed cnvkit.py reference -o FlatReference.cnn -f ucsc.hg19.fa -t targets.bed -a antitargets.bed

generate all the normal dedup bam file

cnvkit.py batch -n *Normal.bam --output-reference _newreference.cnn -t my_targets.bed -a my_antitargets.bed -f hg19.fasta -g data/access-5kb-mappable.hg19.bed

I also dedup my tumor bam, here I named it Sample.bam

but I am a totally puzzledthe in following part, should I do autobin for my tumor.bam, to generate a

cnvkit.py access hg19.fa -o access.hg19.bed cnvkit.py autobin Sample.bam -t baits.bed -g access.hg19.bed

For tumor.bam

cnvkit.py coverage Sample.bam baits.target.bed -o Sample.targetcoverage.cnn cnvkit.py coverage Sample.bam baits.antitarget.bed -o Sample.antitargetcoverage.cnn

and then

cnvkit.py fix Sample.targetcoverage.cnn Sample.antitargetcoverage.cnn _newreference.cnn -o Sample.cnr cnvkit.py segment Sample.cnr -o Sample.cns cnvkit.py call Sample.cns -o Sample.call.cns

but this will cause Sample.bam(a tumor sample bam) use a different bin_size with its compared _newreference.cnn, is it reasonable

here I even doubt when I use command

cnvkit.py batch -n *Normal.bam --output-reference _newreference.cnn -t my_targets.bed -a my_antitargets.bed -f hg19.fasta -g data/access-5kb-mappable.hg19.bed

should I use autobin command to generated a seperated my_targets.bed -a my_antitargets.bed for each normal.bam?

the final question, can cnvkit get the exon level annotation, which is always needed, thanks again

worker000000 commented 3 years ago

I use stand DNA sample to check my pipeline, a MET 10 DNA copy, making 4 samples with 50 ng 100 ng 150 ng 200 ng(with umi ). and use 20 normal blood sample to build the baseline, but the MET cn in call.cns is 0, 18, 21, 21, very strange. the latter three seems to be two times of the standard sample

when I want to know the gene level, for here, I want to gene MET copy number, here use the cnr or the call.cns file?

call.cns

image

cnr,

for cnr, does I need to calculate the averager log2ratio image

https://cnvkit.readthedocs.io/en/stable/calling.html

here talking something about gain and loss In a diploid genome, a single-copy gain in a perfectly pure, homogeneous sample has a copy ratio of 3/2. In log2 scale, this is log2(3/2) = 0.585, and a single-copy loss is log2(1/2) = -1.0.

but my dna copy number should be 10, not 20 or near 20, is there any reason for this? Thanks a lot

tetedange13 commented 3 years ago

Hi @worker000000 ,

Not an author of CNVkit, but please read this: https://docs.github.com/en/github/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax => And try to reformat your issue, because: It is very hard to read --> hard to understand what is your actual problem / questions --> hard to help you...

Anyway, some points with few things I could understand from your text:

  1. Make sure you are running CNVkit with correct -m , --seq-method parameter, matching your wetlab (because it has an impact on all downstream steps) => Are you well on a WES dataset? Sequenced using an hybrid-capture kit?

  2. No question here, if you have some "normal" samples, always use them to generate a new_reference.cnn (as you did): CNVkit performs better this way

  3. You are talking about deduplication, but it can skew CNVkit's results sometimes => Try to compare when running CNVkit on NOT-deduplicated BAM?

  4. CNVkit is not an annotation tool, it only adds a minimal annotation to your BED using provided "refFlat" => Annotation is a whole topic in itself and especially "exon-level" annotation (which database? Which transcript for each gene? etc) => I guess you will have to annotate yourself your "targets.bed" at "exon-level", before feeding CNVkit with it

  5. Concerning your "BZ1" sample where you get a copy_numer of 0 on MET => First check if your MET region does not have any coverage issue? This could be more an experimental problem (rather than CNVkit producing inaccurate results) => Or maybe that, for this sample only, you used a deduplicated BAM contrary to the other samples?

  6. Depending on type of "tumor" sample you have, there could be a question of tumor heterogeneity to adress => If you have information of "tumor cellularity", you can feed cnvkit.py call with it and CNVkit will maybe give you copy_values closer to what you expect => Also could you give us details about how you know you should expect this "10 copies of MET"? Was it found by another technique? Because they can have their own flaws too

  7. Yes bin size is important, but most of the time letting CNVkit calculate it automatically is OK => Means you can safely stick with cnvkit.py batch subcommand for each of your tumor samples (less possible mistakes too) => Rather try to test a different segmenting method? (--segment-method parameter of batch)

Hope this helps. Have a nice day. Felix.

worker000000 commented 3 years ago

Thanks a lot for your kind and help and suggestion @tetedange13 My standard sample of MET 10 copy was brought from manufacturer.(it is proved by ddPCR)

All my bam have been dedup, the sample was tas, not pcr ampicon, The default is hybrid. so I do not need to revise

I used batch command to create a reference of 20 normal samples

I do not know how to use batch command to call cnv on a tumor sample and a pool reference

when taling about bin size, I am taliking about the following things, you know when creating a pool refeence, the command cnvkit.py batch -n *Normal.bam --output-reference new_reference.cnn -t my_targets.bed -a my_antitargets.bed -f hg19.fasta -g data/access-5kb-mappable.hg19.bed here needs a my_targets.bed and my_antitargets.bed , I created it by the follwoing command

cnvkit.py target my_baits.bed --annotate refFlat.txt --split -o my_targets.bed
cnvkit.py antitarget my_targets.bed -g data/access-5kb-mappable.hg19.bed -o my_antitargets.bed

but when I need to deal with my tumor.bam(following command called as Sample.bam for the sake of using the docs command), I do not know wherher I should use the same my_targets.bed and my_antitargets.bed as created by the above command or I should so as the following command

MET converage image # image

cnvkit.py autobin tSmaplebam -t baits.bed -g access.hg19.bed --target-output-bed baits.target.bed --antitarget-output-bed baits.antitarget.bed cnvkit.py coverage Sample.bam baits.target.bed -o Sample.targetcoverage.cnn cnvkit.py coverage Sample.bam baits.antitarget.bed -o Sample.antitargetcoverage.cnn cnvkit.py fix Sample.targetcoverage.cnn Sample.antitargetcoverage.cnn my_reference.cnn (which was created from pool normal samples) -o Sample.cnr cnvkit.py segment Sample.cnr -o Sample.cns cnvkit.py call Sample.cns -o Sample.call.cns

so can you see why I talked about the autobin command here , I do not the bed used for the tumor.bam should use which, is a same for each tumor.bam or created by abutobin from each bam? though I have already test autobin for a bam, it raise error "ValueError: Reference is missing 133 bins found in BZ1", so it seems autobin shoulded be used for a tumor.bam in tumor only mode, but with a constant bin for different bam, will it be noisy?

when tallikng about the annotation level, refflat has all the information, transcript and exon number, it will be better if cnvkit can do this

I also aksed when report a copy number for a gene, we should use the cnr or the cns? Purity is important as you also said, but I only have tumor.bam, I do not have the paired whilte blood sample, I do not know how to calculate purity, so here why I asked the purity tool(especially the tumor only putiry tool ALL-FIT)

tetedange13 commented 3 years ago

Hi @worker000000,

About your commands

All my bam have been dedup, the sample was tas, not pcr ampicon, The default is hybrid. so I do not need to revise

  1. Are you 100% sure about that? Because "TAS = Targeted Amplicon Sequencing", right? If yes, it is exactly why the --method amplicon of CNVkit was designed for
  2. Create your reference with: cnvkit.py batch -n *Normal.bam -d <output_dir> -t <my_baits.bed> -f hg19.fasta -g data/access-5kb-mappable.hg19.bed --annotate refFlat.txt => Were "my_baits.bed" is your manufacturer BED, with regions expected to be sequenced => CNVkit will automatically create in : "reference.cnn", "my_baits.target.bed" and "my_baits.antitarget.bed" (all already annotated with gene name)
  3. Then the simplest way to run CNVkit on a tumor sample with "reference" previously made from your normal samples is: cnvkit.py batch BZ1_tumor.bam -d <output_dir> -r <output_dir>/reference.cnn => As you see, no need to specify "my_baits.target.bed" nor "my_baits.antitarget.bed", as CNVkit can deduce them from your "reference.cnn" (you should see in 2 "tmp" files called "reference.target-tmp.bed" and "reference.antitarget-tmp.bed") => You can also run same command on all your tumor_BAM in once: cnvkit.py batch BZ*_tumor.bam -d <output_dir> -r <output_dir>/reference.cnn, CNVkit will process them each after another

About coverage issues

About how to exploit CNVkit results

Have a nice day. Felix.

worker000000 commented 3 years ago

Thanks a lot for your kind help @tetedange13 my sample is target hybird , not amplicon, when I use the dedup bam(extract umi from fastq, this is a littlie closer to the 10 copy), and do bwa and gatk markduplicates, the result is as follows, and BZ1(the least sample input becomes normal)

cnr file

image

cns file

image

call.cns file, this result is strange, BZ is 15 copy, BZ2,3,4 is 4 copy, can you give some help about this

image

do you think there is any other advice to make it more closer to 10 copy MET

worker000000 commented 3 years ago

in fact, I alsos use decon, but the result is more strange, it can not gvie any tumor bam cnv ,but gives so normal bam cnv, though the correlation is very high image

but I have not done cnvkit.py metrics .cnr -s .cns, because the batch command does not output cnr and cns file, if it can included in the batch command, it can be more convenient

cnvkit.py batch -n *Normal.bam --output-reference new_reference.cnn \ -t my_targets.bed -a my_antitargets.bed \ -f hg19.fasta -g data/access-5kb-mappable.hg19.bed

tetedange13 commented 3 years ago

Hi @worker000000,

  1. If after stopping UMI-dedup your BZ1 came back to normal regarding coverage / log_ratio, I guess this a proof that UMI-deduplication can skew CNVkit results => However, you should try without using GATK MarkDuplicates, as this can bring biais too (run CNVkit on simple BAM, right after alignment with BWA) => Indeed duplicate-marked reads are ignored by CNVkit

  2. Once previous step done, stick with (raw) ".cns" for now (easier to analyze) => Because in "call.cns" produced by batch, there is segment filtering and consecutive "gained" regions are agreggated => In your (raw) ".cns" and ".cnr", CN=10 should correspond to log_ratio =~ 2.32 => In general you have log_ratio higher than expected (even if BZ1 is not that far, in your latest results), let us see without "GATK MarkDuplicates"

  3. I am not sure if this will help in your case, but have you tried to use another segmenting method already? => This can be easily changed with --seg-method parameter of batch (at 2nd step I gave you, after having created your pooled reference) => In my experience, HAAR method can perform well on tumor data

  4. This is expected for batch to not produce ".cnr" and ".cns" files for "normal" samples when creating pooled reference => If you want some ".cnr" and ".cns" for your "normal" dataset, you should run calling pipeline for "normal" samples only, against a "flat reference" => Command should be: cnvkit.py batch *Normal.bam -n -t my_targets.bed -a my_antitargets.bed -f hg19.fasta -g data/access-5kb-mappable.hg19.bed (Parameter -n is after the BAM filenames and nothing after it after means "flat reference")

Have a nice day. Felix.

worker000000 commented 3 years ago

@tetedange13 Thanks a lot for your helpful suggestion, the tumor bam is the sort bam, but the pool reference is dedup(all the test, including the above, the pool bam is dedup), the result seems to be not good, and the gap between BZ1 and other three enlarger a lot image

the cnvkit docs said it should use dedup.bam https://cnvkit.readthedocs.io/en/stable/pipeline.html?highlight=bam#bam-file-preparation BAM file preparation For best results, use an aligner such as BWA-MEM, with the option to mark secondary mappings of reads, and flag PCR duplicates with a program such as SAMBLASTER, SAMBAMBA, or the MarkDuplicates script in Picard tools, so that CNVkit will skip these reads when calculating read depth.

image

etal commented 3 years ago

That note in the docs probably needs to include some additional caveats now -- the advice is basically to use the same BAM for CNV calling as you do for SNV calling, only because that was the case with the hybrid capture samples I originally used to validate CNVkit. The guidance is definitely incorrect for TAS, and for your samples, empirically, it looks like marking duplicates gives worse results.

worker000000 commented 3 years ago

@etal Thanks a lot for your advice, my sample is hybrid sample, not multiplex pcr sample, and according to your docs, it should be dedup, and i use dedup bam to call snv use vardict

the result is not close to the standard sample, so it bothers me a lot

I have three panels,

smapanel.bed 0.1M
midpanel.bed 0.5M bigpanel.bed 2.2M

and one standard DNA of MET 10 copy, all the blood samle does not umi, all the standard sample has umi

for smapanel.bed sample, I sequenced 20 normal sample once(all is the same time), and as you can see, I make different hybird of input 50 100 150 200 ng for this sample, generated 4 samples, the result shows as above, the cns result is 16-20 copy of MET, but the call.cns result varies much

because of money, we do not sequence midpanel.bed and bigpanel.bed this time in a same time, but collected the past. for midpanel.bed sample. I collected some past blood sample of this sample, about 70 samples, and dedup normal bam and create the pon, I also make a standard MET sample of this(this time we do not have a gradient, this libaray also have UMI), I does not extract umi, and do align and gatk markdup, and $cnvkit_py batch BZ_middle.sort_dedup.bam -d . -r midpanel.bed__reference.cnn --rscript-path R3.6.3/bin/Rscript --drop-low-coverage

result is as follows, also about 2 times of 10 copy image

and the bigpanel I used about 300 blood samples for pon, result is as follows, also about 2 times of 10 copy image

I also tried to use cnvkit.py metrics .cnr -s .cns to select the more suitable ones, but I found some with much higher segments, some with higher std, is there a Practical methods can be used to screen samples

image

worker000000 commented 3 years ago

@tetedange13 this is the haar method, thanks a lot

{haar – a pure-Python implementation of HaarSeg, a wavelet-based method. Very fast and performs reasonably well on small panels, but tends to over-segment large datasets. }

there is not a definition of small panels

image

worker000000 commented 3 years ago

@etal @tetedange13 it seems that the result of cnv divide 1.79 will be close to the 10 copy except the BZ1 this is a summary result of my whole test(all bam dedup, tumor bam does not extract umi from fastq, and do dedup also) image

worker000000 commented 3 years ago

any future comment about this? thanks in advance

worker000000 commented 3 years ago

@etal @tetedange13 any futher comment about this? thanks in advance

etal commented 3 years ago
worker000000 commented 3 years ago

Thanks a lot @etal

  • for the second point you said, this time it is a 10 copy standard sample, but if it is a 3 copy standard sample,then cnvkit will report it as 6 copy, or it is a 2.5 copy standard sample,then cnvkit will report it as 5 copy, so what should i do?
worker000000 commented 3 years ago

@etal @tetedange13 any futher comment about this? thanks in advance