It's my autobin result, is it normal?

rahhhm commented 3 years ago

Detected file format: bed
Detected file format: bed
Estimated read length 101.0
Wrote /tmp/tmp9g_1ifn8.bed with 100 regions
Splitting large targets
Wrote target.target.bed with 248885 regions
Wrote target.antitarget.bed with 147582 regions
            Depth   Bin size
Target:     42.422  2357
Antitarget: 0.804   124348

tskir commented 3 years ago

Hi @rahhhm, it's a bit hard to say without knowing more about your data and workflow, but in general, yes, these are sensible results for the auto binning. Is there something that makes you suspect they might not be correct?
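
For context, the numbers in the log above are consistent with autobin picking each bin size as a per-bin base budget divided by the observed depth. The sketch below assumes a budget of 100000 bases per bin (to my knowledge the default of the `--bp-per-bin` option). The helper name `expected_bin_size` is hypothetical, and this is not CNVkit's actual code, which, as far as I know, also clamps the result to minimum and maximum bin sizes.

```python
# Assumed default for autobin's --bp-per-bin option (sequenced bases per bin).
BP_PER_BIN = 100_000


def expected_bin_size(depth, bp_per_bin=BP_PER_BIN):
    """Hypothetical helper: bin size that collects about bp_per_bin bases at this depth."""
    return round(bp_per_bin / depth)


# Depth and bin size exactly as printed in the autobin log above.
reported = {"Target": (42.422, 2357), "Antitarget": (0.804, 124348)}

for name, (depth, bin_size) in reported.items():
    estimate = expected_bin_size(depth)
    # The log rounds depth to 3 decimals, so a small relative difference is expected.
    rel_diff = abs(estimate - bin_size) / bin_size
    print(f"{name}: reported {bin_size}, estimated {estimate}, relative diff {rel_diff:.3%}")
```

Under this reading, a target bin size in the low thousands simply reflects a moderate on-target depth (about 42x here) rather than a misconfiguration; deeper coverage would yield smaller bins.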

rahhhm commented 3 years ago

Thank you for your answer. I read in the docs that the average bin size is about 200, so I thought my result was abnormal. Can you give me any advice? I have now changed the target BED file from Exome-AZ_V2.bed (from the AZ repository) to cds.bed, parsed from UCSC. Here is the new autobin result:

Detected file format: bed
Detected file format: bed
Estimated read length 101.0
Wrote /tmp/tmpokwppctr.bed with 100 regions
Splitting large targets
Wrote my_target.target.bed with 216906 regions
Wrote my_target.antitarget.bed with 138646 regions
            Depth   Bin size
Target:     75.729  1320
Antitarget: 1.664   60083

And here is my scatter plot:

[Image: auto splot]

I tried changing the bin sizes and other options, but the result didn't change the way I wanted it to. I want to lower my log2 copy ratio threshold. How can I do that?

tetedange13 commented 3 years ago

Hi @rahhhm ,

@tskir asked you for details about your dataset and your workflow => Meaning we need to know:

Wetlab technique used to obtain your data (amplicon, hybrid-capture, WGS ...)
Your type of target (WES, WGS, panel ...)
If you are using a public dataset, tell us which one
Whether your case is germline, tumor-only, matched tumor-normal, tumor versus pool of normals, etc.
CNVkit commands you ran, with every parameter (+ CNVkit version if not latest)

Looking at your scatter plot, I suspect the --seq-method parameter does not match your actual wetlab technique.

Hope this helps. Kind regards. Felix.

rahhhm commented 3 years ago

- hybrid capture technique (SureSelect V6)
- WES
- not public data
- tumor-only
- commands:

cnvkit.py target cds.bed --annotate refFlat.txt --split --short-names -o my_targets.bed
cnvkit.py antitarget my_target.bed -g ../reference/access.bed -o my_antitarget.bed
cnvkit.py coverage ../bam/Flagged.aln.bam my_target.bed -o my.targetcoverage.cnn -p 8
cnvkit.py coverage ../bam/Flagged.aln.bam my_antitarget.bed -o my.antitargetcoverage.cnn -p 8
cnvkit.py reference -o my.FlatReference.cnn -f ../reference/hg38.fa -t my_target.bed -a my_antitarget.bed
cnvkit.py segment Sample.cnr -o Sample.cns

I tried bin sizes from 267 to 1500, but I can't get a plot that I want. Thank you.

tetedange13 commented 3 years ago

Hi @rahhhm ,

Before going any further, I want to address what you said:

but I can't get a plot that I want

What do you mean? What are you expecting? Are you working on a controlled tumor sample, with an expected CNV in a specific gene (and validated by another wetlab technique)? Or is it the plot that looks weird to you? If yes, why?
=> Sometimes tumor DNA is noisy in itself, and WES capture can add variability too
=> Supposing your calling pipeline is correct (addressed below), a "flat reference" is the hardest case for CNVkit, compared to a matched or pooled reference
=> To sum up, maybe the plot you shared is completely expected given your data?

About targets/baits BED used

As said in CNVkit documentation:

The BED file should be the baited genomic regions for your target capture kit, as provided by your vendor

So regarding your change to "cds.bed" parsed from UCSC: it is a bad idea IMO
=> If you are on the "Agilent SureSelect V6" kit, I guess you should use this BED file?
=> Also, I am not sure whether the "Exome-AZ_V2.bed" you first mentioned still matches your current version of this NGS exome kit

About your commands

You forgot to share your detailed autobin and fix commands
=> But why not give batch a try? Its default bin size performs well most of the time

Create your "flat reference" (once):
cnvkit.py batch -n -d <output_dir> -t <my_baits.bed> -f hg38.fasta -g access.bed --annotate refFlat.txt --short-names
Run the calling pipeline on (each of) your BAM against this "flat reference" (with "reference.cnn" only, CNVkit can deduce both "my_baits.target.bed" and "my_baits.antitarget.bed"):
cnvkit.py batch Flagged.aln.bam -d <output_dir> -r <output_dir>/reference.cnn -p 8
Plot your results:
cnvkit.py scatter -s <output_dir>/Flagged.aln.cns <output_dir>/Flagged.aln.cnr --y-min -5 --y-max 5
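
The three steps above can be sketched as a small Python helper that only assembles the command lines; it does not run CNVkit. The function name `build_batch_commands` and its parameters are hypothetical, the file names are the placeholders used in this comment, and the flags are the ones quoted above (the y-axis limits of scatter are assumed to be `--y-min` and `--y-max`).

```python
import os
import shlex


def build_batch_commands(bam, baits_bed, output_dir,
                         fasta="hg38.fasta", access="access.bed",
                         annotate="refFlat.txt", processes=8):
    """Hypothetical helper: return the three cnvkit.py command lines as argv lists."""
    # "Flagged.aln.bam" -> "Flagged.aln", the prefix used for the .cnr/.cns outputs.
    sample = os.path.splitext(os.path.basename(bam))[0]
    reference = f"{output_dir}/reference.cnn"

    # Step 1: build a flat reference (no normal samples given after -n).
    make_reference = ["cnvkit.py", "batch", "-n", "-d", output_dir,
                      "-t", baits_bed, "-f", fasta, "-g", access,
                      "--annotate", annotate, "--short-names"]
    # Step 2: run the pipeline on one BAM against that reference.
    call_sample = ["cnvkit.py", "batch", bam, "-d", output_dir,
                   "-r", reference, "-p", str(processes)]
    # Step 3: scatter plot of bin-level ratios (.cnr) with segments (.cns).
    plot = ["cnvkit.py", "scatter", "-s", f"{output_dir}/{sample}.cns",
            f"{output_dir}/{sample}.cnr", "--y-min", "-5", "--y-max", "5"]
    return [make_reference, call_sample, plot]


for argv in build_batch_commands("Flagged.aln.bam", "my_baits.bed", "output_dir"):
    print(shlex.join(argv))
```

Keeping the commands as argument lists means each one could later be passed to `subprocess.run(argv, check=True)` without shell quoting issues.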

I also wanted to add 2 things:

I see you do not use --drop-low-coverage at all? This can remove some noise (see here, 2nd paragraph)
Be cautious if your BAM is de-duplicated or simply duplicate-marked, as this can skew results sometimes (best approach is to compare with/without duplicates results)

Best, Felix.

rahhhm commented 3 years ago

Thank you so much, @tetedange13. I solved those problems with your advice.

etal / cnvkit

It's my autobin result, is it normal? #642
