etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
547 stars 165 forks source link

diagram parameter #629

Closed lmanchon closed 3 years ago

lmanchon commented 3 years ago

--Hi all,

is there an option to plot diagram without gene labels ? i have too much gene name on my chromo.

thank you --

tetedange13 commented 3 years ago

Hi @lmanchon ,

Not an author of CNVkit, but AFAIK, it is not possible to disable them completely => Maybe try to reduce the number of called CNV, as explained here in CNVkit's documentation ?

Hope this helps. Have a nice day. Felix.

tskir commented 3 years ago

Hi @lmanchon!

The suggestion by @tetedange13 indeed makes a lot of sense. If you're having so many gene labels that they're interfering with the diagram, changes are that something went wrong in the process of calling the CNVs. Would you mind sharing the diagram so that we could see what could be the reason?

However, if you're certain that the calling parameters are correct and there is a valid reason for there to be so many CNVs (e.g. it's a highly mixed up tumour genome), I suppose we could add a flag to disable gene names, it shouldn't be too hard.

In the meanwhile, as a workaround, you can simply edit CNR/CNS file and to fill the gene column with dashes (-)

lmanchon commented 3 years ago

see below the plot generated with those commands: cnvkit.py batch SRR6823378_recalibrated.bam -n -m amplicon --segment-method hmm-germline -f Homo_sapiens_hg38.fasta --annotate refFlat.txt --short-names -t exome.bed -d OUTPUT_CNV -p 0 --drop-low-coverage --target-avg-size 300 cnvkit.py scatter -s OUTPUT_CNV/SRR6823378_recalibrated.cn{s,r} --y-max 5 --y-min -5 --title SRR6823378 -o OUTPUT_CNV/SRR6823378_CNV_scatter-log2.pdf cnvkit.py diagram -s OUTPUT_CNV/SRR6823378_recalibrated.cn{s,r} -t 1 -o OUTPUT_CNV/SRR6823378_CNV_diagram.pdf

diagram scatter

tetedange13 commented 3 years ago

Hi @lmanchon ,

You have a WES germline dataset, so no particular reason for you to have many CNV I guess? (I mean regarding @tskir remark) => However are you really on an "amplicon" method? If yes, have you try to stop marking duplicate reads, as you were suggested in #628 ?

You also are setting a lot of custom values for parameters. Any particular reason for that? Have you tested different things? => For example Germline Analysis part of CNVkit documentation says using --drop-low-coverage can be risky in your case => And "hmm*" segmenting methods usually produce more small segments (to detect more focal CNV), so maybe it comes from there" ?

Hope this helps. Have nice day. Felix.

lmanchon commented 3 years ago

--Hello @tetedange13,

as you suggested i'm going to try using another parameters, also i use this bed file: https://www.twistbioscience.com/sites/default/files/resources/2019-06/Twist_Exome_Target_hg38.bed

what do you mean by "amplicon" method ? is WES analysis an amplicon method or not ?

thank you so much--

tetedange13 commented 3 years ago

To sum up, except for WGS any NGS method requires an enrichment step, to get regions of interest (eg: panel of genes, whole exome etc) => To do so several methods exist, with their own properties, such as:

CNVkit can handle all 3 methods but was primarily designed for hybrid-capture data (see this part of CNVkit's documentation for detailed behaviour depending on the method) => Also I see you are working on this public dataset right? In metadata I saw it is obtained with "Agilent SureSelect XT Human All Exon Kit V5" and not Twist Exome... => So you should rather use one of these BED files, supposing your BAM were made against hg38 ?

Anyway your biggest problem is setting -m amplicon when you have sequencing data made with hybrid-capture method (both Twist's and Agilent's exomes are) => Simply remove this (default method is -m hybrid(-capture)), stick with default parameters to start and see if this is better

Best, Felix.

lmanchon commented 3 years ago

--Hi,

i have tried again with the commands below (with or without mark duplicates), i obtain the attached results(see pictures).

cnvkit.py batch SRR6823378.bam -n -m hybrid -f Homo_sapiens_hg38.fasta -t Exome-Agilent_V5.bed -d OUTPUT_CNV -p 0 cnvkit.py scatter -s OUTPUT_CNV/SRR6823378.cn{s,r} --y-max 5 --y-min -5 --title SRR6823378 -o OUTPUT_CNV/SRR6823378_CNV_scatter-log2.pdf cnvkit.py diagram -s OUTPUT_CNV/SRR6823378.cn{s,r} -o OUTPUT_CNV/SRR6823378_CNV_diagram.pdf

i have less points on the scatter but too much labels on diagram.

scatter diagram

tetedange13 commented 3 years ago

Hi @lmanchon ,

Looks way better to me, compared to your previously shared scatter plot ! => But someone else might confirm ?

Regarding your diagram:

Anyway if you are still having a lot of CNV, a quick workaround could be to simply change to diagram -s OUTPUT_CNV/SRR6823378.call.cns (this one has .call. in its name) => Complete new command: cnvkit.py diagram -s OUTPUT_CNV/SRR6823378.call.cns -t 1 -o OUTPUT_CNV/SRR6823378_CNV_diagram.pdf OUTPUT_CNV/SRR6823378.cnr => This ".call.cns" file should be present in your output directory if you are running latest CNVkit version (which I advice you to) and has been post-processed to remove some likely false-positive CNV (see here for details) => And as said before, you should find details in CNVkit's documentation about this topic

Also a flag to disable gene_name labels should be available anytime soon

Kind regards. Felix.

lmanchon commented 3 years ago

cnvkit.py version 0.9.9

ok thank you, i'm investigating some change. btw another parameter would be usefull: possibility to plot diagram for only one chromosome (as scatter option) In attachment the result file from SRR6823378 analysis. SRR6823378.zip

tskir commented 3 years ago

Reopening the issue until #634 is merged