etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

Diagram plotting output changes drastically when using the annotate option #650

Open gtollefson opened 3 years ago

gtollefson commented 3 years ago

When I run the batch command on one tumor/normal low-pass whole-genome sequencing sample pair without the --annotate option, the diagram plot looks very different from the one produced by running the same command on the same sample with --annotate and the refFlat.txt file that matches my reference genome version. I've pasted the output of the two batch runs below.

Output produced with default batch commands without the --annotate option:

[Screenshot: diagram plot without --annotate (Screen Shot 2021-07-19 at 3 46 43 PM)]

Output produced with the same command as above, but with the --annotate option pointing to the refFlat.txt file for the reference genome:

[Screenshot: diagram plot with --annotate (Screen Shot 2021-07-19 at 3 46 34 PM)]

My full batch command is:

cnvkit.py batch DNA-T2.sorted.bam --normal DNA-N2.sorted.bam \
    --fasta GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --output-reference GRCh38_full_analysis_set_plus_decoy_hla.cnn --output-dir results/ \
    --diagram --scatter --method wgs

with and without --annotate refFlat.txt

Is this expected behavior? Can you explain why the two output plots are different (aside from the gene labels)?

Thank you, George

tetedange13 commented 3 years ago

Hi @gtollefson,

I'm not an author of CNVkit, but thanks for reporting this! I could partially reproduce it on my own hybrid-capture panel data using your batch command.

Precisions

==> So IMHO this is purely a graphical artifact of the diagram

Investigations

I think this is due to several things:

  1. Without annotation, CNVkit assigns "-" as the gene name for all of the "Target" regions used
  2. But when the squashing step happens (gathering consecutive regions by gene name?), all of these unannotated "-" regions are ignored because "-" is part of params.IGNORE_GENE_NAMES
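
To make the two points above concrete, here is a minimal sketch of that kind of squashing. It is not CNVkit's actual code: the function name `squash_by_gene`, the use of a median, and the contents of the local `IGNORE_GENE_NAMES` tuple are all assumptions for illustration. It only shows how bins named "-" can vanish from the squashed result while annotated bins survive.

```python
from itertools import groupby
from statistics import median

# Assumed stand-in for cnvlib.params.IGNORE_GENE_NAMES; the real contents may differ.
IGNORE_GENE_NAMES = ("-", ".", "CGH")


def squash_by_gene(bins, ignore=IGNORE_GENE_NAMES):
    """Collapse consecutive bins that share a gene name into one entry.

    `bins` is a list of (gene_name, log2_ratio) tuples in genomic order.
    Runs whose gene name is in `ignore` are dropped entirely.
    (Hypothetical helper, not part of CNVkit.)
    """
    squashed = []
    for gene, run in groupby(bins, key=lambda b: b[0]):
        if gene in ignore:
            continue  # unannotated bins never reach the output
        squashed.append((gene, median(log2 for _, log2 in run)))
    return squashed


# The same three bins, without and with gene annotation (made-up values).
unannotated = [("-", 0.25), ("-", -1.0), ("-", -0.5)]
annotated = [("GENE_A", 0.25), ("GENE_B", -1.0), ("GENE_B", -0.5)]

print(squash_by_gene(unannotated))
print(squash_by_gene(annotated))
```

With the unannotated input nothing is left after squashing, whereas the annotated input keeps one entry per gene, so anything drawn from the squashed data would look different between the two runs.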

To sum up, the bin representations in the diagram differ depending on --annotate because:

Remaining questions

Hope this helps. Have a nice day. Felix.