unable to 'scatter' plot region of chromosome name with punctuation marks

amora197 commented 3 years ago

Hello,

I am trying to plot a region of a plant chromosome but to no avail. I first plotted successfully the whole chromosome NC_039902.1 of my sample 1-C7 using the following command:

cnvkit.py scatter 1-C7.cnr \
-s 1-C7.cns \
-c NC_039902.1 \
-o 1-C7-NC_039902.1.png

When I want to zoom into a specific region of the chromosome, I get the following error:

ValueError: Invalid range spec: NC_039902.1:10000000-15000000 (should be like: chr1:2333000-2444000)

The command I am using for plotting the region wanted is the following:

cnvkit.py scatter 1-C7.cnr \
-s 1-C7.cns \
-c NC_039902.1:10000000-15000000 \
-o 1-C7-NC_039902.1-zoomed.png

It seems that when I provide a region in -c NC_039902.1:10000000-15000000, the CNVkit raises an error due to the period or point . in the chromosome name NC_039902.1. The error is not raised when I delete the chromosome name suffix .1, but obviously the output figure is empty since the specific chromosome name NC_039902 does not exist in my bed/fasta/bam files.

I am using the original NCBI chromosome names and wish to keep them unchanged. Is there a way around this issue for the cnvkit.py scatter command to tolerate punctuation marks in the chromosome names? Some of the chromosome names I am using are: NC_039898.1 NC_039899.1 NC_039900.1 NC_039901.1 NC_039902.1 etc. And not: chr1 chr2 chr3 etc.

Thanks ahead of time!

-Anibal

tskir commented 3 years ago

Well... let's just say this wasn't a use case that was ever really considered :D

Thank you for reporting this, I've submitted #603 which should fix this problem, hopefully even without breaking anything.

amora197 commented 3 years ago

Hi Kirill Tsukanov,

Thank you for the fast response and quick fix!

I went ahead and did the quick change in the skgenome/rangelabel.py for handling NCBI-style chromosome naming as seen in your pull-request. I then re-ran the commands for plotting the whole chromosome and a region of the chromosome. This time the command for zooming into a region ran through and gave me an output, but although the command line can now handle NCBI-style of naming, the output plot is still off. I also noticed that the zoomed-in-region command can output a pdf but not a png. Here I have attached the commands and output pdf's.

No zoom of chromosome NC_039902.1:

Command:

cnvkit.py scatter 1-C7.cnr \
-s 1-C7.cns \
-c NC_039902.1 \
-o 1-C7-NC_039902.1.no_zoom.pdf

Output: catuai-vs-1-C7-NC_039902.1.no_zoom.pdf

Zoom into region 10000000-15000000 of chromosome NC_039902.1:

Command:

cnvkit.py scatter 1-C7.cnr \
-s 1-C7.cns \
-c NC_039902.1:10000000-15000000 \
-o 1-C7-NC_039902.1.zoom.pdf

Output: catuai-vs-1-C7-NC_039902.1.zoom.pdf

Is there an edit that can correctly output the zoomed-in region? We are very close to getting the right output!

Thanks ahead of time, -Anibal

tskir commented 3 years ago

@amora197 Hmm, that's interesting, thanks for letting me know. I didn't have any plant test data at hand, so I just ran my tests using human data with one of the chromosomes renamed to NC_* notation. But apparently there's more to it.

Could you share your .cnr and .cns files please so that I can reproduce this? If it can't be shared publicly please send directly to ktsukanov@ebi.ac.uk

tskir commented 3 years ago

Note to self: files received

tskir commented 3 years ago

PR is merged, but reopening for additional remaining investigations

tskir commented 3 years ago

@amora197 Thank you for waiting; I was now able to take a look into this.

It loooks like the gene column in your CNR file, instead of a gene name, contains an extended set of what looks like GFF3 key-value annotations:

chromosome   start  end   gene                                                                                                                          depth  log2       weight
NC_008535.1  1607   1642  ID=exon-CoarCt002-1;Parent=rna-CoarCt002;Dbxref=GeneID:4421840;gbkey=tRNA;gene=trnK-UUU;locus_tag=CoarCt002;product=tRNA-Lys  7      -0.458055  0.518118

This makes cnvkit.py segment interpret each line as a separate gene with a really long name. Assuming that the gene name in the annotation above is trnK-UUU, the CNR file should instead look like this:

chromosome   start  end   gene      depth  log2       weight
NC_008535.1  1607   1642  trnK-UUU  7      -0.458055  0.518118

In turn, the gene names in your CNR file look like this almost certainly because something went wrong during the target construction (cnvkit.py target step). Possibility 1 is that the original BED file you used had gene names like this; possibility 2 is that the gene names ingested from the --annotate flag were in this format and went on to supersede the ones in the BED file.

So, in light of this, could you please elaborate how your target files were constructed? What were the commands/flags used and how does the original (unprocessed) target BED look like?

amora197 commented 3 years ago

@tskir Thank you for your reply and for the wait. We took into account your suggestions and tried some troubleshooting.

We have indeed used a GFF3 file to build our bed file (general case for us). The GFF3 file we used was downloaded using the following NCBI websites:

We had previously refrained from editing the gene column. We want to keep as much information as possible for downstream analysis and, more importantly, no individual tag in the column is unique. After your suggestions, we decided to reduce to the ID= field in the gene column since it was at least present in each row. The resulting plot was the following:

catuai-vs-1-C7-NC_039902 1 gene-id-only-zoom

We realized that there was non-uniqueness in the ID= field under the gene column, which we figured might cause the glitch. We hence created a unique entry gene name for each row by simply assigning a number. The resulting plot was the following:

catuai-vs-1-C7-NC_039902 1 counter-zoom

We realized that the gene name vertical buildup might be due to overlapping regions, so we merged overlaps with bedtools merge on our bed file. Running bedtools merge discards the gene column, so we once again gave each row a unique gene numerical ID. The resulting plot was the following:

catuai-vs-1-C7-NC_039902 1 bedtools-merge-count-zoom

In summary, we were able to zoom into a region of interest after preparing our bed file like so:

sort bed file
merge overlapping regions of bed file
assign unique ID under gene column per row entry

More general, though: Can you please detail how the ideal bed file for running CNVkit needs to look like? More specifically, could we please get advised on the following:

Do gene name column entries need to be unique? What is the max allowed length of the string?
Are overlapping regions allowed? (Imagine overlapping genes, on opposite strands or splice variants)
Is there a minimum region size? and on a similar note: Is CNVkit splitting large regions into smaller windows?

Finally we have a feature request: Our organisms are triploid. Do you see a scope for an optional ploidy parameter in CNVkit, so that we detect 2 vs 3 (diploid vs triploid) rather than your default 1 vs 2 (diploid vs haploid), a respective command-line flag would be nice.

Thank you again for your assistance. We look forward to your reply.

etal commented 3 years ago

Thanks for the details, @amora197 .

CNVkit can read GFF directly; you can give a GFF3/GTF/GFF2 file as input in most places where BED works. You can use the bundled script skg_convert.py to just convert tabular files from one supported format to another:

https://cnvkit.readthedocs.io/en/stable/scripts.html
All input formats: https://github.com/etal/cnvkit/blob/master/skgenome/tabio/__init__.py#L113-L134
GFF implementation: https://github.com/etal/cnvkit/blob/master/skgenome/tabio/gff.py

The GFF reader checks for these tags in order: Name, gene_id, gene_name, gene. It will take the value of the first match and use it as the "gene" for the rest of CNVkit's purposes.

Gene names do not need to be unique. Apparently the gene name is allowed to be pretty long, as you've seen; CNVkit does not enforce any limit, but the plots start to look weird when the strings are very long.

Overlapping regions are allowed. There is no minimum region size; 0 or smaller will tend to get dropped automatically within the CNVkit pipeline.

When you give a target BED as input to the batch or target command, long regions will automatically be split into smaller windows. It's possible to skip this with different command line options. Details here: https://cnvkit.readthedocs.io/en/stable/pipeline.html#target

CNVkit will work for non-diploid genomes already. The call command has an option --ploidy which defaults to 2, but you can use 3 or another integer here instead.

etal / cnvkit

unable to 'scatter' plot region of chromosome name with punctuation marks #602

No zoom of chromosome NC_039902.1:

Zoom into region 10000000-15000000 of chromosome NC_039902.1: