Open amora197 opened 3 years ago
Well... let's just say this wasn't a use case that was ever really considered :D
Thank you for reporting this, I've submitted #603 which should fix this problem, hopefully even without breaking anything.
Hi Kirill Tsukanov,
Thank you for the fast response and quick fix!
I went ahead and did the quick change in the skgenome/rangelabel.py
for handling NCBI-style chromosome naming as seen in your pull-request. I then re-ran the commands for plotting the whole chromosome and a region of the chromosome. This time the command for zooming into a region ran through and gave me an output, but although the command line can now handle NCBI-style of naming, the output plot is still off. I also noticed that the zoomed-in-region command can output a pdf but not a png. Here I have attached the commands and output pdf's.
Command:
cnvkit.py scatter 1-C7.cnr \
-s 1-C7.cns \
-c NC_039902.1 \
-o 1-C7-NC_039902.1.no_zoom.pdf
Output: catuai-vs-1-C7-NC_039902.1.no_zoom.pdf
Command:
cnvkit.py scatter 1-C7.cnr \
-s 1-C7.cns \
-c NC_039902.1:10000000-15000000 \
-o 1-C7-NC_039902.1.zoom.pdf
Output: catuai-vs-1-C7-NC_039902.1.zoom.pdf
Is there an edit that can correctly output the zoomed-in region? We are very close to getting the right output!
Thanks ahead of time, -Anibal
@amora197 Hmm, that's interesting, thanks for letting me know. I didn't have any plant test data at hand, so I just ran my tests using human data with one of the chromosomes renamed to NC_*
notation. But apparently there's more to it.
Could you share your .cnr
and .cns
files please so that I can reproduce this? If it can't be shared publicly please send directly to ktsukanov@ebi.ac.uk
Note to self: files received
PR is merged, but reopening for additional remaining investigations
@amora197 Thank you for waiting; I was now able to take a look into this.
It loooks like the gene
column in your CNR file, instead of a gene name, contains an extended set of what looks like GFF3 key-value annotations:
chromosome start end gene depth log2 weight
NC_008535.1 1607 1642 ID=exon-CoarCt002-1;Parent=rna-CoarCt002;Dbxref=GeneID:4421840;gbkey=tRNA;gene=trnK-UUU;locus_tag=CoarCt002;product=tRNA-Lys 7 -0.458055 0.518118
This makes cnvkit.py segment
interpret each line as a separate gene with a really long name. Assuming that the gene name in the annotation above is trnK-UUU
, the CNR file should instead look like this:
chromosome start end gene depth log2 weight
NC_008535.1 1607 1642 trnK-UUU 7 -0.458055 0.518118
In turn, the gene names in your CNR file look like this almost certainly because something went wrong during the target construction (cnvkit.py target
step). Possibility 1 is that the original BED file you used had gene names like this; possibility 2 is that the gene names ingested from the --annotate
flag were in this format and went on to supersede the ones in the BED file.
So, in light of this, could you please elaborate how your target files were constructed? What were the commands/flags used and how does the original (unprocessed) target BED look like?
@tskir Thank you for your reply and for the wait. We took into account your suggestions and tried some troubleshooting.
We have indeed used a GFF3 file to build our bed file (general case for us). The GFF3 file we used was downloaded using the following NCBI websites:
We had previously refrained from editing the gene
column. We want to keep as much information as possible for downstream analysis and, more importantly, no individual tag in the column is unique. After your suggestions, we decided to reduce to the ID=
field in the gene
column since it was at least present in each row. The resulting plot was the following:
We realized that there was non-uniqueness in the ID=
field under the gene
column, which we figured might cause the glitch. We hence created a unique entry gene
name for each row by simply assigning a number. The resulting plot was the following:
We realized that the gene
name vertical buildup might be due to overlapping regions, so we merged overlaps with bedtools merge
on our bed file. Running bedtools merge
discards the gene
column, so we once again gave each row a unique gene
numerical ID. The resulting plot was the following:
In summary, we were able to zoom into a region of interest after preparing our bed file like so:
gene
column per row entryMore general, though: Can you please detail how the ideal bed file for running CNVkit needs to look like? More specifically, could we please get advised on the following:
gene
name column entries need to be unique? What is the max allowed length of the string?Finally we have a feature request: Our organisms are triploid. Do you see a scope for an optional ploidy parameter in CNVkit, so that we detect 2 vs 3 (diploid vs triploid) rather than your default 1 vs 2 (diploid vs haploid), a respective command-line flag would be nice.
Thank you again for your assistance. We look forward to your reply.
Thanks for the details, @amora197 .
CNVkit can read GFF directly; you can give a GFF3/GTF/GFF2 file as input in most places where BED works. You can use the bundled script skg_convert.py
to just convert tabular files from one supported format to another:
The GFF reader checks for these tags in order: Name
, gene_id
, gene_name
, gene
. It will take the value of the first match and use it as the "gene" for the rest of CNVkit's purposes.
Gene names do not need to be unique. Apparently the gene name is allowed to be pretty long, as you've seen; CNVkit does not enforce any limit, but the plots start to look weird when the strings are very long.
Overlapping regions are allowed. There is no minimum region size; 0 or smaller will tend to get dropped automatically within the CNVkit pipeline.
When you give a target BED as input to the batch
or target
command, long regions will automatically be split into smaller windows. It's possible to skip this with different command line options. Details here:
https://cnvkit.readthedocs.io/en/stable/pipeline.html#target
CNVkit will work for non-diploid genomes already. The call
command has an option --ploidy
which defaults to 2, but you can use 3 or another integer here instead.
Hello,
I am trying to plot a region of a plant chromosome but to no avail. I first plotted successfully the whole chromosome
NC_039902.1
of my sample1-C7
using the following command:When I want to zoom into a specific region of the chromosome, I get the following error:
The command I am using for plotting the region wanted is the following:
It seems that when I provide a region in
-c NC_039902.1:10000000-15000000
, the CNVkit raises an error due to the period or point.
in the chromosome nameNC_039902.1
. The error is not raised when I delete the chromosome name suffix.1
, but obviously the output figure is empty since the specific chromosome nameNC_039902
does not exist in my bed/fasta/bam files.I am using the original NCBI chromosome names and wish to keep them unchanged. Is there a way around this issue for the cnvkit.py scatter command to tolerate punctuation marks in the chromosome names? Some of the chromosome names I am using are: NC_039898.1 NC_039899.1 NC_039900.1 NC_039901.1 NC_039902.1 etc. And not: chr1 chr2 chr3 etc.
Thanks ahead of time!
-Anibal