Open moldach opened 4 years ago
Hi @moldach,
loadCopyNumberCallsCNVkit() can load all CNVkit data (.cnr and .cns).
The .cns file probably takes so long to load because of the long list of genes annotated per region. Try running CNVkit again without annotation and then reload the .cns file.
Secondly, the .cnr file is loaded correctly. The baf column is given in the .cns file. The lrr column is named segment.value. This is a mislabelling problem that we have to fix in this function.
To sum up, you should run CNVkit again without annotation and reload the .cns file with the function. Also, remember that the lrr column is named segment.value.
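Until the labelling is fixed in the function, one possible workaround is to rename the column after loading. This is only a minimal sketch: the `cnr` object is a toy stand-in with made-up values, not real CNVkit output, and whether a given downstream function expects the name lrr is an assumption to check against its documentation.

```r
library(GenomicRanges)

# Toy stand-in for a loaded .cnr file (made-up values, for illustration only).
cnr <- GRanges("V", IRanges(start = c(1, 1001), end = c(1000, 2000)),
               segment.value = c(-0.12, 0.05))

# Rename the mislabelled metadata column from "segment.value" to "lrr".
names(mcols(cnr))[names(mcols(cnr)) == "segment.value"] <- "lrr"

names(mcols(cnr))
```

The ranges themselves are untouched; only the metadata column name changes.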
Thanks for the tip @bernatgel.
Now that it's working I have another question about labelling. From the Vignette:
library(CopyNumberPlots)
s1.calls.file <- system.file("extdata", "S1.segments.txt", package = "CopyNumberPlots", mustWork = TRUE)
s1.calls <- loadCopyNumberCalls(s1.calls.file)
kp <- plotKaryotype(chromosomes="chr1")
plotCopyNumberCalls(kp, s1.calls, r0=0, r1=0.10)
Looks like:
s1.calls
GRanges object with 13 ranges and 2 metadata columns:
seqnames ranges strand | cn loh
<Rle> <IRanges> <Rle> | <integer> <integer>
1 chr1 1-60000000 * | 1 1
2 chr1 60000001-60000999 * | 2 0
3 chr1 60001000-62990000 * | 0 1
4 chr1 62990001-62999999 * | 2 0
5 chr1 63000000-121500000 * | 1 1
.. ... ... ... . ... ...
9 chr1 189600352-220352872 * | 3 0
10 chr1 220352873-220352971 * | 2 0
11 chr1 220352972-234920000 * | 5 0
12 chr1 234920001-234999999 * | 2 0
13 chr1 235000000-249250621 * | 3 0
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
When I load my data:
s2.calls <- CopyNumberPlots::loadCopyNumberCallsCNVkit("data/470.sorted.dedupped.cns")
I see the following:
s2.calls
GRanges object with 12 ranges and 7 metadata columns:
seqnames ranges strand | gene segment.value depth probes weight
<Rle> <IRanges> <Rle> | <character> <numeric> <numeric> <integer> <numeric>
1 X 0-17718942 * | - NA 65.0631 3500 3426.3700
2 I 0-15072434 * | - NA 69.0837 2771 2808.6200
3 V 0-2349907 * | - NA 69.1714 453 454.3200
4 V 2349907-2564899 * | - -1.01244 30.6558 43 41.8325
5 V 2564899-15299400 * | - NA 67.0854 2510 2458.3900
.. ... ... ... . ... ... ... ... ...
8 II 0-15279421 * | - NA 65.2653 2899 2897.14000
9 III 0-7404355 * | - NA 69.0932 1364 1390.20000
10 III 7404355-7454351 * | - 2.06886 290.4790 10 9.72848
11 III 7454351-13783801 * | - NA 65.6806 1193 1186.87000
12 IV 0-17493829 * | - NA 69.0006 3037 3074.20000
ci_lo ci_hi
<numeric> <numeric>
1 NA NA
2 NA NA
3 NA NA
4 -1.03255 -0.993146
5 NA NA
.. ... ...
8 NA NA
9 NA NA
10 1.92209 2.14312
11 NA NA
12 NA NA
-------
seqinfo: 6 sequences from an unspecified genome; no seqlengths
What corresponds to cn and loh in s1.calls?
Trying this on my data produces an error:
s2.calls <- CopyNumberPlots::loadCopyNumberCallsCNVkit("data/470.sorted.dedupped.cns")
custom.genome <- toGRanges(data.frame(chr=c("I", "II", "III", "IV", "V", "X", "MtDNA"), start=c(1, 1, 1, 1, 1, 1, 1), end=c(15072434, 15279421, 13783801, 17493829, 20924180, 17718942, 13794)))
kp <- plotKaryotype(genome = custom.genome)
plotCopyNumberCalls(kp, s2.calls, r0=0, r1=0.10)
Error in plotCopyNumberCalls(kp, s2.calls, r0 = 0, r1 = 0.1) :
The cn.calls object does not have a column cn. No copy number data is available
Not sure if there is a lack of copy number data (if so why?) or a mislabelling issue.
Hi @moldach
The .cns file you have loaded doesn't contain the cn data.
First, keep in mind that CNVkit writes its results to several different files. The .cnr file contains the lrr information. This is where the mislabelling appears after loading this type of file with loadCopyNumberCallsCNVkit(). We are considering developing a function specifically for loading CNVkit's .cnr files.
Another important point is that CNVkit produces two different .cns files. The first is the one you get after running the segment command. It does not contain the cn information, only the segment.value information. This means that when you load this .cns file with loadCopyNumberCallsCNVkit(), the segment.value column is labelled correctly.
Finally, to obtain the cn information in your .cns file you have to run CNVkit's call command. That's why this information isn't loaded now.
So you can load all types of CNVkit data with loadCopyNumberCallsCNVkit(); you just need to know what each CNVkit file contains.
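A quick way to tell the two kinds of .cns file apart after loading is to check for the cn column before plotting. The sketch below uses hand-built one-row GRanges objects as stand-ins for the loaded files, and `has_cn()` is a hypothetical helper, not part of CopyNumberPlots.

```r
library(GenomicRanges)

# Stand-ins for a loaded segment-level .cns and a called .cns
# (one row each, hand-built for illustration).
segment.cns <- GRanges("III", IRanges(7404355, 7454351),
                       segment.value = 2.06886)
call.cns <- GRanges("III", IRanges(7404355, 7454351),
                    segment.value = 1.54065, cn = 6L)

# Hypothetical helper: TRUE if the object carries a "cn" metadata column.
has_cn <- function(gr) "cn" %in% names(mcols(gr))

has_cn(segment.cns)
has_cn(call.cns)
```

If the check fails for your object, the file most likely comes from the segmentation step rather than from the call step.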
Hi @miriammagallon.
I ran the following in cnvkit:
cnvkit.py batch N2_trim_bwaMEM_sort_dedupped.bam -n -m wgs -f /scratch/moldach/data/references/c_elegans.PRJNA13758.WS265.genomic.fa
which produced the following files:
(base) mtg@mtg-ThinkPad-P53:~/projects/data/celegans$ ll
total 7668364
drwxrwxr-x 6 mtg mtg 4096 May 20 16:33 ./
drwxrwxr-x 3 mtg mtg 4096 May 1 15:43 ../
-rw-rw-r-- 1 mtg mtg 31 May 20 16:33 470.sorted.dedupped.antitargetcoverage.cnn
-rw-r----- 1 mtg mtg 5352369367 May 19 23:47 470.sorted.dedupped.bam
-rw-rw-r-- 1 mtg mtg 307672 May 20 08:47 470.sorted.dedupped.bam.bai
-rw-rw-r-- 1 mtg mtg 54 May 20 16:33 470.sorted.dedupped.bintest.cns
-rw-rw-r-- 1 mtg mtg 595 May 20 16:33 470.sorted.dedupped.call.cns
-rw-rw-r-- 1 mtg mtg 922610 May 20 16:33 470.sorted.dedupped.cnr
-rw-rw-r-- 1 mtg mtg 609 May 20 16:33 470.sorted.dedupped.cns
-rw-rw-r-- 1 mtg mtg 741769 May 20 16:33 470.sorted.dedupped.targetcoverage.cnn
-rw-rw-r-- 1 mtg mtg 0 May 20 08:46 c_elegans.PRJEB28388.WS274.genomic.antitarget.bed
-rw-rw-r-- 1 mtg mtg 136 May 20 08:46 c_elegans.PRJEB28388.WS274.genomic.bed
-rw-rw-r-- 1 mtg mtg 618515 May 20 08:46 c_elegans.PRJEB28388.WS274.genomic.target.bed
-rw-rw-r-- 1 mtg mtg 0 May 20 16:30 c_elegans.PRJNA13758.WS265.genomic.antitarget.bed
-rw-rw-r-- 1 mtg mtg 96 May 20 16:30 c_elegans.PRJNA13758.WS265.genomic.bed
-rw-rw-r-- 1 mtg mtg 101957874 May 20 08:55 c_elegans.PRJNA13758.WS265.genomic.fa
-rw-rw-r-- 1 mtg mtg 181 May 20 08:57 c_elegans.PRJNA13758.WS265.genomic.fa.fai
-rw-rw-r-- 1 mtg mtg 426633 May 20 16:30 c_elegans.PRJNA13758.WS265.genomic.target.bed
drwxrwxr-x 2 mtg mtg 4096 May 19 18:58 MADDOG/
-rwxr-x--- 1 mtg mtg 2391634258 May 19 22:52 N2_trim_bwaMEM_sort_dedupped.bam*
drwxrws--- 2 mtg mtg 4096 May 1 16:43 PRJEB28388/
-rw-rw-r-- 1 mtg mtg 939345 May 20 13:46 pyenv.log
-rw-rw-r-- 1 mtg mtg 738490 May 20 16:30 reference.cnn
-rw-rw-r-- 1 mtg mtg 738490 May 20 08:57 reference.cnn.1
-rw-rw-r-- 1 mtg mtg 738490 May 20 16:25 reference.cnn.2
drwxrwxr-x 6 mtg mtg 4096 May 20 16:18 venv/
drwxrwx--- 3 mtg mtg 4096 May 1 19:47 WS265_wormbase/
Loading the .call.cns file, which contains the copy numbers, looks like this:
> s2.calls <- CopyNumberPlots::loadCopyNumberCallsCNVkit("data/470.sorted.dedupped.call.cns")
> s2.calls
GRanges object with 10 ranges and 7 metadata columns:
seqnames ranges strand | gene segment.value
<Rle> <IRanges> <Rle> | <character> <numeric>
1 X 0-17718942 * | - NA
2 I 0-15072434 * | - NA
3 V 0-2349907 * | - NA
4 V 2349907-2564899 * | - -1.54065
5 V 2564899-20924180 * | - NA
6 II 0-15279421 * | - NA
7 III 0-7404355 * | - NA
8 III 7404355-7454351 * | - 1.54065
9 III 7454351-13783801 * | - NA
10 IV 0-17493829 * | - NA
cn depth p_ttest probes weight
<integer> <numeric> <numeric> <integer> <numeric>
1 2 65.0631 5.05544e-24 3500 3426.37000
2 2 69.0837 9.63092e-10 2771 2808.62000
3 2 69.1714 9.22086e-01 453 454.32000
4 0 30.6558 2.05389e-34 43 41.83250
5 2 70.3195 1.32293e-03 3551 3506.15000
6 2 65.2653 3.02552e-01 2899 2897.14000
7 2 69.0932 8.75159e-07 1364 1390.20000
8 6 290.4790 1.18285e-06 10 9.72848
9 2 65.6806 2.89004e-01 1193 1186.87000
10 2 69.0006 5.70572e-02 3037 3074.20000
-------
seqinfo: 6 sequences from an unspecified genome; no seqlengths
Okay, so I can see that this file has cn and segment.value, but [according to section 5.4 of the vignette](https://bioconductor.org/packages/release/bioc/vignettes/CopyNumberPlots/inst/doc/CopyNumberPlots.html) you need
a GRanges object with at least one column of:
* “cn” for integer copy number calls
* “segment.value” for non-integer segment regional values
* “loh” a logical for loss-of-heterozygosity
Where is loh found? If it's in another file, how am I supposed to merge them? For example, the following doesn't work on GRanges:
dplyr::inner_join(s2.calls,s3.calls)
Error in UseMethod("inner_join") : no applicable method for 'inner_join' applied to an object of class "c('GRanges', 'GenomicRanges', 'GRanges_OR_NULL', 'Ranges', 'GenomicRanges_OR_missing', 'GenomicRanges_OR_GenomicRangesList', 'GenomicRanges_OR_GRangesList', 'List', 'Vector', 'list_OR_List', 'Annotated', 'vector_OR_Vector')"
Hi @moldach,
CNVkit has not generated the LOH column, but you don't need this information to use CopyNumberPlots. You can already plot your data with the CopyNumberPlots functions. If you want the LOH information, take a look at the CNVkit documentation or write to its developers.
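On the merging question: should LOH calls ever be available from some other source, a column can be copied between two GRanges objects by matching ranges instead of using a dplyr join. This is a rough sketch on toy data. It assumes both objects use identical segment coordinates, and `calls` and `loh.calls` are hypothetical names.

```r
library(GenomicRanges)

# Toy copy-number calls and toy LOH calls on the same segments, in different order.
calls <- GRanges(seqnames = c("I", "I"),
                 ranges = IRanges(start = c(1, 101), end = c(100, 200)),
                 cn = c(2L, 0L))
loh.calls <- GRanges(seqnames = c("I", "I"),
                     ranges = IRanges(start = c(101, 1), end = c(200, 100)),
                     loh = c(TRUE, FALSE))

# Pair up ranges that are exactly equal in both objects.
hits <- findOverlaps(calls, loh.calls, type = "equal")

# Copy the loh values across; segments without a match stay NA.
calls$loh <- rep(NA, length(calls))
calls$loh[queryHits(hits)] <- loh.calls$loh[subjectHits(hits)]

calls
```

If the segment boundaries differ between the two sources, exact matching finds nothing and a looser overlap rule would have to be chosen deliberately.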
I'm having a problem with loadCopyNumberCallsCNVkit(). Which output file from CNVkit is it supposed to accept?
Trying the .cns file doesn't produce an error (or output) after ~25 minutes.
Using the .cnr file loads with seqnames, ranges and strand; however, there is no lrr or baf in the metadata and it looks weird:
Any idea what's the problem?