Open lbeltrame opened 2 months ago
Wow, what is going on there? Something is not right, you shouldn't be getting anywhere near that many rows. How big is your cnv slot of qseaSet2? What window size are you using?
I'm currently at s conference but as soon as I get back (Monday) I'll let you know. FTR these are normal samples.
str(qseaSet)
is likely to be informative, if you can share that when you can.
For the window size, I used the default of makeQset
, so that means 1Mbp, which is reasonable to me.
As for the size of the cnv
slot:
batch3@cnv
GRanges object with 774656 ranges and 4 metadata columns:
seqnames ranges strand | sampleE sampleF sampleG sampleH
<Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> <numeric>
[1] chr1 1-1000000 * | 0.09 0.05 0 0.05
[2] chr1 1-1000000 * | 0.09 0.05 0 0.05
[3] chr1 1-1000000 * | 0.09 0.05 0 0.05
[4] chr1 1-1000000 * | 0.09 0.05 0 0.05
[5] chr1 1-1000000 * | 0.09 0.05 0 0.05
... ... ... ... . ... ... ... ...
[774652] chrX 155000001-156000000 * | -0.05 -0.15 0 0
[774653] chrX 155000001-156000000 * | -0.05 -0.15 0 0
[774654] chrX 155000001-156000000 * | -0.05 -0.15 0 0
[774655] chrX 155000001-156000000 * | -0.05 -0.15 0 0
[774656] chrX 155000001-156000000 * | -0.05 -0.15 0 0
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
The structure of batch3
is as follows (this is one of the 4 objects I want to merge):
Formal class 'qseaSet' [package "qsea"] with 8 slots
..@ sampleTable :'data.frame': 4 obs. of 5 variables:
.. ..$ sample_name : chr [1:4] "sampleE" "sampleF" "sampleG" "sampleH"
.. ..$ group : chr [1:4] "PoN" "PoN" "PoN" "PoN"
.. ..$ file_name : chr [1:4] "/home/incalci/shared/runs/methylation_capture/alignment_pon_2nd/markduplicates/sampleE_markdup.chronly.bam" "/home/incalci/shared/runs/methylation_capture/alignment_pon_2nd/markduplicates/sampleF_markdup.chronly.50M.bam" "/home/incalci/shared/runs/methylation_capture/alignment_pon_2nd/markduplicates/sampleG_markdup.chronly.bam" "/home/incalci/shared/runs/methylation_capture/alignment_pon_2nd/markduplicates/sampleH_markdup.chronly.bam"
.. ..$ input_file : chr [1:4] "/home/incalci/shared/runs/methylation_capture/alignment_input/markduplicates/sampleE_markdup.chronly.5M.bam" "/home/incalci/shared/runs/methylation_capture/alignment_input/markduplicates/sampleF_markdup.chronly.5M.bam" "/home/incalci/shared/runs/methylation_capture/alignment_input/markduplicates/sampleG_markdup.chronly.5M.bam" "/home/incalci/shared/runs/methylation_capture/alignment_input/markduplicates/sampleH_markdup.chronly.5M.bam"
.. ..$ hyperStableEdgar: num [1:4] 0.542 0.538 0.323 0.582
..@ count_matrix: int [1:10103462, 1:4] 0 0 0 0 0 0 0 0 0 0 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:4] "sampleE" "sampleF" "sampleG" "sampleH"
..@ zygosity : num [1:4, 1:23] 2 2 2 2 2 2 2 2 2 2 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:4] "sampleE" "sampleF" "sampleG" "sampleH"
.. .. ..$ : chr [1:23] "chr1" "chr2" "chr3" "chr4" ...
..@ regions :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
.. .. ..@ seqnames :Formal class 'Rle' [package "S4Vectors"] with 4 slots
.. .. .. .. ..@ values : Factor w/ 23 levels "chr1","chr2",..: 1 2 3 4 5 6 7 8 9 10 ...
.. .. .. .. ..@ lengths : int [1:23] 829854 807311 660985 634048 605127 569353 531153 483795 461315 445991 ...
.. .. .. .. ..@ elementMetadata: NULL
.. .. .. .. ..@ metadata : list()
.. .. ..@ ranges :Formal class 'IRanges' [package "IRanges"] with 6 slots
.. .. .. .. ..@ start : int [1:10103462] 1 301 601 901 1201 1501 1801 2101 2401 2701 ...
.. .. .. .. ..@ width : int [1:10103462] 300 300 300 300 300 300 300 300 300 300 ...
.. .. .. .. ..@ NAMES : NULL
.. .. .. .. ..@ elementType : chr "ANY"
.. .. .. .. ..@ elementMetadata: NULL
.. .. .. .. ..@ metadata : list()
.. .. ..@ strand :Formal class 'Rle' [package "S4Vectors"] with 4 slots
.. .. .. .. ..@ values : Factor w/ 3 levels "+","-","*": 3
.. .. .. .. ..@ lengths : int 10103462
.. .. .. .. ..@ elementMetadata: NULL
.. .. .. .. ..@ metadata : list()
.. .. ..@ seqinfo :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
.. .. .. .. ..@ seqnames : chr [1:23] "chr1" "chr2" "chr3" "chr4" ...
.. .. .. .. ..@ seqlengths : int [1:23] 248956422 242193529 198295559 190214555 181538259 170805979 159345973 145138636 138394717 133797422 ...
.. .. .. .. ..@ is_circular: logi [1:23] NA NA NA NA NA NA ...
.. .. .. .. ..@ genome : chr [1:23] "hg38" "hg38" "hg38" "hg38" ...
.. .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
.. .. .. .. ..@ rownames : NULL
.. .. .. .. ..@ nrows : int 10103462
.. .. .. .. ..@ elementType : chr "ANY"
.. .. .. .. ..@ elementMetadata: NULL
.. .. .. .. ..@ metadata : list()
.. .. .. .. ..@ listData :List of 1
.. .. .. .. .. ..$ CpG_density: num [1:10103462] NA NA NA NA NA NA NA NA NA NA ...
.. .. ..@ elementType : chr "ANY"
.. .. ..@ metadata : list()
..@ cnv :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
.. .. ..@ seqnames :Formal class 'Rle' [package "S4Vectors"] with 4 slots
.. .. .. .. ..@ values : Factor w/ 23 levels "chr1","chr2",..: 1 2 3 4 5 6 7 8 9 10 ...
.. .. .. .. ..@ lengths : int [1:23] 63488 61952 50688 48640 46336 43520 40960 37120 35328 34048 ...
.. .. .. .. ..@ elementMetadata: NULL
.. .. .. .. ..@ metadata : list()
.. .. ..@ ranges :Formal class 'IRanges' [package "IRanges"] with 6 slots
.. .. .. .. ..@ start : int [1:774656] 1 1 1 1 1 1 1 1 1 1 ...
.. .. .. .. ..@ width : int [1:774656] 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 ...
.. .. .. .. ..@ NAMES : NULL
.. .. .. .. ..@ elementType : chr "ANY"
.. .. .. .. ..@ elementMetadata: NULL
.. .. .. .. ..@ metadata : list()
.. .. ..@ strand :Formal class 'Rle' [package "S4Vectors"] with 4 slots
.. .. .. .. ..@ values : Factor w/ 3 levels "+","-","*": 3
.. .. .. .. ..@ lengths : int 774656
.. .. .. .. ..@ elementMetadata: NULL
.. .. .. .. ..@ metadata : list()
.. .. ..@ seqinfo :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
.. .. .. .. ..@ seqnames : chr [1:23] "chr1" "chr2" "chr3" "chr4" ...
.. .. .. .. ..@ seqlengths : int [1:23] NA NA NA NA NA NA NA NA NA NA ...
.. .. .. .. ..@ is_circular: logi [1:23] NA NA NA NA NA NA ...
.. .. .. .. ..@ genome : chr [1:23] NA NA NA NA ...
.. .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
.. .. .. .. ..@ rownames : NULL
.. .. .. .. ..@ nrows : int 774656
.. .. .. .. ..@ elementType : chr "ANY"
.. .. .. .. ..@ elementMetadata: NULL
.. .. .. .. ..@ metadata : list()
.. .. .. .. ..@ listData :List of 4
.. .. .. .. .. ..$ sampleE: num [1:774656] 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 ...
.. .. .. .. .. ..$ sampleF: num [1:774656] 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ...
.. .. .. .. .. ..$ sampleG: num [1:774656] 0 0 0 0 0 0 0 0 0 0 ...
.. .. .. .. .. ..$ sampleH: num [1:774656] 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ...
.. .. ..@ elementType : chr "ANY"
.. .. ..@ metadata : list()
..@ parameters :List of 13
.. ..$ window_size : num 300
.. ..$ BSgenome : chr "BSgenome.Hsapiens.UCSC.hg38"
.. ..$ minMapQual : num 10
.. ..$ minReferenceLength: num 30
.. ..$ maxInsertSize : num 1000
.. ..$ minInsertSize : num 70
.. ..$ properPairsOnly : logi FALSE
.. ..$ fragmentLength : num 167
.. ..$ fragmentSD : num 38
.. ..$ enrichmentPattern : chr "CpG"
.. ..$ enrichmentMethod : chr "blind1-15"
.. ..$ coverageMethod : chr "PairedAndR1s"
.. ..$ cnvMethod : chr "HMMdefault"
..@ libraries :List of 3
.. ..$ file_name :'data.frame': 4 obs. of 17 variables:
.. .. ..$ qsea_initial_reads : int [1:4] 63125797 49442548 27689448 74713226
.. .. ..$ qsea_initial_pairs : int [1:4] 6364110 5374423 2964141 7938650
.. .. ..$ qsea_initial_r1s : int [1:4] 577012 482749 263185 792627
.. .. ..$ qsea_mapq_filtered_pairs : int [1:4] 4201875 3574098 2078120 5376092
.. .. ..$ qsea_size_filtered_pairs : int [1:4] 4178366 3555221 2067588 5349721
.. .. ..$ qsea_filtered_r1s : int [1:4] 226613 212464 109637 366389
.. .. ..$ total_fragments : int [1:4] 4404979 3767685 2177225 5716110
.. .. ..$ valid_fragments : int [1:4] 4386971 3746208 2169445 5695692
.. .. ..$ library_factor : num [1:4] 1 1 1 1
.. .. ..$ offset : num [1:4] 0.0353 0.038 0.0433 0.0612
.. .. ..$ fragment_median : num [1:4] 199 196 292 195
.. .. ..$ fragment_mean : num [1:4] 272 266 310 267
.. .. ..$ fragment_sd : num [1:4] 139 135 158 138
.. .. ..$ relH : num [1:4] 5.04 5.09 4.7 4.73
.. .. ..$ GoGe : num [1:4] 2.47 2.47 2.38 2.4
.. .. ..$ fragments_without_pattern: int [1:4] 135410 131259 70486 289110
.. .. ..$ prop_without_pattern : num [1:4] 0.031 0.035 0.032 0.051
.. ..$ input_file :'data.frame': 4 obs. of 17 variables:
.. .. ..$ qsea_initial_reads : int [1:4] 5312141 5217320 5810327 5127925
.. .. ..$ qsea_initial_pairs : int [1:4] 1964327 2021534 1770899 2035148
.. .. ..$ qsea_initial_r1s : int [1:4] 315259 308108 257606 337601
.. .. ..$ qsea_mapq_filtered_pairs : int [1:4] 1868256 1912760 1681805 1930196
.. .. ..$ qsea_size_filtered_pairs : int [1:4] 1853202 1898174 1642107 1924345
.. .. ..$ qsea_filtered_r1s : int [1:4] 281056 273300 229709 300566
.. .. ..$ total_fragments : int [1:4] 2134258 2171474 1871816 2224911
.. .. ..$ valid_fragments : int [1:4] 2127659 2158156 1866038 2217647
.. .. ..$ library_factor : logi [1:4] NA NA NA NA
.. .. ..$ offset : logi [1:4] NA NA NA NA
.. .. ..$ fragment_median : num [1:4] 173 171 174 171
.. .. ..$ fragment_mean : num [1:4] 173 170 178 170
.. .. ..$ fragment_sd : num [1:4] 28.7 27.7 39.7 26
.. .. ..$ relH : num [1:4] 0.935 0.94 0.936 0.94
.. .. ..$ GoGe : num [1:4] 0.882 0.885 0.884 0.885
.. .. ..$ fragments_without_pattern: int [1:4] 735158 758767 637086 772484
.. .. ..$ prop_without_pattern : num [1:4] 0.344 0.349 0.34 0.347
.. ..$ input_counts:Classes ‘data.table’ and 'data.frame': 12084 obs. of 11 variables:
.. .. ..$ chr : Factor w/ 23 levels "chr1","chr2",..: 1 1 1 1 1 1 1 1 1 1 ...
.. .. ..$ start : int [1:12084] 1 1 1 1 1000001 1000001 1000001 1000001 2000001 2000001 ...
.. .. ..$ end : int [1:12084] 1000000 1000000 1000000 1000000 2000000 2000000 2000000 2000000 3000000 3000000 ...
.. .. ..$ width : int [1:12084] 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 ...
.. .. ..$ strand : Factor w/ 3 levels "+","-","*": 3 3 3 3 3 3 3 3 3 3 ...
.. .. ..$ sampleE: int [1:12084] 233 233 233 233 562 562 562 562 587 587 ...
.. .. ..$ sampleF: int [1:12084] 231 231 231 231 637 637 637 637 685 685 ...
.. .. ..$ sampleG: int [1:12084] 159 159 159 159 506 506 506 506 527 527 ...
.. .. ..$ sampleH: int [1:12084] 226 226 226 226 584 584 584 584 672 672 ...
.. .. ..$ gc : num [1:12084] -1 -1 -1 -1 0.605 ...
.. .. ..$ map : num [1:12084] 0.0951 0.3753 0.0951 0.3753 0.9309 ...
.. .. ..- attr(*, ".internal.selfref")=<externalptr>
..@ enrichment :List of 5
.. ..$ parameters : num [1:4, 1:3] 3.84 3.91 3.86 3.79 27.53 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:4] "sampleE" "sampleF" "sampleG" "sampleH"
.. .. .. ..$ : chr [1:3] "a" "b" "c"
.. ..$ factors : num [1:80, 1:4] NA -0.0467 -0.0361 -0.0112 0.035 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : chr [1:4] "sampleE" "sampleF" "sampleG" "sampleH"
.. ..$ density : num [1:80] NA 1.21 1.71 2.22 2.73 ...
.. ..$ n : num [1:80] 0 1636171 1091653 730933 511531 ...
.. ..$ pattern_name: chr "CpG"
I noticed that there are loads of duplicated ranges in the CNV slot, is this normal?
unique(batch3@cnv)
GRanges object with 3021 ranges and 4 metadata columns:
seqnames ranges strand | sampleE sampleF sampleG sampleH
<Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> <numeric>
[1] chr1 1-1000000 * | 0.09 0.05 0 0.05
[2] chr1 1000001-2000000 * | 0.09 0.05 0 0.05
[3] chr1 2000001-3000000 * | 0.09 0.05 0 0.05
[4] chr1 3000001-4000000 * | 0.09 0.05 0 0.05
[5] chr1 4000001-5000000 * | 0.09 0.05 0 0.05
... ... ... ... . ... ... ... ...
[3017] chrX 151000001-152000000 * | -0.05 -0.15 0 0
[3018] chrX 152000001-153000000 * | -0.05 -0.15 0 0
[3019] chrX 153000001-154000000 * | -0.05 -0.15 0 0
[3020] chrX 154000001-155000000 * | -0.05 -0.15 0 0
[3021] chrX 155000001-156000000 * | -0.05 -0.15 0 0
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
EDIT: If I remove the duplicated CNV regions, I can merge just fine:
r$> batch3@cnv = unique(batch3@cnv)
r$> batch2 = readRDS("PoN_batch_5-8.rds")
r$> batch2@cnv = unique(batch2@cnv)
r$> combineQsets(batch2, batch3)
Joining with `by = join_by(seqnames, start, end, width, strand, CpG_density)`
qseaSet
=======================================
8 Samples:
sampleA, sampleB, sampleC, sampleD, sampleE, sampleF, sampleG, sampleH
10103462 Regions in 23 chromosomes of BSgenome.Hsapiens.UCSC.hg38
I noticed that there are loads of duplicated ranges in the CNV slot, is this normal?
Nope, that is exactly what I suspected, it shouldn't have any duplicated regions there. So we need to work out where they are coming from upstream.
Where can I debug this? (In the mean time, I'll deduplicate the cnv slot to progress further)
Where can I debug this? (In the mean time, I'll deduplicate the cnv slot to progress further)
Can you share with me your command that you used for makeQset
please.
And what branch you are using I guess? If you are using a modified one with multiple of the fixes can you share that somewhere? @Steven-M-Hill is supposedly going to be looking at the PRs this week....
Currently I run the vignette
branch. I didn't yet switch to use a patched one (I wanted to review every change before charging in blindly...).
I used this command to make the qSet:
qseaSet <- makeQset(table, BSgenome = "BSgenome.Hsapiens.UCSC.hg38",
windowSize = 300, fragmentType = "cfDNA", CNVmethod = "HMMdefault",
chrSelect = c(paste0("chr",1:22), "chrX"),
hmmCopyGC = xt_gc_hg38_500kb,
hmmCopyMap = xt_map_hg38)
The xt_*
GC and mappability are simply the same objects as in mesa, but in UCSC notation. table
includes samples and their corresponding input.
And your xt_gc_hg38_500kb
and xt_map_hg38
objects, how many rows/columns are they?
xt_gc_hg38_500kb suggests it is a 500kb size, not the 1Mb default...
There are about 6K rows for each object. Using the 500kbp variant may be have been an oversight as that's the bin size I use for CNV analysis on ctDNA. To make sure this wasn't the cause I'm running an analysis with the 1Mbp variants. I'll update once done.
I added chr
fairly naively there with
gc_hg38_1000kb[, `chr` := paste0("chr", `chr`)]
.... Well, it looks there aren't duplicates now. So I guess it was an incompatible bin size: Quite a silent error though because it went through unnoticed. I assume there's no way to detect it short of checking the size of the first range? Because this can be problematic and it was completely silent.
Yeah, need to add some checks for that then.
I'll edit the code and add some checks.
If you want to use 500kb bins, there is no reason you can't, just change the CNVwindowSize
parameter.
The code attempts a
full join
here:https://github.com/cruk-mi/mesa/blob/212a791cb4da2d2d2c099c4ef53b4b71d5f1ee65/R/combineQsets.R#L193
But, when joining large
qseaSets
, the result is too much for dplyr to handle: