Adding GC and mappability with incompatible bin sizes massively duplicates CNV regions in the resulting `qseaSet`

lbeltrame commented 2 months ago

The code attempts a full join here:

https://github.com/cruk-mi/mesa/blob/212a791cb4da2d2d2c099c4ef53b4b71d5f1ee65/R/combineQsets.R#L193

But, when joining large qseaSets, the result is too much for dplyr to handle:

r$> combineQsetsList(list(batch1, batch2, batch3, batch4))
No initial qseaSet given, using first element as initial qseaSet
Joining with `by = join_by(seqnames, start, end, width, strand, CpG_density)`                                                    
Joining with `by = join_by(seqnames, start, end, width, strand, CpG_density)`                                                    
Error in `dplyr::full_join()`:
! This join would result in more rows than dplyr can handle.
ℹ 50985959424 rows would be returned. 2147483647 rows is the maximum number allowed.                                             
ℹ Double check your join keys. This error commonly occurs due to a missing join key, or an improperly specified join condition.  
Run `rlang::last_trace()` to see where the error occurred.

SPPearce commented 2 months ago

Wow, what is going on there? Something is not right, you shouldn't be getting anywhere near that many rows. How big is your cnv slot of qseaSet2? What window size are you using?

lbeltrame commented 2 months ago

I'm currently at s conference but as soon as I get back (Monday) I'll let you know. FTR these are normal samples.

SPPearce commented 1 month ago

str(qseaSet) is likely to be informative, if you can share that when you can.

lbeltrame commented 1 month ago

For the window size, I used the default of makeQset, so that means 1Mbp, which is reasonable to me.

As for the size of the cnv slot:

batch3@cnv
GRanges object with 774656 ranges and 4 metadata columns:
           seqnames              ranges strand |   sampleE   sampleF   sampleG   sampleH
              <Rle>           <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
       [1]     chr1           1-1000000      * |      0.09      0.05         0      0.05
       [2]     chr1           1-1000000      * |      0.09      0.05         0      0.05
       [3]     chr1           1-1000000      * |      0.09      0.05         0      0.05
       [4]     chr1           1-1000000      * |      0.09      0.05         0      0.05
       [5]     chr1           1-1000000      * |      0.09      0.05         0      0.05
       ...      ...                 ...    ... .       ...       ...       ...       ...
  [774652]     chrX 155000001-156000000      * |     -0.05     -0.15         0         0
  [774653]     chrX 155000001-156000000      * |     -0.05     -0.15         0         0
  [774654]     chrX 155000001-156000000      * |     -0.05     -0.15         0         0
  [774655]     chrX 155000001-156000000      * |     -0.05     -0.15         0         0
  [774656]     chrX 155000001-156000000      * |     -0.05     -0.15         0         0
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

The structure of batch3 is as follows (this is one of the 4 objects I want to merge):

Formal class 'qseaSet' [package "qsea"] with 8 slots
  ..@ sampleTable :'data.frame':    4 obs. of  5 variables:
  .. ..$ sample_name     : chr [1:4] "sampleE" "sampleF" "sampleG" "sampleH"
  .. ..$ group           : chr [1:4] "PoN" "PoN" "PoN" "PoN"
  .. ..$ file_name       : chr [1:4] "/home/incalci/shared/runs/methylation_capture/alignment_pon_2nd/markduplicates/sampleE_markdup.chronly.bam" "/home/incalci/shared/runs/methylation_capture/alignment_pon_2nd/markduplicates/sampleF_markdup.chronly.50M.bam" "/home/incalci/shared/runs/methylation_capture/alignment_pon_2nd/markduplicates/sampleG_markdup.chronly.bam" "/home/incalci/shared/runs/methylation_capture/alignment_pon_2nd/markduplicates/sampleH_markdup.chronly.bam"
  .. ..$ input_file      : chr [1:4] "/home/incalci/shared/runs/methylation_capture/alignment_input/markduplicates/sampleE_markdup.chronly.5M.bam" "/home/incalci/shared/runs/methylation_capture/alignment_input/markduplicates/sampleF_markdup.chronly.5M.bam" "/home/incalci/shared/runs/methylation_capture/alignment_input/markduplicates/sampleG_markdup.chronly.5M.bam" "/home/incalci/shared/runs/methylation_capture/alignment_input/markduplicates/sampleH_markdup.chronly.5M.bam"
  .. ..$ hyperStableEdgar: num [1:4] 0.542 0.538 0.323 0.582
  ..@ count_matrix: int [1:10103462, 1:4] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr [1:4] "sampleE" "sampleF" "sampleG" "sampleH"
  ..@ zygosity    : num [1:4, 1:23] 2 2 2 2 2 2 2 2 2 2 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:4] "sampleE" "sampleF" "sampleG" "sampleH"
  .. .. ..$ : chr [1:23] "chr1" "chr2" "chr3" "chr4" ...
  ..@ regions     :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
  .. .. ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. ..@ values         : Factor w/ 23 levels "chr1","chr2",..: 1 2 3 4 5 6 7 8 9 10 ...
  .. .. .. .. ..@ lengths        : int [1:23] 829854 807311 660985 634048 605127 569353 531153 483795 461315 445991 ...
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  .. .. ..@ ranges         :Formal class 'IRanges' [package "IRanges"] with 6 slots
  .. .. .. .. ..@ start          : int [1:10103462] 1 301 601 901 1201 1501 1801 2101 2401 2701 ...
  .. .. .. .. ..@ width          : int [1:10103462] 300 300 300 300 300 300 300 300 300 300 ...
  .. .. .. .. ..@ NAMES          : NULL
  .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  .. .. ..@ strand         :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. ..@ values         : Factor w/ 3 levels "+","-","*": 3
  .. .. .. .. ..@ lengths        : int 10103462
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  .. .. ..@ seqinfo        :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
  .. .. .. .. ..@ seqnames   : chr [1:23] "chr1" "chr2" "chr3" "chr4" ...
  .. .. .. .. ..@ seqlengths : int [1:23] 248956422 242193529 198295559 190214555 181538259 170805979 159345973 145138636 138394717 133797422 ...
  .. .. .. .. ..@ is_circular: logi [1:23] NA NA NA NA NA NA ...
  .. .. .. .. ..@ genome     : chr [1:23] "hg38" "hg38" "hg38" "hg38" ...
  .. .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. .. .. ..@ rownames       : NULL
  .. .. .. .. ..@ nrows          : int 10103462
  .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  .. .. .. .. ..@ listData       :List of 1
  .. .. .. .. .. ..$ CpG_density: num [1:10103462] NA NA NA NA NA NA NA NA NA NA ...
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ metadata       : list()
  ..@ cnv         :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
  .. .. ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. ..@ values         : Factor w/ 23 levels "chr1","chr2",..: 1 2 3 4 5 6 7 8 9 10 ...
  .. .. .. .. ..@ lengths        : int [1:23] 63488 61952 50688 48640 46336 43520 40960 37120 35328 34048 ...
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  .. .. ..@ ranges         :Formal class 'IRanges' [package "IRanges"] with 6 slots
  .. .. .. .. ..@ start          : int [1:774656] 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..@ width          : int [1:774656] 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 ...
  .. .. .. .. ..@ NAMES          : NULL
  .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  .. .. ..@ strand         :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. ..@ values         : Factor w/ 3 levels "+","-","*": 3
  .. .. .. .. ..@ lengths        : int 774656
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  .. .. ..@ seqinfo        :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
  .. .. .. .. ..@ seqnames   : chr [1:23] "chr1" "chr2" "chr3" "chr4" ...
  .. .. .. .. ..@ seqlengths : int [1:23] NA NA NA NA NA NA NA NA NA NA ...
  .. .. .. .. ..@ is_circular: logi [1:23] NA NA NA NA NA NA ...
  .. .. .. .. ..@ genome     : chr [1:23] NA NA NA NA ...
  .. .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. .. .. ..@ rownames       : NULL
  .. .. .. .. ..@ nrows          : int 774656
  .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  .. .. .. .. ..@ listData       :List of 4
  .. .. .. .. .. ..$ sampleE: num [1:774656] 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 ...
  .. .. .. .. .. ..$ sampleF: num [1:774656] 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ...
  .. .. .. .. .. ..$ sampleG: num [1:774656] 0 0 0 0 0 0 0 0 0 0 ...
  .. .. .. .. .. ..$ sampleH: num [1:774656] 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ...
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ metadata       : list()
  ..@ parameters  :List of 13
  .. ..$ window_size       : num 300
  .. ..$ BSgenome          : chr "BSgenome.Hsapiens.UCSC.hg38"
  .. ..$ minMapQual        : num 10
  .. ..$ minReferenceLength: num 30
  .. ..$ maxInsertSize     : num 1000
  .. ..$ minInsertSize     : num 70
  .. ..$ properPairsOnly   : logi FALSE
  .. ..$ fragmentLength    : num 167
  .. ..$ fragmentSD        : num 38
  .. ..$ enrichmentPattern : chr "CpG"
  .. ..$ enrichmentMethod  : chr "blind1-15"
  .. ..$ coverageMethod    : chr "PairedAndR1s"
  .. ..$ cnvMethod         : chr "HMMdefault"
  ..@ libraries   :List of 3
  .. ..$ file_name   :'data.frame': 4 obs. of  17 variables:
  .. .. ..$ qsea_initial_reads       : int [1:4] 63125797 49442548 27689448 74713226
  .. .. ..$ qsea_initial_pairs       : int [1:4] 6364110 5374423 2964141 7938650
  .. .. ..$ qsea_initial_r1s         : int [1:4] 577012 482749 263185 792627
  .. .. ..$ qsea_mapq_filtered_pairs : int [1:4] 4201875 3574098 2078120 5376092
  .. .. ..$ qsea_size_filtered_pairs : int [1:4] 4178366 3555221 2067588 5349721
  .. .. ..$ qsea_filtered_r1s        : int [1:4] 226613 212464 109637 366389
  .. .. ..$ total_fragments          : int [1:4] 4404979 3767685 2177225 5716110
  .. .. ..$ valid_fragments          : int [1:4] 4386971 3746208 2169445 5695692
  .. .. ..$ library_factor           : num [1:4] 1 1 1 1
  .. .. ..$ offset                   : num [1:4] 0.0353 0.038 0.0433 0.0612
  .. .. ..$ fragment_median          : num [1:4] 199 196 292 195
  .. .. ..$ fragment_mean            : num [1:4] 272 266 310 267
  .. .. ..$ fragment_sd              : num [1:4] 139 135 158 138
  .. .. ..$ relH                     : num [1:4] 5.04 5.09 4.7 4.73
  .. .. ..$ GoGe                     : num [1:4] 2.47 2.47 2.38 2.4
  .. .. ..$ fragments_without_pattern: int [1:4] 135410 131259 70486 289110
  .. .. ..$ prop_without_pattern     : num [1:4] 0.031 0.035 0.032 0.051
  .. ..$ input_file  :'data.frame': 4 obs. of  17 variables:
  .. .. ..$ qsea_initial_reads       : int [1:4] 5312141 5217320 5810327 5127925
  .. .. ..$ qsea_initial_pairs       : int [1:4] 1964327 2021534 1770899 2035148
  .. .. ..$ qsea_initial_r1s         : int [1:4] 315259 308108 257606 337601
  .. .. ..$ qsea_mapq_filtered_pairs : int [1:4] 1868256 1912760 1681805 1930196
  .. .. ..$ qsea_size_filtered_pairs : int [1:4] 1853202 1898174 1642107 1924345
  .. .. ..$ qsea_filtered_r1s        : int [1:4] 281056 273300 229709 300566
  .. .. ..$ total_fragments          : int [1:4] 2134258 2171474 1871816 2224911
  .. .. ..$ valid_fragments          : int [1:4] 2127659 2158156 1866038 2217647
  .. .. ..$ library_factor           : logi [1:4] NA NA NA NA
  .. .. ..$ offset                   : logi [1:4] NA NA NA NA
  .. .. ..$ fragment_median          : num [1:4] 173 171 174 171
  .. .. ..$ fragment_mean            : num [1:4] 173 170 178 170
  .. .. ..$ fragment_sd              : num [1:4] 28.7 27.7 39.7 26
  .. .. ..$ relH                     : num [1:4] 0.935 0.94 0.936 0.94
  .. .. ..$ GoGe                     : num [1:4] 0.882 0.885 0.884 0.885
  .. .. ..$ fragments_without_pattern: int [1:4] 735158 758767 637086 772484
  .. .. ..$ prop_without_pattern     : num [1:4] 0.344 0.349 0.34 0.347
  .. ..$ input_counts:Classes ‘data.table’ and 'data.frame':    12084 obs. of  11 variables:
  .. .. ..$ chr    : Factor w/ 23 levels "chr1","chr2",..: 1 1 1 1 1 1 1 1 1 1 ...
  .. .. ..$ start  : int [1:12084] 1 1 1 1 1000001 1000001 1000001 1000001 2000001 2000001 ...
  .. .. ..$ end    : int [1:12084] 1000000 1000000 1000000 1000000 2000000 2000000 2000000 2000000 3000000 3000000 ...
  .. .. ..$ width  : int [1:12084] 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 1000000 ...
  .. .. ..$ strand : Factor w/ 3 levels "+","-","*": 3 3 3 3 3 3 3 3 3 3 ...
  .. .. ..$ sampleE: int [1:12084] 233 233 233 233 562 562 562 562 587 587 ...
  .. .. ..$ sampleF: int [1:12084] 231 231 231 231 637 637 637 637 685 685 ...
  .. .. ..$ sampleG: int [1:12084] 159 159 159 159 506 506 506 506 527 527 ...
  .. .. ..$ sampleH: int [1:12084] 226 226 226 226 584 584 584 584 672 672 ...
  .. .. ..$ gc     : num [1:12084] -1 -1 -1 -1 0.605 ...
  .. .. ..$ map    : num [1:12084] 0.0951 0.3753 0.0951 0.3753 0.9309 ...
  .. .. ..- attr(*, ".internal.selfref")=<externalptr> 
  ..@ enrichment  :List of 5
  .. ..$ parameters  : num [1:4, 1:3] 3.84 3.91 3.86 3.79 27.53 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:4] "sampleE" "sampleF" "sampleG" "sampleH"
  .. .. .. ..$ : chr [1:3] "a" "b" "c"
  .. ..$ factors     : num [1:80, 1:4] NA -0.0467 -0.0361 -0.0112 0.035 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : chr [1:4] "sampleE" "sampleF" "sampleG" "sampleH"
  .. ..$ density     : num [1:80] NA 1.21 1.71 2.22 2.73 ...
  .. ..$ n           : num [1:80] 0 1636171 1091653 730933 511531 ...
  .. ..$ pattern_name: chr "CpG"

lbeltrame commented 1 month ago

I noticed that there are loads of duplicated ranges in the CNV slot, is this normal?

unique(batch3@cnv)
GRanges object with 3021 ranges and 4 metadata columns:
         seqnames              ranges strand |   sampleE   sampleF   sampleG   sampleH
            <Rle>           <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
     [1]     chr1           1-1000000      * |      0.09      0.05         0      0.05
     [2]     chr1     1000001-2000000      * |      0.09      0.05         0      0.05
     [3]     chr1     2000001-3000000      * |      0.09      0.05         0      0.05
     [4]     chr1     3000001-4000000      * |      0.09      0.05         0      0.05
     [5]     chr1     4000001-5000000      * |      0.09      0.05         0      0.05
     ...      ...                 ...    ... .       ...       ...       ...       ...
  [3017]     chrX 151000001-152000000      * |     -0.05     -0.15         0         0
  [3018]     chrX 152000001-153000000      * |     -0.05     -0.15         0         0
  [3019]     chrX 153000001-154000000      * |     -0.05     -0.15         0         0
  [3020]     chrX 154000001-155000000      * |     -0.05     -0.15         0         0
  [3021]     chrX 155000001-156000000      * |     -0.05     -0.15         0         0
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

EDIT: If I remove the duplicated CNV regions, I can merge just fine:

r$> batch3@cnv = unique(batch3@cnv)

r$> batch2 = readRDS("PoN_batch_5-8.rds")

r$> batch2@cnv = unique(batch2@cnv)

r$> combineQsets(batch2, batch3)
Joining with `by = join_by(seqnames, start, end, width, strand, CpG_density)`
qseaSet
=======================================
8 Samples: 
sampleA, sampleB, sampleC, sampleD, sampleE, sampleF, sampleG, sampleH
10103462 Regions in 23 chromosomes of BSgenome.Hsapiens.UCSC.hg38

SPPearce commented 1 month ago

I noticed that there are loads of duplicated ranges in the CNV slot, is this normal?

Nope, that is exactly what I suspected, it shouldn't have any duplicated regions there. So we need to work out where they are coming from upstream.

lbeltrame commented 1 month ago

Where can I debug this? (In the mean time, I'll deduplicate the cnv slot to progress further)

SPPearce commented 1 month ago

Where can I debug this? (In the mean time, I'll deduplicate the cnv slot to progress further)

Can you share with me your command that you used for makeQset please.

SPPearce commented 1 month ago

And what branch you are using I guess? If you are using a modified one with multiple of the fixes can you share that somewhere? @Steven-M-Hill is supposedly going to be looking at the PRs this week....

lbeltrame commented 1 month ago

Currently I run the vignette branch. I didn't yet switch to use a patched one (I wanted to review every change before charging in blindly...).

I used this command to make the qSet:

qseaSet <- makeQset(table,  BSgenome = "BSgenome.Hsapiens.UCSC.hg38", 
    windowSize = 300, fragmentType = "cfDNA", CNVmethod = "HMMdefault", 
    chrSelect = c(paste0("chr",1:22), "chrX"), 
    hmmCopyGC = xt_gc_hg38_500kb, 
    hmmCopyMap = xt_map_hg38)

The xt_* GC and mappability are simply the same objects as in mesa, but in UCSC notation. table includes samples and their corresponding input.

SPPearce commented 1 month ago

And your xt_gc_hg38_500kb and xt_map_hg38 objects, how many rows/columns are they? xt_gc_hg38_500kb suggests it is a 500kb size, not the 1Mb default...

lbeltrame commented 1 month ago

There are about 6K rows for each object. Using the 500kbp variant may be have been an oversight as that's the bin size I use for CNV analysis on ctDNA. To make sure this wasn't the cause I'm running an analysis with the 1Mbp variants. I'll update once done.

I added chr fairly naively there with

gc_hg38_1000kb[, `chr` := paste0("chr", `chr`)]

lbeltrame commented 1 month ago

.... Well, it looks there aren't duplicates now. So I guess it was an incompatible bin size: Quite a silent error though because it went through unnoticed. I assume there's no way to detect it short of checking the size of the first range? Because this can be problematic and it was completely silent.

SPPearce commented 1 month ago

Yeah, need to add some checks for that then. I'll edit the code and add some checks. If you want to use 500kb bins, there is no reason you can't, just change the CNVwindowSize parameter.

cruk-mi / mesa

Adding GC and mappability with incompatible bin sizes massively duplicates CNV regions in the resulting `qseaSet` #26