findOverlaps appears inconsistent

cb4github commented 1 year ago

Describe the bug Empty result from findOverlaps(GRanges, Granges) appears to disagree with a non-empty Granges result obtained from findOverlaps(GRanges, GRangesList)

To Reproduce build test subject

> testSubject <- fread("group group_name seqnames     start       end width strand tx_id     tx_name
+ 1       <NA>     chr5 139125308 139127808  2501      - 25443 NR_105052.2
+ 1       <NA>     chr5 139125308 139127808  2501      - 25444 NR_105053.2")
> testSubject <- GRangesList(makeGRangesFromDataFrame(as.data.frame(testSubject)), compress = F)
> testSubject
GRangesList object of length 1:
[[1]]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chr5 139125308-139127808      -
  [2]     chr5 139125308-139127808      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

build test query

> testQuery <- fread("  seqnames     start       end width strand    pmd score unused.1 unused.2   color
+ chr5 139127771 139128310   540      * 0.2843     0        0        0 255,0,0")
> testQuery <- makeGRangesFromDataFrame(testQuery)
> testQuery
GRanges object with 1 range and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chr5 139127771-139128310      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

find first hit in GRangesList...

> result1 <- findOverlaps(testQuery, testSubject, minoverlap = 50, select = "first")
> result1
[1] 1

...yet no first hit in corresponding GRanges item

> result2 <- findOverlaps(testQuery, testSubject[[result1]], minoverlap = 50, select = "first")
> result2
[1] NA

Observe intersection widths are individually <50 yet their sum 38+38=76 > 50. Is this the desired behavior? Please let me know if I can provide more info, thanks.

> width(pintersect(rep(x = testQuery, times=2), testSubject[[1]]))
[1] 38 38

Expected behavior result2 should be the first entry in testSubject[[1]]

> testSubject[[1]][1]
GRanges object with 1 range and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chr5 139125308-139127808      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Correction: Since individual overlaps are each < 50, then result1 should be NA.

Best, CB

cb4github commented 1 year ago

Environment Info

> package.version("GenomicRanges")
[1] "1.48.0"
> Sys.info()
          sysname           release           version          nodename           machine             login              user    effective_user 
        "Windows"          "10 x64"     "build 19042" "VDI-XXX-0"          "x86-64"      "xxx"      "xxx"      "xxx" 
> R.version
               _                                
platform       x86_64-w64-mingw32               
arch           x86_64                           
os             mingw32                          
crt            ucrt                             
system         x86_64, mingw32                  
status                                          
major          4                                
minor          2.1                              
year           2022                             
month          06                               
day            23                               
svn rev        82513                            
language       R                                
version.string R version 4.2.1 (2022-06-23 ucrt)
nickname       Funny-Looking Kid

hpages commented 1 year ago

Hi,

When subject is a GRangesList object, its top-level elements are GRanges objects, which means that each top-level element in subject is made of multiple ranges. When findOverlaps() looks at the overlap between query[i] (a single range) and subject[[j]] (multiple ranges), and if minoverlap is specified, then what is taken into consideration is the total number of positions in subject[[j]] that are in the overlap.

For example:

library(GenomicRanges)

gr1 <- GRanges("chr5:11-25")
gr2 <- GRanges("chr5:31-40")
gr3 <- c(GRanges("chr5:11-20"), gr2)
subject <- GRangesList(gr1, gr2, gr3)
subject
# GRangesList object of length 3:
# [[1]]
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr5     11-25      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths
# 
# [[2]]
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr5     31-40      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#
# [[3]]
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr5     11-20      *
#   [2]     chr5     31-40      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Then:

query <- GRanges("chr5:18-35")

## 1 hit, as expected
findOverlaps(query, subject[[1]], minoverlap=7)
# Hits object with 1 hit and 0 metadata columns:
#       queryHits subjectHits
#       <integer>   <integer>
#   [1]         1           1
#   -------
#   queryLength: 1 / subjectLength: 1

## No hit, as expected
findOverlaps(query, subject[[2]], minoverlap=7)
# Hits object with 0 hits and 0 metadata columns:
#    queryHits subjectHits
#    <integer>   <integer>
#   -------
#   queryLength: 1 / subjectLength: 1

## No hit, as expected
findOverlaps(query, subject[[3]], minoverlap=7)
# Hits object with 0 hits and 0 metadata columns:
#    queryHits subjectHits
#    <integer>   <integer>
#   -------
#   queryLength: 1 / subjectLength: 2

Now with the GRangeslist object:

findOverlaps(query, subject, minoverlap=7)
# Hits object with 2 hits and 0 metadata columns:
#       queryHits subjectHits
#       <integer>   <integer>
#   [1]         1           1
#   [2]         1           3
#   -------
#   queryLength: 1 / subjectLength: 3

Yes, a hit is reported with the 3rd element in subject. That's because the total number of positions in this overlap is 8:

positions 18,19,20 in the first range of subject[[3]], so 3 positions
positions 31 to 35 in its second range, so 5 positions

The driving use case for this behavior is when the GRangesList object represents a set of exons grouped by transcript. In this case, each top-level element in subject represents a transcript, and the multiple ranges inside a given top-level element represent the exons in that transcript. Choosing a semantic where minoverlap lets the user control the minimum amount of overlap across all the exons in a transcript sounded more meaningful than choosing a semantic where minoverlap would control the minimum overlap that we must see with at least one exon in the transcript.

Hope this makes sense.

cb4github commented 1 year ago

Thanks for your prompt response, and, yes, that makes good sense for the driving use case - presumably including sets of non-overlapping exons within the respective gene isoforms.

See details of my case below, but briefly and unfortunately only in hindsight and after much trial and error, I now see that I need to migrate my code to using a subject of the simpler form GRanges vice a GRangesList.

Details In my case, I have 10000's of DMR's for each of which I'm interested in finding the first overlap of a single interval from among the following list of subject GRanges in order of precedence - "Promoter", "Promoter-downstream", "Pre-promoter", and "Intragenic" - including possibly internal, non-self overlaps and thus contrary to the sets-of-exons use case.

Also, out of my 10,000's of query items, my testing with only the first few (first 5, then 100) query items did not incur this issue, whereas the first item to incur this issue was my query item #24692:(.

Thus, since I had not seen any indication in the R-documentation of the criterion of total number of positions (vice maximum overlap with any single interval/range), it became easy for me to infer - although incorrectly so - that every hit in the subject GRangesList confirmed the existence of at least one single interval - from among those in the subject hit (GRanges) - that satisfied the original criterion of minoverlap = 50.

R-documentation I understand that the brevity of the R-documentation for GRanges promotes a lower barrier of entry for users by avoiding the wall of words.

However, in my case and possibly for others' benefit, it would have saved time and effort in design, coding and debugging, had there been a bit more detail and/or example in the R-documentation including and demonstrating the notion of total number of positions for subjects of form GRangesList.

I'm closing the issue, and thanks for reading, and thanks for all your efforts!

Best, CB

hpages commented 1 year ago

You're absolutely right. This is now addressed in GenomicRanges 1.50.1 (BioC 3.16, current release) and 1.51.1 (BioC 3.17, devel version). See commit 8864a12cefae0a1fc396438c579859d9a934a65a for the details.

Cheers, H.

Bioconductor / GenomicRanges

findOverlaps appears inconsistent #70