Closed cb4github closed 1 year ago
Environment Info
> package.version("GenomicRanges")
[1] "1.48.0"
> Sys.info()
sysname release version nodename machine login user effective_user
"Windows" "10 x64" "build 19042" "VDI-XXX-0" "x86-64" "xxx" "xxx" "xxx"
> R.version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
crt ucrt
system x86_64, mingw32
status
major 4
minor 2.1
year 2022
month 06
day 23
svn rev 82513
language R
version.string R version 4.2.1 (2022-06-23 ucrt)
nickname Funny-Looking Kid
Hi,
When subject
is a GRangesList object, its top-level elements are GRanges objects, which means that each top-level element in subject
is made of multiple ranges. When findOverlaps()
looks at the overlap between query[i]
(a single range) and subject[[j]]
(multiple ranges), and if minoverlap
is specified, then what is taken into consideration is the total number of positions in subject[[j]]
that are in the overlap.
For example:
library(GenomicRanges)
gr1 <- GRanges("chr5:11-25")
gr2 <- GRanges("chr5:31-40")
gr3 <- c(GRanges("chr5:11-20"), gr2)
subject <- GRangesList(gr1, gr2, gr3)
subject
# GRangesList object of length 3:
# [[1]]
# GRanges object with 1 range and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr5 11-25 *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
#
# [[2]]
# GRanges object with 1 range and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr5 31-40 *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
#
# [[3]]
# GRanges object with 2 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr5 11-20 *
# [2] chr5 31-40 *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
Then:
query <- GRanges("chr5:18-35")
## 1 hit, as expected
findOverlaps(query, subject[[1]], minoverlap=7)
# Hits object with 1 hit and 0 metadata columns:
# queryHits subjectHits
# <integer> <integer>
# [1] 1 1
# -------
# queryLength: 1 / subjectLength: 1
## No hit, as expected
findOverlaps(query, subject[[2]], minoverlap=7)
# Hits object with 0 hits and 0 metadata columns:
# queryHits subjectHits
# <integer> <integer>
# -------
# queryLength: 1 / subjectLength: 1
## No hit, as expected
findOverlaps(query, subject[[3]], minoverlap=7)
# Hits object with 0 hits and 0 metadata columns:
# queryHits subjectHits
# <integer> <integer>
# -------
# queryLength: 1 / subjectLength: 2
Now with the GRangeslist object:
findOverlaps(query, subject, minoverlap=7)
# Hits object with 2 hits and 0 metadata columns:
# queryHits subjectHits
# <integer> <integer>
# [1] 1 1
# [2] 1 3
# -------
# queryLength: 1 / subjectLength: 3
Yes, a hit is reported with the 3rd element in subject
. That's because the total number of positions in this overlap is 8:
subject[[3]]
, so 3 positionsThe driving use case for this behavior is when the GRangesList object represents a set of exons grouped by transcript. In this case, each top-level element in subject
represents a transcript, and the multiple ranges inside a given top-level element represent the exons in that transcript. Choosing a semantic where minoverlap
lets the user control the minimum amount of overlap across all the exons in a transcript sounded more meaningful than choosing a semantic where minoverlap
would control the minimum overlap that we must see with at least one exon in the transcript.
Hope this makes sense.
Thanks for your prompt response, and, yes, that makes good sense for the driving use case - presumably including sets of non-overlapping exons within the respective gene isoforms.
See details of my case below, but briefly and unfortunately only in hindsight and after much trial and error, I now see that I need to migrate my code to using a subject of the simpler form GRanges vice a GRangesList.
Details In my case, I have 10000's of DMR's for each of which I'm interested in finding the first overlap of a single interval from among the following list of subject GRanges in order of precedence - "Promoter", "Promoter-downstream", "Pre-promoter", and "Intragenic" - including possibly internal, non-self overlaps and thus contrary to the sets-of-exons use case.
Also, out of my 10,000's of query items, my testing with only the first few (first 5, then 100) query items did not incur this issue, whereas the first item to incur this issue was my query item #24692:(.
Thus, since I had not seen any indication in the R-documentation of the criterion of total number of positions (vice maximum overlap with any single interval/range), it became easy for me to infer - although incorrectly so - that every hit in the subject GRangesList confirmed the existence of at least one single interval - from among those in the subject hit (GRanges) - that satisfied the original criterion of minoverlap = 50.
R-documentation I understand that the brevity of the R-documentation for GRanges promotes a lower barrier of entry for users by avoiding the wall of words.
However, in my case and possibly for others' benefit, it would have saved time and effort in design, coding and debugging, had there been a bit more detail and/or example in the R-documentation including and demonstrating the notion of total number of positions for subjects of form GRangesList.
I'm closing the issue, and thanks for reading, and thanks for all your efforts!
Best, CB
You're absolutely right. This is now addressed in GenomicRanges 1.50.1 (BioC 3.16, current release) and 1.51.1 (BioC 3.17, devel version). See commit 8864a12cefae0a1fc396438c579859d9a934a65a for the details.
Cheers, H.
Describe the bug Empty result from findOverlaps(GRanges, Granges) appears to disagree with a non-empty Granges result obtained from findOverlaps(GRanges, GRangesList)
To Reproduce build test subject
build test query
find first hit in GRangesList...
...yet no first hit in corresponding GRanges item
Observe intersection widths are individually <50 yet their sum 38+38=76 > 50. Is this the desired behavior? Please let me know if I can provide more info, thanks.
Expected behavior result2 should be the first entry in testSubject[[1]]
Correction: Since individual overlaps are each < 50, then result1 should be NA.
Best, CB