Bioconductor / GenomicRanges

Representation and manipulation of genomic intervals
https://bioconductor.org/packages/GenomicRanges

findOverlaps with type="equal" and a GRangesList #16

Open kasperdanielhansen opened 5 years ago

kasperdanielhansen commented 5 years ago

When I use findOverlaps with type="equal" and a GRangesList I get an error:

> findOverlaps(gr, grIntrons.24, type = "equal")
Error in match.arg(type) : 
  'arg' should be one of “any”, “start”, “end”, “within”

(in this case gr is a GRanges and grIntrons.24 is a GRangesList).

This is made even more confusing by the generic

> findOverlaps
standardGeneric for "findOverlaps" defined from package "IRanges"

function (query, subject, maxgap = -1L, minoverlap = 0L, type = c("any", 
    "start", "end", "within", "equal"), select = c("all", "first", 
    "last", "arbitrary"), ...) 
standardGeneric("findOverlaps")
<bytecode: 0x7f8a47c5aea8>
<environment: 0x7f8a48e3a1e0>
Methods may be defined for arguments: query, subject
Use  showMethods("findOverlaps")  for currently available ones.

which strongly suggests type="equal" is valid.
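Until the GRangesList method accepts type="equal", one possible workaround is to flatten the list, match against the flattened GRanges (where type="equal" is accepted), and map the hits back to list elements. This is only a sketch: the toy `gr` and `grl` below are assumptions standing in for the real `gr` and `grIntrons.24`, which are not shown in the report, and the result reports list elements that contain at least one range equal to the query, which may or may not be the semantics wanted here.

```r
library(GenomicRanges)

# Toy stand-ins (assumptions) for the objects in the report.
gr  <- GRanges(c("chr1:11-15", "chr1:21-30"))
grl <- GRangesList(
  a = GRanges(c("chr1:11-15", "chr1:40-50")),
  b = GRanges("chr1:21-29")
)

# Flatten the list and record which list element each range came from.
flat  <- unlist(grl, use.names = FALSE)
group <- rep(seq_along(grl), elementNROWS(grl))

# type = "equal" is accepted when both arguments are GRanges.
hits <- findOverlaps(gr, flat, type = "equal")

# Map hits on the flattened ranges back to list elements,
# dropping duplicates caused by repeated ranges within one element.
res <- unique(data.frame(query   = queryHits(hits),
                         subject = group[subjectHits(hits)]))
res
```

The `unique()` call matters when a list element contains the same range more than once; without it, the same query/element pair would be reported several times.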

lawremi commented 5 years ago

I've recently encountered a use case for the "equal" type. I've defined it along the lines of setequal(), so duplicates and order are ignored when determining whether two compound ranges are "equal". This is consistent with type "within", which checks whether one is a subset of the other. If this sounds OK then I will push it.

hpages commented 5 years ago

type="within" does not seem to treat a compound range as a set of positions:

gr <- GRanges("chr1:11-15")
grl <- GRangesList(GRanges(c("chr1:11-13", "chr1:12-15")))
findOverlaps(gr, grl, type="within")
# Hits object with 0 hits and 0 metadata columns:
#    queryHits subjectHits
#    <integer>   <integer>
#   -------
#   queryLength: 1 / subjectLength: 1

Given that type "equal" is expected to be more stringent than type "within", it would be counter-intuitive to get a hit in the above situation when replacing type="within" with type="equal".

lawremi commented 5 years ago

I guess what I meant is that type="within" requires a within-match for all query ranges (so the query is a subset of the subject in a more general sense), while type="equal" requires equality for all query ranges, and all subject ranges (so it is more like set equality). This is at the range level, not position level.

hpages commented 5 years ago

mmh I see. So IIUC in the GRanges#GRangesList case (Kasper's use case), type="equal" should report a hit when:

gr <- GRanges("chr1:11-15")
grl <- GRangesList(GRanges(c("chr1:11-15", "chr1:11-15")))

but not when:

grl <- GRangesList(GRanges(c("chr1:11-15", "chr1:11-14")))

@kasperdanielhansen is that the semantics you're after?

lawremi commented 5 years ago

That's right. It was easier to program and more efficient to ignore the duplicates and order and was good enough for my use case. That was the only rationale.
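The range-level semantics discussed above can be modelled in plain base R. This is a sketch of one reading of the comments in this thread, not the code in the commit: the helper names (`range_within`, `range_equal`, `compound_within`, `compound_equal`) are hypothetical, each range is a named `c(start, end)` vector, and all ranges are assumed to be on the same sequence and strand.

```r
# Hypothetical helpers, not part of GenomicRanges.
# A range is a named numeric vector c(start = ..., end = ...).
range_within <- function(q, s) {
  q[["start"]] >= s[["start"]] && q[["end"]] <= s[["end"]]
}
range_equal <- function(q, s) {
  q[["start"]] == s[["start"]] && q[["end"]] == s[["end"]]
}

# TRUE if every range in `a` is matched by at least one range in `b`.
all_matched <- function(a, b, pred) {
  all(vapply(a, function(x) {
    any(vapply(b, function(y) pred(x, y), logical(1)))
  }, logical(1)))
}

# "within": every query range lies within some subject range.
compound_within <- function(query, subject) {
  all_matched(query, subject, range_within)
}

# "equal": every query range equals some subject range and vice versa,
# so duplicates and order are ignored, as with setequal().
compound_equal <- function(query, subject) {
  all_matched(query, subject, range_equal) &&
    all_matched(subject, query, range_equal)
}

query <- list(c(start = 11, end = 15))
subj_split <- list(c(start = 11, end = 13), c(start = 12, end = 15))
subj_dup   <- list(c(start = 11, end = 15), c(start = 11, end = 15))
subj_mixed <- list(c(start = 11, end = 15), c(start = 11, end = 14))

compound_within(query, subj_split)  # FALSE: 11-15 fits in neither piece
compound_equal(query, subj_dup)     # TRUE: the duplicate is ignored
compound_equal(query, subj_mixed)   # FALSE: 11-14 has no equal in the query
```

The three calls mirror the examples in the comments above: a compound range is compared range by range rather than as a set of positions, which is why the split subject yields no "within" match even though its union covers the query.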

hpages commented 5 years ago

Looks like @lawremi pushed this back in March (commit f25a45f33b5df85b2c63b3d21189982e0504a5cb). Michael, any chance you could add a few unit tests and maybe an example in the man page for this new feature?

Feel free to close the issue. Thanks!