Bioconductor / GenomicRanges

Representation and manipulation of genomic intervals
https://bioconductor.org/packages/GenomicRanges
41 stars 17 forks source link

Add GRangesFactor class #25

Closed hpages closed 5 years ago

hpages commented 5 years ago

A specialization of Factor objects where the levels can be any GRanges object or derivative.

As discussed here: Bioconductor/Contributions#1114

LTLA commented 5 years ago

I can start working on this, but I don't know whether it needs to be something more sophisticated than a Factor subclass. It would be nice to be able to use methods for GRanges on a GRangesFactor, and I don't know how to do that automatically without re-writing special methods for GRangesFactors.

hpages commented 5 years ago

Not much more than that indeed. I have something that I didn't commit yet: it adds the Factor subclass + a dedicated constructor + some additional coerce methods to go back and forth between GRanges and GRangesFactor + a nice show method + a man page + some unit tests.

Having GRanges methods work on a GRangesFactor will be convenient but I need to think about the best way of doing this. The "standard" way to go would be to make GRangesFactor inherit from GenomicRanges and implement a granges() method for GRangesFactor objects. That normally makes many things work out-of-the-box. However there could be some undesirable effects with making GRangesFactor extend GenomicRanges so I want to look into this carefully.

hpages commented 5 years ago

Hi Aaron,

This is done in GenomicRanges 1.37.11: https://github.com/Bioconductor/GenomicRanges/commit/d08ab13a05498d0e2d5b19413b53eb2e854d1dad

I defined a few GRanges methods to work on GRangesFactor objects (e.g. seqinfo(), seqnames(), start(), end(), width(), strand(), only getters for now). Nothing smart: they're all explicitly defined for GRangesFactor. They are 1- or 2-liners so not a big deal. I'm still hesitant about making GRangesFactor inherit from GenomicRanges: this would make a few things like findOverlaps() & family and coverage() work out-of-the-box but they wouldn't be optimized. Sounds like you were interested in having these things optimized for GRangesFactor. This will be for the next round though as I REALLY need to work on my workshop for the conf in NY.

In the mean time I'm putting the Factor/GRangesFactor stuff on the back burner and adding the optimized methods and unit tests to the TODO list.

LTLA commented 5 years ago

Sure, I was just wondering whether there was some nice easy solution to make everything work out of the box. The dream would be for end-users to not even notice that it was a GRangesFactor and not a GRanges. If there isn't an easy way, I guess we'll just have to implement everything to cover all the bases.

hpages commented 5 years ago

There is a subtle but important difference with a GRanges though: comparison between GRangesFactor objects only takes the index into account. It's not a range-based comparison like for GRanges objects. This is why I don't think we can make GRangesFactor inherit from GenomicRanges (comparing/ordering the elements of a GenomicRanges derivative is range-based). GRanges and GRangesFactor objects are semantically different in that aspect.

hpages commented 5 years ago

If we want to have a range-based comparison for GRangesFactor objects, we could do that. Then I think it would be safe/ok to make them inherit from GenomicRanges. I just don't know enough about the GRangesFactor use cases to choose the best semantic.

LTLA commented 5 years ago

comparison between GRangesFactor objects only takes the index into account.

Hm, that's not what I had in mind. I was thinking that a GRangesFactor would behave as if it were a GRanges, the only difference being that it saved memory and allowed for certain optimizations. Even factor objects seem to compare the levels[index] rather than just the index:

f1 <- factor(1:5, levels=1:5)
f2 <- factor(1:5, levels=5:1) 
f1==f2 # all TRUE, as one might expect from comparing values.
as.integer(f1)==as.integer(f2) # all FALSE except the middle, when forced to compare indices.

Perhaps Factor was a misnomer, and it would be better to think of it as an CompactGRanges of some sort. I assume that, when you're talking about factor semantics, you're referring to the cases where they sometimes behave like integers and sometimes like characters. I don't care much for having to deal with those headaches when working with Factor - though I would like to keep levels() and levels()<-.

hpages commented 5 years ago

Note that currently I've only allowed comparison between Factor objects with the same levels in the same order so for == (or !=) it doesn't matter whether the comparison is index- or level- based, as they both give the same result. It's just that using index-based comparison was the easy way to go. However for <=, <, >, and >= (which are not supported between ordinary factors), index-based and levels- based comparison are different.

Anyway, it's easy to change that. So let's have Factor comparison be level-based instead of index-based. In that case there is no more reason to restrict comparison to Factor objects with same levels in same order (the only reason for this restriction was to support an index-based comparison that wouldn't be too insane). Then we CAN have GRangesFactor inherit from GenomicRanges because then GRangesFactor objects won't be distinguishable from GRanges objects as far as the GenomicRanges API is concerned. And It's true that for most users the index/levels business won't matter anymore and will just be an internal representation business. However I still think that GRangesFactor is a good name: some advanced users will want to know about the index/levels business and GRangesFactor reflects that (and these advanced users have the levels()/as.integer()/unfactor() API to deal with that).

Are we on the same page? Glad we discussed this.

LTLA commented 5 years ago

Sounds like a plan. Good luck with your workshop.

hpages commented 5 years ago

Factor comparison/ordering is now level-based instead of index-based: https://github.com/Bioconductor/S4Vectors/commit/c7966fe614d1aad236d6930079d5db10b5ac8a05

Still need to make GRangesFactor inherit from GenomicRanges: issue #26

Closing this now.