Bioconductor / GenomicRanges

Representation and manipulation of genomic intervals
https://bioconductor.org/packages/GenomicRanges
41 stars 17 forks source link

Making seqinfo<- easier to use with new levels #54

Closed LTLA closed 3 years ago

LTLA commented 3 years ago

This works:

library(GenomicRanges)
len <- c(chrB=10, chrA=20)
x <- GRanges("chrA:1-10", seqinfo=Seqinfo(seqnames=names(len), seqlengths=len))

This doesn't:

x <- GRanges("chrA:1-10")
seqinfo(x) <- Seqinfo(seqnames=names(len), seqlengths=len))
## Error in GenomeInfoDb:::makeNewSeqnames(x, new2old = new2old, seqlevels(value)) :
##   when 'new2old' is NULL, the first elements in the
##   supplied 'seqlevels' must be identical to 'seqlevels(x)'

What gives? Everything is named so there should be no problem defining a mapping automatically.

So currently I have to take a weird and unintuitive detour to do what I need to do:

seqlevels(x) <- names(len)
seqlengths(x) <- len
hpages commented 3 years ago

When the seqlevels of the supplied Seqinfo object are not in the same order than in x, the real intention behind

seqinfo(x) <- my_seqinfo

is ambiguous i.e. you could be trying to reorder the seqlevels or to rename them. In your case, you want to rename, and you make this explicit by doing:

seqlevels(x) <- seqlevels(my_seqinfo)

first. Now that the seqlevels in x are in the same order than in my_seqinfo, the intention behind

seqinfo(x) <- my_seqinfo

is no longer ambiguous.

LTLA commented 3 years ago

Not sure what you mean by renaming, because seqnames(x) is still the same before and after my seqlevels<- call:

x <- GRanges("chrA:1-1000")
as.character(seqnames(x))
## [1] "chrA"

seqlevels(x) <- c("chrB", "chrA")
as.character(seqnames(x))
## [1] "chrA"

Which is good, because I definitely wouldn't have expected that to become "chrB".

Anyway, I understand your general point about renaming and reordering. But it seems that there is scope for a much friendlier flag in seqinfo<- to handle the common case where people just want to decorate an existing GRanges with a Seqinfo object, often with the seqlevels in a different order from what is present in x. Maybe something like:

seqinfo(x, new2old="names") <- my_seqinfo

would have the same effect as running seqlevels(x) <- seqlevels(my_seqinfo) first. Plus a better error message, e.g., if we see things out of order but the replacement seqlevels are a superset of those in seqnames(x), we might suggest new2old="names". This would cut down on a lot of confusion.

hpages commented 3 years ago

Renaming the seqlevels means that seqlevel "chrA" gets renamed "chrB" and seqlevel "chrB" gets renamed "chrA". This is not the same as reordering the seqlevels. For example with a factor:

> f <- factor(c("y", "y", "y", "Z", "Z"))
> f
[1] y y y Z Z
Levels: y Z
> levels(f) <- rev(levels(f))
> f
[1] Z Z Z y y
Levels: Z y

As you can see level "y" got renamed "Z", and level "Z" got renamed "y".

If we want to switch the order of the levels, we need to do something like:

> f <- factor(c("y", "y", "y", "Z", "Z"))
> f
[1] y y y Z Z
Levels: y Z
> factor(f, levels=rev(levels(f)))
[1] y y y Z Z
Levels: Z y

Now the levels got reordered instead of renamed.

Whoever decided the semantic of base::levels<- chose the renaming behavior. Of course this is an arbitrary choice. Some people might find this choice natural, others might disagree. It's a situation where there is no right choice. Both choices are wrong because they both lead to bugs. People on both sides will assume that base::levels<- does what they think it does and half of them will be wrong.

When I introduced seqinfo<- (this was at least 6 years ago), I faced the same problem as the author of base::levels<- did. Which is that, when my_seqinfo has the same seqlevels as x, but in a different order, the meaning of seqinfo(x) <- my_seqinfo is fundamentally ambiguous, in the same way that the meaning of levels(f) <- rev(levels(f)) is ambiguous. In that kind of situation, I don't think that choosing one behavior over the other is good design, because this will open the door to buggy code. I'd rather raise an error so the developer is forced to explicitly disambiguate. Either by using the 2-step idiom, if they want to reorder the seqlevels:

seqlevels(x) <- seqlevels(my_seqinfo)  # explicitly reorder
seqinfo(x) <- my_seqinfo

or by specifying the new2old mapping, if they want to rename the seqlevels:

seqlevels(x, new2old=1:2) <- my_seqinfo

Note that the reordering case can also be handled by specifying the new2old mapping:

seqlevels(x, new2old=2:1) <- my_seqinfo

Some situations can be a lot more complicated than that e.g. when my_seqinfo and x have different seqlevels but share some of them, and when some of the shared seqlevels must be renamed and others reordered. Using the new2old interface allows the user to specify exactly and precisely what they want to do, in any situation.

I agree that the error message could be improved. I'll work on that.

LTLA commented 3 years ago

Is it documented that seqlevels<- will reorder instead of rename? I can't find this anywhere in ?seqlevels. I mean, I'm fine with the requirement that the user explicitly disambiguate between renaming and reordering for seqinfo<-, but if that's the philosophy, why take the reordering default for seqlevels<-? It's not like it's consistent with base::levels<-, either.

Regardless, it seems like it would be a good idea to offer an argument in seqinfo<- that explicitly distinguishes between rename vs reorder. It would be easier to remember, use and document compared to the two-step procedure. You could even avoid an extra argument by overloading, e.g., new2old="reorder" to request a reordering of existing levels.

hpages commented 3 years ago

The behavior of seqlevels<- is documented in ?seqlevels.

With GenomeInfoDb 1.27.5:

library(GenomicRanges)
x <- GRanges(c("chrA:1-10", "chrB:11-20"))

seqinfo(x) <- Seqinfo(c("chrB", "chrA", "chrC"))
# Error in GenomeInfoDb:::makeNewSeqnames(x, new2old = new2old, seqlevels(value)) : 
#   The seqlevels in the supplied Seqinfo object are not the same as the
#   seqlevels in 'x'. To map them to the seqlevels of the same name in 'x',
#   the easiest way is to propagate them to 'x' with:
# 
#     seqlevels(x) <- seqlevels(new_seqinfo)
# 
#   before calling the 'seqinfo()' setter.
#   For any more complicated mapping, please use the 'new2old' argument.

seqinfo(x) <- Seqinfo(c("chrB", "chrA"))
# Error in GenomeInfoDb:::makeNewSeqnames(x, new2old = new2old, seqlevels(value)) : 
#   The seqlevels in the supplied Seqinfo object are not in the same order
#   as the seqlevels in 'x'. Please reorder the seqlevels in 'x' with:
# 
#     seqlevels(x) <- seqlevels(new_seqinfo)
# 
#   before calling the 'seqinfo()' setter.
#   For any more complicated mapping between the new and old seqlevels
#   (e.g. for a mapping that will result in the renaming of some seqlevels
#   in 'x'), please use the 'new2old' argument.

seqinfo(x) <- Seqinfo("chrB")
# Error in GenomeInfoDb:::makeNewSeqnames(x, new2old = new2old, seqlevels(value)) : 
#   The seqlevels in the supplied Seqinfo object are not the same as the
#   seqlevels in 'x'. To map them to the seqlevels of the same name in 'x',
#   the easiest way is to propagate them to 'x' with:
# 
#     seqlevels(x) <- seqlevels(new_seqinfo)
# 
#   before calling the 'seqinfo()' setter. Note that you might need to
#   specify a pruning mode (via the 'pruning.mode' argument) if this
#   operation will drop seqlevels that are in use in 'x'.
#   For any more complicated mapping, please use the 'new2old' argument.

seqinfo(x) <- Seqinfo(c("chrX", "chrY"))
# Error in GenomeInfoDb:::makeNewSeqnames(x, new2old = new2old, seqlevels(value)) : 
#   The seqlevels in the supplied Seqinfo object are not the same as the
#   seqlevels in 'x'. Please use the 'new2old' argument to specify the
#   mapping between the formers and the latters.

The only situation where you get away with having to specify new2old is when you append new seqlevels to the existing ones:

seqinfo(x) <- Seqinfo(c("chrA", "chrB", "chrC"))

I'm not too keen on the new2old="reorder" thing. Sometimes the reordering is combined with the dropping of some seqlevels, which adds some complication if the seqlevels to drop are in use in x. It's easier to sort out what to do with the seqlevels first (via the seqlevels<- interface), and once the seqlevels in x are aligned with the seqlevels of the supplied Seqinfo, importing the new Seqinfo to x is straightforward and guaranteed to work.

LTLA commented 3 years ago

The improved error message looks good.

The behavior of seqlevels<- is documented in ?seqlevels.

I assume that you are referring to the comment chunk at the start of the Examples of ?seqlevels, which I just found. Why not promote it to a full section in the docs? Plus some explanation of the motivation (i.e., the reorder/rename ambiguity)? I only read the Examples when I want to, well, run some examples. (Which I didn't in this case, because I already knew the behavior, I just didn't understand the why.) I wasn't expecting an important conceptual detail to be tucked away there.

Also, there's a typo in

Examples below show how to mofify the seqlevels of GRanges, GRangesList, and TxDb objects but the approach is the same for any object that has seqlevels.

hpages commented 3 years ago

I assume that you are referring to the comment chunk at the start of the Examples of ?seqlevels

Nope. just below that: MODIFY THE SEQLEVELS OF A GRanges OBJECT

C'mon it's all in CAPITAL LETTERS!

No, seriously, I was hopping that explaining via some simple examples would be easier/clearer and less likely to be missed than having this lost in the middle of a big Details section that nobody will read.

I only read the Examples when I want to, well, run some examples.

Weird!

LTLA commented 3 years ago

I didn't even scroll to that point, even after 3-4 readings! I usually stop when I've exhausted all the content above See also, which I consider to be "all the stuff that the author says the function should do".

Anyway, after reading the examples, I've realized that the seqlevels<- setter behavior is actually quite complex. Perhaps you might consider giving it a little section of its own:

\section{Behavior of seqlevels<-}{
Finding overlaps, comparing, and matching operations between objects
containing genomic ranges require the objects to have the same
seqlevels, or they fail. So before one can perform these operations,
it is often necessary to modify the seqlevels in some of the objects
involved in the operation so that all the objects have the same
seqlevels. 

Such modifications are typically done with the \code{seqlevels()} setter. It can
rename, drop, add and reorder seqlevels of an object, depending on \code{value}.
\itemize{
\item If \code{value} is an unnamed character vector where the levels are a superset of those in \code{seqlevels(x)},
the new levels are added to \code{x} and all levels are ordered according to \code{value}.
\item If \code{value} is an unnamed character of the same length as (and is not a superset of) \code{seqlevels(x)},
each existing level in \code{x} is renamed with the corresponding entry in \code{value}.
Any existing seqlevels in \code{seqlevels(x)} should occur at the same position in \code{value}.
\item If \code{value} is an unnamed character vector that is shorter than \code{seqlevels(x)},
ranges in \code{x} are removed if they have seqlevels not in \code{value}.
The manner of removal requires specification of \code{pruning.mode}.
\item If \code{value} is a named character vector, any existing seqlevel \code{y} is changed to \code{value[y]}.
By default, all seqlevels in use should have names in \code{value}? % not really sure about this one. What happens with unnamed entries?
Otherwise, \code{pruning.mode} should be set to determine how to discard ranges with seqlevels not in \code{names(value)}.
}

See the examples below, which show how to modify the seqlevels of GRanges, GRangesList, and TxDb objects.
The approach is the same for any object that has seqlevels.

Note that the \code{seqinfo()} setter will not, in general, modify the seqlevels by default;
the only exception is that of appending new seqlevels to the existing ones in \code{x}.
Any call to the \code{seqinfo()} setter that modifies the seqlevels should either specify `new2old`
or should be preceded by a call to the \code{seqlevels()} setter as described above.
}

I'm not sure I've covered all the possibilities above correctly, even after staring at the examples for a while.

hpages commented 3 years ago

Ok, I'll add something like this.

Note that in the context of a binary operation between 2 objects x and y, both of them with seqinfo components, we generally want to (1) make sure that x and y are based on the same genome, and (2) adjust the seqinfo on x and y to make them identical before proceeding with the binary operation. In that context the idiom discussed above, that is:

seqlevels(x) <- seqlevels(y)
seqinfo(x) <- seqinfo(y)

is not really relevant. Instead the preferred idiom in that case is something like this:

si <- merge(seqinfo(x), seqinfo(y))  # will fail if seqinfo(x) and seqinfo(y) are incompatible
seqlevels(y) <- seqlevels(si)
seqinfo(x) <- seqinfo(y) <- si  # guaranteed to work

or some variant of it. Note that there's no need to do seqlevels(x) <- seqlevels(si) because merge(x, y) will always append the foreign seqlevels in y to the seqlevels in x.

I'll add this to the man page too.

LTLA commented 3 years ago

Thanks, I didn't know that second point either. Might be handy to have a mergeSeqlevels(x, y, ...) function to handle the three-liner above; it could be particularly valuable for multiple (>2) objects, rather than the user doing a bunch of sequential merges.

hpages commented 3 years ago

The merge() method for Seqinfo objects is N-ary.

LTLA commented 3 years ago

Probably should note that in the docs, I could only see merge(x, y) in ?Seqinfo.

Anyway, my main point with a hypothetical mergeSeqlevels was that it could also handle the various assignments, e.g.,

# Returns a list of inputs where seqlevels<- and seqinfo<- have all been run.
out <- mergeSeqlevels(x, y, z, aa, bb)

# or for a variable number of inputs, a one liner:
out <- do.call(mergeSeqlevels, inputs)

# vs the current state of affairs:
si <- merge(seqinfo(x), seqinfo(y), seqinfo(z), seqinfo(aa), seqinfo(bb))
seqlevels(y) <- seqlevels(z) <- seqlevels(aa) <- seqlevels(bb) <- seqlevels(si)
seqinfo(x) <- seqinfo(y) <- seqinfo(aa) <- seqinfo(bb) <- si

Seems like it would be a moderate quality-of-life improvement.

hpages commented 3 years ago

Quality-of-life improvement for who? I never run into a situation like this. Have you?

All the situations I've dealt with so far are binary (e.g. findOverlaps() and family, pcompare(), etc...). The only exception I can remember of is the bindROWS() method for GRanges objects, which is the workhorse behind combining a bunch of GRanges object with c(). It uses GenomicRanges:::combine_seqinfo_from_GenomicRanges_objects() internally to combine all the seqinfo objects. So it's a case where the 3-step idiom above is not even needed, only its 1st step.

Furthermore, if you need to operate on a big collection of GRanges objects that are all based on the same reference genome, you probably obtained them via a mechanism that guarantees that all their seqinfos are already the same, by construction. You might also want to consider storing them in a GRangesList object. Will probably be easier and more convenient to work with that, and they will all be forced to share the same seqinfo.

Finally what would out be for your mergeSeqlevels()? A list containing all the modified objects? yuck!! The name of the function ("mergeSeqlevels") was already a bad start.

I mean, I'm not saying this can't be worked out but that's beyond the scope of this issue.

LTLA commented 3 years ago

Happens all the time. With public datasets, you often run into the case where the different objects nominally come from the same genome but each of them are missing a few seqlevels here and there, depending on what the authors deigned to actually include in their genome index. Or possibly even with extra seqlevels, e.g., corresponding to transgenes or plasmids or whatever the authors slapped onto the end of a "standard" reference genome. Once you have multiple such datasets that you're trying to analyze at the same time, you run into non-binary seqinfo differences.

Anyway. I didn't think a list was particularly bad, but since you mention it, the hypothetical function might as well return a GRangesList. In fact, the GRangesList constructor itself does the job of mergeSeqlevels() quite nicely:

x <- GRanges("chrA:1-100")
y <- GRanges("chrB:2-200")
z <- GRanges("chrC:3-300")
aa <- GRanges("chrD:4-400")
bb <- GRanges("chrE:5-500")

# Naive approach gives me annoying warnings:
naive <- c(x, y, z, aa, bb)
## Warning messages:
## 1: In .Seqinfo.mergexy(x, y) :
##   The 2 combined objects have no sequence levels in common. (Use
##   suppressWarnings() to suppress this warning.)
## 2: In .Seqinfo.mergexy(x, y) :
##   The 2 combined objects have no sequence levels in common. (Use
##   suppressWarnings() to suppress this warning.)
## 3: In .Seqinfo.mergexy(x, y) :
##   The 2 combined objects have no sequence levels in common. (Use
##   suppressWarnings() to suppress this warning.)
## 4: In .Seqinfo.mergexy(x, y) :
##   The 2 combined objects have no sequence levels in common. (Use
##   suppressWarnings() to suppress this warning.)

# Nice and clean, no warnings.
out <- GRangesList(list(x, y, z, aa, bb))
unlist(out)
## GRanges object with 5 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrA     1-100      *
##   [2]     chrB     2-200      *
##   [3]     chrC     3-300      *
##   [4]     chrD     4-400      *
##   [5]     chrE     5-500      *
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths
seqinfo(out)
## Seqinfo object with 5 sequences from an unspecified genome; no seqlengths:
##   seqnames seqlengths isCircular genome
##   chrA             NA         NA   <NA>
##   chrB             NA         NA   <NA>
##   chrC             NA         NA   <NA>
##   chrD             NA         NA   <NA>
##   chrE             NA         NA   <NA>

Which seems like a nice little trick. I think I'll just use that in the future, though it is limited to GRanges inputs.

hpages commented 3 years ago

yep, GRangesList FTW!

hpages commented 3 years ago

FYI the 2-step idiom for using the seqinfo() setter is now documented in GenomeInfoDb 1.27.6.