Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
57 stars 16 forks source link

Quality to integer conversion #19

Closed FelixErnst closed 5 years ago

FelixErnst commented 5 years ago

Hi,

there is a bug causing a warning during conversion of a XStringQuality object containing multiple sequences to integer values. This is probaly due to charToRaw not being vectorized.

# works
as.integer(PhredQuality("!!!!!!!!!h"))
# causes warning
as.integer(PhredQuality(c("!!!!!!!!!h","!!!!!!!!!h","!!!!!!!!!h")))
Warning message:
In charToRaw(as.character(x)) :
  argument should be a character vector of length 1
all but the first element will be ignored

If this behavior unintended or am I using it wrongly?

Kind regards Felix

PS: If this is really unintended behavior, a fix for this would concern line 54 of R/XStringQuality-class.R only. A change from

value <- as.integer(charToRaw(as.character(x))) - offset(scale)

to

value <- IntegerList(lapply(lapply(as.character(q),charToRaw),as.integer)) - offset(scale)

or

if(is.null(names(q))){
    unname(split(as.integer(unlist(q)),Rle(seq_along(q),width(q)))) - offset(scale)
} else { 
    split(as.integer(unlist(q)),Rle(names(q),width(q))) - offset(scale)
}

I didn't do any benchmarking.

hpages commented 5 years ago

Hi @FelixErnst ,

I think that as.integer() was intended to be used on XStringQuality objects of length 1 only. What could it return on an XStringQuality object of arbitrary length? Making it return an IntegerList object or ordinary list would be wrong because as.integer() should always return an integer vector, that is, an atomic vector.

When your XStringQuality object is of arbitrary length, you should coerce to IntegerList instead:

library(Biostrings)
pq <- PhredQuality(c("*+,-./", "0123456789:;"))
as(pq, "IntegerList")
# IntegerList of length 2
# [[1]] 9 10 11 12 13 14
# [[2]] 15 16 17 18 19 20 21 22 23 24 25 26

Starting with Biostrings 2.51.2, calling as.integer() or as.numeric() on an XStringQuality of length != 1 is now an error. The error message suggests to coerce to IntegerList or NumericList instead:

as.integer(pq)
# Error in .BStringSet2integer(x, scale) : 
#   as.integer()/as.numeric() can be used on a PhredQuality object of length 1 only.
#   Coerce to IntegerList or NumericList (with 'as( , "IntegerList")'
#   or 'as( , "NumericList")') when the object is of arbitrary length.

Hope that works for you. H.

PS: If you're using BioC 3.9 (the current devel version of Bioconductor, requires R 3.6), Biostrings 2.51.2 will become available via BiocManager::install() in the next 24h or so.

FelixErnst commented 5 years ago

Hi @hpages,

that seams reasonable. I didn't think about the specific return type at all. My first instinct was to use the as.integer function and I didn't think about coercing it to an IntegerList, which of course works fine.

However, just to bring it up: I find this a bit inconsistent. The XStringQualityobjects are derived from XStringSet. Shouldn't this mean, that they behave like a List as well, since they derive from it? Apart from as.character this would require a list like object to be returned to be able to hold the data. I would argue, that an XStringSet of length of 1 does not require such a special way of handling it.

(Am I right to assume, that this is just a shortcut for implementing a XStringQuality object derived from the XString class? I don't think as.integer is used internally either, is it?)

hpages commented 5 years ago

Hi @FelixErnst

Yes XStringQuality derives from List but what do you mean exactly when you say that an XStringQuality object should behave like a List? This is a pretty general statement and I agree with it. But I'm not sure what specific operation on a XStringQuality object do you think violates that statement. Are you arguing that as.integer() should return a list or IntegerList object? Unfortunately that is not an option. This would violate the very strong and universal expectation that as.integer() ALWAYS returns an atomic vector.

Regards, H.

FelixErnst commented 5 years ago

Hi @hpages

Now from your comments, it became clear that the result type of as.integer is the main focus and should behave as expected, which it does now (I learned something :-) ).

With this in mind I noticed after your initial comment , that as.integer should only work on a QualityString, but not on a QualityStringSet. The first class name is of course taken, potentially for historic reasons, and in the context of all the other classes in Biostrings QualityString should be named QualityStringSet.

This is what I meant with inconsistency. Maybe I am wrong, but the implementation of as.integer is misleading, since it should not be implemented in the first place for a list. as.integer(list(c("1","2"))) does not work, but is basically the equivalent of as.integer(PhredQuality("!!!!")).

I am not sure why the "correct" QualityString class derived from BString was never implemented and the name was recycled for a class which should actually be named QualityStringSet. There is probably a good reason, but I currently don't see it, which leaves the inconsistency with the special treatment for length = 1.

Best, Felix

hpages commented 5 years ago

These are good points.

Note that there are no QualityString or QualityStringSet classes, only an XStringQuality class which extends BStringSet. Yes the name of the XStringQuality class is misleading. It should probably have been QualityStringSet or QStringSet. And yes the original developer of this class should probably have implemented the QString class first and build on top of that. But that's not what he did and we have to live with his early choices. That includes living with the fact that as.integer() works on an XStringQuality object of length 1 (when it should have worked on QString objects only and for QStringSet objects the user would have had to always use coercion to IntegerList, even when the object has length 1).

There are other inconsistencies/incongruities with these classes but that doesn't mean we should add more by allowing as.integer() to work on XStringQuality objects of arbitrary length. Rather than trying to fix the XStringQuality classes and subclasses I think that at some point we should rather start fresh by implementing the QString and QStringSet classes, and their subclasses PhredQString/PhredQStringSet, SolexaQString/SolexaQStringSet, and IlluminaQString/IlluminaQStringSet.

Unfortunately I don't have much time to dedicate to the Biostrings package these days. Note that the package actually needs a major overhaul and the addition of long wanted features especially in the pairwiseAlignment() department but not only (other long wanted features are restriction of the letters used in an AAString object, implementation of vmatchPDict(), and more...). I already talked with Eric Wright about this (Eric is the maintainer of the DECIPHER package which relies heavily on Biostrings). The good news is that Eric is willing to provide some resources to help this happen but I still need to make a detailed list (possibly with roadmap) of what needs to happen. I'll try to put something on Biostrings' built-in GitHub Wiki soon.

H.

FelixErnst commented 5 years ago

I submitted a package which relies heavily on the Biostrings package and there will probably another contribution in a month or so, which will intersect closely with the Biostring package.

https://github.com/Bioconductor/Contributions/issues/962

So, if I can be of help during the overhaul please let me know.

Edit: the issue probably can be closed

hpages commented 5 years ago

Thx for your contribution (I'll take a look at it) and for offering to help with the overhaul. -- H.