Closed thallinger closed 2 years ago
I think you can consider another approach which is already implemented and also a bit more verbose in terms of classes and methods used
> fq <- readQualityScaledDNAStringSet(fl)[1:4]
> fq.trimmed <- subseq(fq, start=10, end=30)
> fq.trimmed
A QualityScaledDNAStringSet instance containing:
DNAStringSet object of length 4:
width seq names
[1] 21 AGGATACCCTCGCTTTCCTTC HWI-EAS88_1_1_1_1...
[2] 21 CCTATTAGTGGTTGAACAGCA HWI-EAS88_1_1_1_8...
[3] 21 TATAGTGTTATTAATATCAAT HWI-EAS88_1_1_1_9...
[4] 21 ATGTTATTTCTTCATTTGGAG HWI-EAS88_1_1_1_8...
PhredQuality object of length 4:
width seq
[1] 21 ]]]Y]Y]]]]]]]]]]]]VCH
[2] 21 ]]]Y]]]]]]]]]YPV]T][P
[3] 21 ]V]T]]]]]T]]]]]V]TMJE
[4] 21 ]]]]]]]]]]]]]T]]]]RJR
Your approach to subsetting is reusing the constructor, whereas subseq
is the method you actually want to use in this scenario.
Yes, Felix approach is the recommended way to load short DNA sequences and their qualities in Bioconductor.
Sorry I spoke too soon and I'm actually taking back my offer to modify DNAStringSet(x, start=, end=)
and narrow(x, start=, end=)
to also return and narrow the quality information stored in the metadata columns of DNAStringSet object x
. The reason for this is that the content of the metadata columns is free, has no well-defined meaning a priori, and so cannot be controlled by the DNAStringSet class and its methods. In other words, there's no way in general to know the impact of endomorphic transformations like narrow(x, ...)
or reverseComplement(x, ...)
on the metadata columns of x
.
Said more bluntly: the metadata columns of x
are the user business, not the class business.
This is a fundamental difference between information stored in a metadata column vs information stored in a slot of the object.
QualityScaledDNAStringSet objects have a quality
slot, a slot getter (quality()
), and the content of the slot is well-defined and controlled by the class and its methods. Any endomorphic transformation supported by the class (e.g. narrow()
, reverseComplement()
, etc...) will do whatever is appropriate to the quality
slot:
library(Biostrings)
dna <- DNAStringSet(c("AAA", "TTAGGGGGGTT", "CCCACCTAA"))
qual <- PhredQuality(c("+*-", "67897444445", "123456789"))
x <- QualityScaledDNAStringSet(dna, qual)
quality(x)
# PhredQuality object of length 3:
# width seq
# [1] 3 +*-
# [2] 11 67897444445
# [3] 9 123456789
quality(narrow(x, start=3))
# PhredQuality object of length 3:
# width seq
# [1] 1 -
# [2] 9 897444445
# [3] 7 3456789
quality(reverseComplement(x))
# PhredQuality object of length 3:
# width seq
# [1] 3 -*+
# [2] 11 54444479876
# [3] 9 987654321
In light of this, you could argue that it's not great design that readDNAStringSet(fl, format="fastq", with.qualities=TRUE)
returns a DNAStringSet object and not a QualityScaledDNAStringSet object, and maybe it should be modified to return the latter (it would still return an object that is considered to be a DNAStringSet from an is()
point of view because QualityScaledDNAStringSet extends DNAStringSet). However, note that this is exactly what readQualityScaledDNAStringSet()
does i.e. this function is just a thin wrapper around readDNAStringSet(fl, format="fastq", with.qualities=TRUE)
that turns the DNAStringSet object and its qualities returned by the latter into a QualityScaledDNAStringSet object.
In other words, think of readDNAStringSet(fl, format="fastq", with.qualities=TRUE)
as a low-level utility whose main purpose is to support readQualityScaledDNAStringSet()
.
I'm going to add a note to readDNAStringSet
's man page about this.
As for the DNAStringSet()
constructor function dropping the metadata columns of the input object, I'm going to modify it to emit a warning when this happens.
Hopefully these 2 changes will significantly reduce the risk of bad surprises w.r.t. handling of the qualities.
H.
Done in Biostrings 2.63.1 (see commit 18bce7ada639696e16ef1ab87f968cf12a174575).
When trimming a DNAStringSet containing qualites using
DNAStringSet(x, start=, end=)
, the quality information is lost:The quality information is present after reading the data with
readDNAStringSet()
, after trimming theelementMetadata
slot is empty.This is rather unexpected, IMHO the quality information should be retained and trimmed as well.
The environment is as follows: