Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
57 stars 16 forks source link

DNAStringSet(x, start=, end=) loses quality information #61

Closed thallinger closed 2 years ago

thallinger commented 2 years ago

When trimming a DNAStringSet containing qualites using DNAStringSet(x, start=, end=), the quality information is lost:

library(Biostrings)

fl <- system.file( package="Biostrings", "extdata", "s_1_sequence.txt")
fq <- readDNAStringSet(fl, format="fastq", with.qualities=TRUE)[1:4]
print(class(elementMetadata(fq)))
# [1] "DFrame"
# attr(,"package")
# [1] "S4Vectors"
fq.trimmed <- DNAStringSet(fq, start=10, end=30)
print(class(elementMetadata(fq.trimmed)))
# [1] "NULL"

The quality information is present after reading the data with readDNAStringSet(), after trimming the elementMetadata slot is empty.

This is rather unexpected, IMHO the quality information should be retained and trimmed as well.

The environment is as follows:

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ShortRead_1.52.0            GenomicAlignments_1.30.0    SummarizedExperiment_1.24.0
 [4] Biobase_2.54.0              MatrixGenerics_1.6.0        matrixStats_0.61.0         
 [7] Rsamtools_2.10.0            GenomicRanges_1.46.1        BiocParallel_1.28.2        
[10] Biostrings_2.62.0           GenomeInfoDb_1.30.0         XVector_0.34.0             
[13] IRanges_2.28.0              S4Vectors_0.32.3            BiocGenerics_0.40.0        

loaded via a namespace (and not attached):
 [1] zlibbioc_1.40.0        lattice_0.20-45        jpeg_0.1-9             hwriter_1.3.2         
 [5] tools_4.1.2            parallel_4.1.2         grid_4.1.2             png_0.1-7             
 [9] latticeExtra_0.6-29    crayon_1.4.2           Matrix_1.3-4           GenomeInfoDbData_1.2.7
[13] RColorBrewer_1.1-2     bitops_1.0-7           RCurl_1.98-1.5         DelayedArray_0.20.0   
[17] compiler_4.1.2        
FelixErnst commented 2 years ago

I think you can consider another approach which is already implemented and also a bit more verbose in terms of classes and methods used

> fq <- readQualityScaledDNAStringSet(fl)[1:4]
> fq.trimmed <- subseq(fq, start=10, end=30)
> fq.trimmed
  A QualityScaledDNAStringSet instance containing:

DNAStringSet object of length 4:
    width seq                                                                                                       names               
[1]    21 AGGATACCCTCGCTTTCCTTC                                                                                     HWI-EAS88_1_1_1_1...
[2]    21 CCTATTAGTGGTTGAACAGCA                                                                                     HWI-EAS88_1_1_1_8...
[3]    21 TATAGTGTTATTAATATCAAT                                                                                     HWI-EAS88_1_1_1_9...
[4]    21 ATGTTATTTCTTCATTTGGAG                                                                                     HWI-EAS88_1_1_1_8...

PhredQuality object of length 4:
    width seq
[1]    21 ]]]Y]Y]]]]]]]]]]]]VCH
[2]    21 ]]]Y]]]]]]]]]YPV]T][P
[3]    21 ]V]T]]]]]T]]]]]V]TMJE
[4]    21 ]]]]]]]]]]]]]T]]]]RJR

Your approach to subsetting is reusing the constructor, whereas subseq is the method you actually want to use in this scenario.

hpages commented 2 years ago

Yes, Felix approach is the recommended way to load short DNA sequences and their qualities in Bioconductor.

Sorry I spoke too soon and I'm actually taking back my offer to modify DNAStringSet(x, start=, end=) and narrow(x, start=, end=) to also return and narrow the quality information stored in the metadata columns of DNAStringSet object x. The reason for this is that the content of the metadata columns is free, has no well-defined meaning a priori, and so cannot be controlled by the DNAStringSet class and its methods. In other words, there's no way in general to know the impact of endomorphic transformations like narrow(x, ...) or reverseComplement(x, ...) on the metadata columns of x.

Said more bluntly: the metadata columns of x are the user business, not the class business.

This is a fundamental difference between information stored in a metadata column vs information stored in a slot of the object.

QualityScaledDNAStringSet objects have a quality slot, a slot getter (quality()), and the content of the slot is well-defined and controlled by the class and its methods. Any endomorphic transformation supported by the class (e.g. narrow(), reverseComplement(), etc...) will do whatever is appropriate to the quality slot:

library(Biostrings)
dna <- DNAStringSet(c("AAA", "TTAGGGGGGTT", "CCCACCTAA"))
qual <- PhredQuality(c("+*-", "67897444445", "123456789"))
x <- QualityScaledDNAStringSet(dna, qual)

quality(x)
# PhredQuality object of length 3:
#     width seq
# [1]     3 +*-
# [2]    11 67897444445
# [3]     9 123456789

quality(narrow(x, start=3))
# PhredQuality object of length 3:
#     width seq
# [1]     1 -
# [2]     9 897444445
# [3]     7 3456789

quality(reverseComplement(x))
# PhredQuality object of length 3:
#     width seq
# [1]     3 -*+
# [2]    11 54444479876
# [3]     9 987654321

In light of this, you could argue that it's not great design that readDNAStringSet(fl, format="fastq", with.qualities=TRUE) returns a DNAStringSet object and not a QualityScaledDNAStringSet object, and maybe it should be modified to return the latter (it would still return an object that is considered to be a DNAStringSet from an is() point of view because QualityScaledDNAStringSet extends DNAStringSet). However, note that this is exactly what readQualityScaledDNAStringSet() does i.e. this function is just a thin wrapper around readDNAStringSet(fl, format="fastq", with.qualities=TRUE) that turns the DNAStringSet object and its qualities returned by the latter into a QualityScaledDNAStringSet object.

In other words, think of readDNAStringSet(fl, format="fastq", with.qualities=TRUE) as a low-level utility whose main purpose is to support readQualityScaledDNAStringSet().

I'm going to add a note to readDNAStringSet's man page about this.

As for the DNAStringSet() constructor function dropping the metadata columns of the input object, I'm going to modify it to emit a warning when this happens.

Hopefully these 2 changes will significantly reduce the risk of bad surprises w.r.t. handling of the qualities.

H.

hpages commented 2 years ago

Done in Biostrings 2.63.1 (see commit 18bce7ada639696e16ef1ab87f968cf12a174575).