Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
57 stars 16 forks source link

readDNAStringSet for FASTQ files with variable length reads #8

Closed LTLA closed 6 years ago

LTLA commented 6 years ago

It seems as if variable length reads in FASTQ files are not yet supported. For example, if I have a FASTQ file named blah.fq with the following contents:

@READ1
ACTAGCGACACT
+
981731287232
@READ2
CTAGACT
+
3723123

... and I run (on Biostrings version 2.47.9):

library(Biostrings)
readDNAStringSet("blah.fq", format="fastq")

... I get the error:

Error in .Call2("read_XStringSet_from_fastq", filexp_list, nrec, skip,  : 
  read_XStringSet_from_fastq(): FASTQ files with variable sequence lengths are not supported yet

Is support for variable-length FASTQ files on the radar? This is the typical case when dealing with Nanopore data. I know I can use the ShortRead package but it seems unnecessary to reach for a different package when there seems to be a perfectly suitable function in Biostrings.

hpages commented 6 years ago

Hi Aaron, I added support for FASTQ files with variable-length reads in Biostrings 2.47.10. Let me know if you run into problems. H.

LTLA commented 6 years ago

Sweet, thanks Herve.

LTLA commented 6 years ago

Ah - whoops - now that I actually check it, though, I get:

X <- readDNAStringSet("blah.fq", format="fastq")
X
##   A DNAStringSet instance of length 2
##     width seq                                               names               
## [1]    12 ACTAGCGACACT                                      READ1
## [2]     7 CTAGACT                                           READ2

... so the quality scores haven't been pulled in.

hpages commented 6 years ago

readDNAStringSet() has been ignoring the qualities so far. I'm currently working on changing this (as requested a long time ago by Erik Wright, the author of the DECIPHER package, but I never really had the time to get to this).

hpages commented 6 years ago

Hi Aaron @LTLA , This is now in Biostrings 2.47.11: https://github.com/Bioconductor/Biostrings/commit/9452ca25cd14af6ab8c5b520b74d0f7f6ba6d070 See also ?readDNAStringSet. H.

LTLA commented 6 years ago

Hm. I'm still getting this with 2.47.11, using the blah.fq example in the original post:

readDNAStringSet("blah.fq", format="fastq", with.qualities=TRUE)
##   A DNAStringSet instance of length 2
##     width seq                                               names               
## [1]    12 ACTAGCGACACT                                      READ1
## [2]     7 CTAGACT                                           READ2

Am I still missing some arguments?

hpages commented 6 years ago

The show method for XStringSet objects does not display the metadata columns but they are here. See the examples in ?readDNAStringSet (section "B. READ/WRITE FASTQ FILES").

LTLA commented 6 years ago

Right. Perhaps it would be more intuitive to output a QualityScaledDNAStringSet, especially as this can be directly used in further Biostrings functions.

hpages commented 6 years ago

I would need a way to know whether the qualities in the file must go in a PhredQuality, SolexaQuality, or IlluminaQuality object. I could add an extra argument (e.g. quality.type) to readDNAStringSet() to let the user tell me that. For now my preference is to keep readDNAStringSet() low-level. It's easy enough for the user to do something like:

reads <- readDNAStringSet(filepath, format="fastq", with.qualities=TRUE)
quals <- PhredQuality(mcols(reads)$qualities)  # or SolexaQuality(), or IlluminaQuality()
QualityScaledDNAStringSet(reads, quals)

I added this example to the man page: https://github.com/Bioconductor/Biostrings/commit/9e92e22257dd985710f6bb8cd105c99cd14b1531

Hope that works for you.

LTLA commented 6 years ago

Okay, fair enough. Thanks Herve.

hpages commented 6 years ago

After giving it a 2nd thought, I decided to go for this: https://github.com/Bioconductor/Biostrings/commit/4ed20b48b3c64647638e46f0729301d152b39297 It provides the functionality you were looking for but via more specialized functions. readDNAStringSet() and writeXStringSet() remain low-level and I don't need to make their interfaces more complicated than they already are.

LTLA commented 6 years ago

Sweet, I can see myself using this often, thanks.