Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings

Streaming read sequences from a connection with readDNAStringSet #16

Closed: LTLA closed this issue 6 years ago

LTLA commented 6 years ago

Would it be much work to allow reads to be streamed in from a connection to a FASTQ file? Something like:

fhandle <- open_some_FASTQ_file("my.fastq")
first.chunk <- readDNAStringSet(fhandle, nrec=1000) # first 1000
second.chunk <- readDNAStringSet(fhandle, nrec=1000) # next 1000
# etc.
close(fhandle)

This would allow us to process reads in blocks, which would be more memory-friendly than reading the entire FASTQ file into memory and processing it all at once. To achieve this right now, I would need to use the skip argument, which is presumably less efficient because each call has to re-scan the skipped records from the start of the file.
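
The block-wise pattern described above can be sketched with base R connections alone. This is only an illustration of the idea, not how Biostrings implements it: `read_chunk` is a hypothetical helper, the reads are made up, and the code assumes every FASTQ record spans exactly four lines (no wrapped sequence lines).

```r
# Write a tiny 5-record FASTQ file to a temporary location (made-up reads).
fq <- tempfile(fileext = ".fastq")
recs <- sprintf("@read%d\n%s\n+\n%s", 1:5,
                c("ACGT", "GGCA", "TTAG", "CCGA", "ATAT"),
                "IIII")
writeLines(recs, fq)

# Hypothetical helper: read up to `nrec` records (assumed 4 lines each)
# from an already-open connection and return only the sequence lines.
read_chunk <- function(con, nrec) {
  lines <- readLines(con, n = 4L * nrec)
  lines[seq_along(lines) %% 4L == 2L]
}

# An open connection remembers its position, so each call picks up where
# the previous one stopped; nothing is re-read from the start of the file.
con <- file(fq, open = "r")
chunks <- list()
repeat {
  seqs <- read_chunk(con, nrec = 2L)
  if (length(seqs) == 0L) break   # end of file reached
  chunks[[length(chunks) + 1L]] <- seqs
}
close(con)
```

With `skip`, by contrast, every call opens the file afresh and has to pass over all earlier records before reaching the requested block, so the total work grows roughly quadratically with the number of blocks.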

mtmorgan commented 6 years ago

See ShortRead::FastqStreamer.
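
For readers unfamiliar with it, a minimal sketch of the streaming pattern follows. It assumes the ShortRead package is installed; the temporary file and its reads are made up for illustration, and the chunk size of 2 is chosen only to keep the example small.

```r
library(ShortRead)

# Write a tiny 5-record FASTQ file to a temporary location (made-up reads).
fq <- tempfile(fileext = ".fastq")
writeLines(sprintf("@read%d\n%s\n+\n%s", 1:5,
                   c("ACGT", "GGCA", "TTAG", "CCGA", "ATAT"),
                   "IIII"), fq)

strm <- FastqStreamer(fq, n = 2)   # each yield() returns at most 2 records
total <- 0L
repeat {
  chunk <- yield(strm)             # a ShortReadQ object
  if (length(chunk) == 0L) break   # an empty chunk signals end of file
  reads <- sread(chunk)            # the sequences, as a DNAStringSet
  total <- total + length(reads)   # stand-in for real per-chunk processing
}
close(strm)
```

Only one chunk is held in memory at a time, which is the behaviour the original request asks for.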

LTLA commented 6 years ago

Thanks Martin, this will work perfectly.

hpages commented 6 years ago

FWIW, I added support for this in https://github.com/Bioconductor/Biostrings/commit/9f6894bfe61d86fdc49ba34e0aa248d1e97ee13d