FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0

Using TrimGalore with named pipes as I/O #49

Closed rob-p closed 5 years ago

rob-p commented 5 years ago

Hi @FelixKrueger,

I'm trying to understand the best way to put together a pipeline (note: the real work here is being done by @cgreene & @greenelab) that consumes an SRA file, writes the FASTQ files to named pipes, adapter-trims (and lightly quality-trims) the reads, and then quantifies them using salmon (with mapping validation). Casey and his group have figured out how to cajole sra-dump into writing the resulting FASTQ reads to a named pipe, and I was wondering whether there is a way for Trim Galore to consume these and write its output to a named pipe as well. This would avoid intermediate disk I/O, which, as I understand it, is quite expensive on the cloud platforms on which this pipeline will primarily be run.

Thanks! Rob
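For readers unfamiliar with the idea, here is a minimal Python sketch of a producer, a trimmer, and a consumer chained through two named pipes, so that no intermediate FASTQ file is written to disk. Everything in it is an assumption made for illustration: `write_fastq` stands in for the SRA dump step, `trim_3prime` is a toy fixed-length trimmer (it is not Trim Galore or Cutadapt), and the pipe names are made up. It needs a Unix-like system, since it relies on `os.mkfifo`.

```python
import os
import tempfile
import threading


def write_fastq(path, records):
    # Hypothetical producer. Opening a FIFO for writing blocks until
    # some reader has opened the other end.
    with open(path, "w") as fh:
        for name, seq, qual in records:
            fh.write(f"@{name}\n{seq}\n+\n{qual}\n")


def trim_3prime(in_path, out_path, n):
    # Toy stand-in for a trimmer: drop the last n bases (and quality
    # values) of every read while streaming from one pipe to the other.
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            header = src.readline()
            if not header:
                break  # producer closed its end of the pipe
            seq = src.readline().rstrip("\n")
            plus = src.readline()
            qual = src.readline().rstrip("\n")
            keep = max(len(seq) - n, 0)
            dst.write(f"{header}{seq[:keep]}\n{plus}{qual[:keep]}\n")


def run_pipeline(records, n=2):
    with tempfile.TemporaryDirectory() as tmp:
        raw = os.path.join(tmp, "raw.fastq")          # hypothetical pipe names
        trimmed = os.path.join(tmp, "trimmed.fastq")
        os.mkfifo(raw)
        os.mkfifo(trimmed)

        producer = threading.Thread(target=write_fastq, args=(raw, records))
        trimmer = threading.Thread(target=trim_3prime, args=(raw, trimmed, n))
        producer.start()
        trimmer.start()

        # The final consumer (a quantifier in the real pipeline) reads
        # the trimmed stream straight from the second pipe.
        with open(trimmed) as fh:
            out = fh.read()

        producer.join()
        trimmer.join()
        return out


if __name__ == "__main__":
    print(run_pipeline([("r1", "ACGTAC", "IIIIII")]), end="")
```

All three stages have to be running at the same time, because each end of a pipe blocks until the other end is opened. That is why the producer and the trimmer run in their own threads here.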

FelixKrueger commented 5 years ago

Hi @rob-p,

Thanks for your comments. I have to admit that I have not so far come across the idea of reading from, or writing to, named pipes instead of files. Conceptually it is probably not that hard to understand; however, I am somewhat concerned that it might not work for Trim Galore as it stands without some major refactoring.

I guess it could probably be made to work fairly quickly for standard single-end files, but it might not be as straightforward for paired-end or RRBS mode.
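To illustrate one reason paired-end input can be less straightforward with pipes (this is an assumption, not something spelled out in this thread): when both mates arrive on separate named pipes, the consumer generally has to drain the two streams in lockstep, because reading one pipe to the end before touching the other can stall once the pipe buffers fill up. The Python sketch below shows lockstep reading only. `write_pairs` is a hypothetical producer, the pipe names are made up, and none of this reflects how Trim Galore handles paired-end data internally.

```python
import os
import tempfile
import threading
from itertools import islice


def read_records(fh):
    # Yield FASTQ records as 4-line tuples until the stream ends.
    while True:
        rec = tuple(islice(fh, 4))
        if len(rec) < 4:
            return
        yield rec


def write_pairs(path1, path2, pairs):
    # Hypothetical producer that alternates between the two mates.
    with open(path1, "w") as f1, open(path2, "w") as f2:
        for (n1, s1), (n2, s2) in pairs:
            f1.write(f"@{n1}\n{s1}\n+\n{'I' * len(s1)}\n")
            f2.write(f"@{n2}\n{s2}\n+\n{'I' * len(s2)}\n")


def paired_lengths(pairs):
    with tempfile.TemporaryDirectory() as tmp:
        p1 = os.path.join(tmp, "r1.fastq")  # hypothetical pipe names
        p2 = os.path.join(tmp, "r2.fastq")
        os.mkfifo(p1)
        os.mkfifo(p2)

        writer = threading.Thread(target=write_pairs, args=(p1, p2, pairs))
        writer.start()

        result = []
        # Open the pipes in the same order as the producer, then pull
        # one record from each per iteration so neither stream backs up.
        with open(p1) as f1, open(p2) as f2:
            for rec1, rec2 in zip(read_records(f1), read_records(f2)):
                result.append((len(rec1[1].strip()), len(rec2[1].strip())))

        writer.join()
        return result
```

A real paired-end trimmer would additionally need to keep or discard both mates together, which this sketch does not attempt.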

Is the cloud really that expensive, even if you require the space only for a few minutes/hours?

cgreene commented 5 years ago

@FelixKrueger : I can fill in a little bit. Here's our solution that uses named pipes: https://github.com/AlexsLemonade/refinebio/pull/1106

Unfortunately the cloud architecture is not performant with many writes to disk (remember, the disks are not directly attached to the machines), and so the intermediate file workflow sounds like it'd be a performance killer for us.

FelixKrueger commented 5 years ago

It appears that fastp is now being used for trimming purposes in refinebio, so we can close this for the time being.