ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Data streaming on `serratus-align` nodes #163

Open ababaian opened 4 years ago

ababaian commented 4 years ago

On serratus-dl, when data is decompressed via fastq-dump, a named pipe is created and the output fastq stream is split into fastq-blocks of 1M reads each. The blocks are not written to disk; they are piped directly to the S3 work bucket.
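
For illustration, the split-and-stream step described above could look roughly like the sketch below. It is not the actual serratus-dl code: `split_fastq_stream`, `READS_PER_BLOCK`, `UPLOAD_CMD` and the bucket name are hypothetical, and it assumes GNU `split` (for `--filter`) and the AWS CLI, whose `aws s3 cp - <s3-uri>` form uploads from stdin.

```shell
# Sketch only: split a FASTQ stream into fixed-size blocks and hand each
# block to an upload command, without writing the blocks to local disk.
# All names below are illustrative assumptions, not serratus identifiers.

READS_PER_BLOCK="${READS_PER_BLOCK:-1000000}"

# Command run once per block. It receives the block on stdin and the block
# name in $FILE (set by GNU split). The bucket name is a placeholder.
if [ -z "${UPLOAD_CMD:-}" ]; then
  UPLOAD_CMD='aws s3 cp - "s3://example-work-bucket/fq-blocks/$FILE"'
fi

# split_fastq_stream PREFIX
# Reads FASTQ on stdin. One FASTQ record is 4 lines, so a block of
# N reads is 4*N lines. Blocks are named PREFIX0000, PREFIX0001, ...
split_fastq_stream() {
  prefix="$1"
  split -l "$(( READS_PER_BLOCK * 4 ))" -d -a 4 \
        --filter="$UPLOAD_CMD" - "$prefix"
}
```

With a named pipe, a producer such as `fastq-dump -Z` (output to stdout) could write into the pipe in the background while `split_fastq_stream SRR.fq. < fq.pipe` consumes it.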

In contrast, serratus-align first downloads the fastq file (two files in the case of paired-end reads) to disk. bowtie2 then reads the file(s) for alignment and produces a .bam output file, which is uploaded back to the S3 work bucket. This download/upload time lowers CPU efficiency, since the worker thread performs no alignment during that time. Streaming the fastq-blocks directly from S3 into bowtie2 should noticeably increase the performance of these nodes (roughly 10-15% in CPU usage).
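
A minimal sketch of the streaming idea for paired-end blocks, using two named pipes so bowtie2 reads the mates while they are still downloading. `stream_align`, `FETCH_CMD` and `ALIGN_CMD` are hypothetical names introduced here; the defaults assume the AWS CLI (`aws s3 cp <s3-uri> -` writes the object to stdout) and bowtie2's `-x`, `-1` and `-2` options. It has not been tested against the serratus containers.

```shell
# Sketch only: stream two FASTQ blocks from S3 into the aligner via FIFOs.
# The commands are overridable so the flow can be dry-run without AWS
# or bowtie2 installed.
FETCH_CMD="${FETCH_CMD:-aws s3 cp}"   # invoked as: $FETCH_CMD <s3-uri> -
ALIGN_CMD="${ALIGN_CMD:-bowtie2}"     # invoked with -x INDEX -1 FQ1 -2 FQ2

# stream_align INDEX S3_FQ1 S3_FQ2
# Writes the aligner output (SAM for bowtie2) to stdout.
stream_align() {
  index="$1"; s3_fq1="$2"; s3_fq2="$3"
  rc=0
  tmpd="$(mktemp -d)" || return 1
  mkfifo "$tmpd/fq1" "$tmpd/fq2" || return 1

  # Start both downloads in the background; each blocks until the
  # aligner opens the matching pipe for reading.
  $FETCH_CMD "$s3_fq1" - > "$tmpd/fq1" & pid1=$!
  $FETCH_CMD "$s3_fq2" - > "$tmpd/fq2" & pid2=$!

  $ALIGN_CMD -x "$index" -1 "$tmpd/fq1" -2 "$tmpd/fq2" || rc=$?

  # If the aligner failed, the downloads may still be blocked on the pipes.
  if [ "$rc" -ne 0 ]; then kill "$pid1" "$pid2" 2>/dev/null; fi
  wait "$pid1" || rc=$?
  wait "$pid2" || rc=$?

  rm -rf "$tmpd"
  return "$rc"
}
```

The result could then be converted and written locally before upload, for example `stream_align "$INDEX" "$S3_FQ1" "$S3_FQ2" | samtools view -b - > block.bam`, where `$INDEX` stands for whatever index path the node already uses.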

Streaming the output bam file back to the work bucket may also be possible, but since bam is a reduced and compressed format relative to fastq, it is likely overkill. In fact, partial bam-blocks left on S3 after a node termination could cause downstream problems and should be avoided.

An alternative is to migrate all download functions into the run_bowtie2 script and pass $S3_FQ1 and $S3_FQ2 in place of $FQ1 and $FQ2. This is likely a cleaner solution.
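
One way this alternative might be sketched: a small helper inside the alignment script that accepts either a local path or an S3 URI and emits the fastq content on stdout. `open_fastq` and `FETCH_CMD` are hypothetical names, not part of the existing run_bowtie2 script, and the default assumes the AWS CLI.

```shell
# Sketch only: let the alignment script accept a local path or an S3 URI.
FETCH_CMD="${FETCH_CMD:-aws s3 cp}"   # invoked as: $FETCH_CMD <s3-uri> -

# open_fastq SOURCE
# Writes the FASTQ content of SOURCE to stdout.
open_fastq() {
  case "$1" in
    s3://*) $FETCH_CMD "$1" - ;;   # stream the object from S3
    *)      cat -- "$1" ;;         # plain local file
  esac
}
```

In bash, the script could then feed bowtie2 through process substitution, e.g. `-1 <(open_fastq "$S3_FQ1") -2 <(open_fastq "$S3_FQ2")`, so existing callers passing local $FQ1 and $FQ2 paths would keep working.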