Data streaming on `serratus-align` nodes

On serratus-dl, when data is decompressed via fastq-dump, a named pipe is created and the output fastq is split into fastq-blocks of 1M reads each. The blocks are not written to disk; they are piped directly to the S3 work bucket.
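For reference, the split-and-stream pattern described above can be sketched roughly as follows. This is not the actual serratus-dl code. It assumes GNU `split` (for `--filter`), and the function name `stream_fq_blocks`, the block prefix, and the `$SRA` / `$S3_BUCKET` variables are placeholders chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: split a FASTQ stream into fixed-size blocks and hand each block to a
# command on its stdin, so the blocks never touch local disk.

# stream_fq_blocks READS_PER_BLOCK PREFIX FILTER_CMD
#   Reads FASTQ from stdin. GNU split runs FILTER_CMD once per block, with the
#   block on stdin and the block name available as $FILE.
stream_fq_blocks() {
  local reads_per_block="$1" prefix="$2" filter_cmd="$3"
  # One FASTQ record is 4 lines, so N reads = 4*N lines.
  local lines=$(( reads_per_block * 4 ))
  split -l "$lines" -d -a 4 --filter="$filter_cmd" - "$prefix"
}

# Hypothetical usage (SRA and S3_BUCKET are placeholders, not project variables):
#   mkfifo reads.fifo
#   fastq-dump --stdout "$SRA" > reads.fifo &
#   stream_fq_blocks 1000000 "$SRA.fq." \
#     'aws s3 cp --only-show-errors - "s3://$S3_BUCKET/fq-blocks/$FILE"' < reads.fifo
```

Because the filter command runs in a separate shell, any variable it uses (such as the placeholder `S3_BUCKET`) has to be exported first.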

In contrast, serratus-align first downloads the fastq file (or two files for paired-end reads) to disk. bowtie2 then reads the file(s) for alignment and produces a .bam output file, which is uploaded back to the S3 work bucket. The download and upload time lowers CPU efficiency, since the worker thread is not performing alignment during that time. Streaming the fastq-blocks directly from S3 into bowtie2 should noticeably increase the performance of these nodes (~10-15% CPU usage).

Streaming the output bam file back to the work bucket may also be possible, but since bam is a reduced and compressed format relative to fastq, it is likely overkill. In fact, having partial bam-blocks on S3 after a termination may cause downstream problems and should be avoided.

Serratus-dl upload functionality starts here
Serratus-align download/get is currently `aws s3 cp --only-show-errors $S3_FQ1 ./`; this should instead be piped into the run_bowtie2.sh script, which needs to be modified to accept piped data rather than a file name string.
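A minimal sketch of what the piped single-end case could look like, under some assumptions: `aws s3 cp <uri> -` streams the object to stdout, and bowtie2 reads unpaired reads from stdin when given `-U -`. The function names `fetch_fq`, `align_stdin`, and `stream_align_se`, and the `$GENOME` index variable, are hypothetical and not taken from run_bowtie2.sh. The bam is written locally and only renamed on success, so a terminated job leaves no partial bam to upload.

```shell
#!/usr/bin/env bash
# Hypothetical hooks; override them to change the download or alignment step.
fetch_fq()    { aws s3 cp --only-show-errors "$1" -; }          # S3 object -> stdout
align_stdin() { bowtie2 -x "$GENOME" -U - | samtools view -b -; } # FASTQ on stdin -> BAM

# stream_align_se S3_FQ OUT_BAM
#   Streams one fastq-block from S3 straight into the aligner.
stream_align_se() {
  local s3_fq="$1" out_bam="$2"
  local tmp_bam="${out_bam}.partial"
  # pipefail: a failed download fails the whole pipeline, not just the last stage.
  if ( set -o pipefail; fetch_fq "$s3_fq" | align_stdin > "$tmp_bam" ); then
    mv "$tmp_bam" "$out_bam"   # only complete BAM files get the final name
  else
    rm -f "$tmp_bam"
    return 1
  fi
}

# Hypothetical usage:
#   stream_align_se "$S3_FQ1" block.bam && aws s3 cp --only-show-errors block.bam "$S3_BAM"
```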

An alternative is to migrate all download functions into the run_bowtie2 script and pass $S3_FQ1 and $S3_FQ2 in place of $FQ1 and $FQ2. This is likely the cleaner solution.
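One way this alternative might look, sketched with named pipes so that paired-end input (two simultaneous streams) works without touching disk. The function names `s3_stream`, `run_aligner`, and `align_from_s3`, and the `$GENOME` index variable, are hypothetical; it is also an assumption, untested here, that bowtie2 reads plain fastq from named pipes passed to `-1`/`-2`/`-U`.

```shell
#!/usr/bin/env bash
# Hypothetical hooks; the real script would inline or replace these.
s3_stream()   { aws s3 cp --only-show-errors "$1" -; }  # S3 object -> stdout
run_aligner() { bowtie2 -x "$GENOME" "$@"; }            # SAM on stdout

# align_from_s3 S3_FQ1 [S3_FQ2]
#   Takes S3 URIs instead of local file names and streams them via named pipes.
align_from_s3() {
  local s3_fq1="$1" s3_fq2="${2:-}" status=0 p
  local dir; dir="$(mktemp -d)"
  local pids=()

  mkfifo "$dir/fq1"
  s3_stream "$s3_fq1" > "$dir/fq1" &
  pids+=("$!")

  if [ -n "$s3_fq2" ]; then
    mkfifo "$dir/fq2"
    s3_stream "$s3_fq2" > "$dir/fq2" &
    pids+=("$!")
    run_aligner -1 "$dir/fq1" -2 "$dir/fq2" || status=$?
  else
    run_aligner -U "$dir/fq1" || status=$?
  fi

  # If the aligner failed, stop downloads that may still be blocked on a pipe.
  if [ "$status" -ne 0 ]; then kill "${pids[@]}" 2>/dev/null || true; fi
  # A failed download must fail the job even if the aligner exited cleanly.
  for p in "${pids[@]}"; do wait "$p" || status=$?; done

  rm -rf "$dir"
  return "$status"
}
```

Waiting on each download's exit status matters here: a truncated stream would otherwise look like a short but valid input to the aligner.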

ababaian / serratus

Data streaming on `serratus-align` nodes #163