When data is decompressed via `fastq-dump` on `serratus-dl`, a named pipe is created and the output fastq file is `split` into fastq-blocks of 1M reads each. The blocks are not written to disk; they are piped directly to the S3 work bucket.
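A minimal sketch of that split-and-stream pattern, not the actual `serratus-dl` code. It assumes GNU `split` (for `--filter`), uses an ordinary pipe rather than a named pipe, and the function name `stream_fq_blocks`, the `fq-block.` prefix, and the `DEST` variable are all hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: split a fastq stream on stdin into blocks and hand each
# block to a command, without writing the blocks to local disk.
#   $1  destination prefix (for example an S3 prefix)
#   $2  reads per block (default 1000000; one fastq record is 4 lines)
#   $3  per-block command (default: stream the block from stdin to S3)
stream_fq_blocks() {
  local dest="$1"
  local reads="${2:-1000000}"
  local cmd="$3"
  if [ -z "$cmd" ]; then
    # split sets $FILE to the name the block would have had on disk.
    cmd='aws s3 cp --only-show-errors - "$DEST/$FILE"'
  fi
  # DEST is exported to the filter command through split's environment.
  DEST="$dest" split -l "$(( reads * 4 ))" -d -a 4 --filter="$cmd" - fq-block.
}
```

It could be driven with something like `fastq-dump -Z "$SRA" | stream_fq_blocks "$S3_PREFIX"`, where `$SRA` and `$S3_PREFIX` are placeholders.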
In contrast, `serratus-align` first downloads the fastq file (or two files, for paired-end reads) to disk. `bowtie2` then reads the file for alignment and produces a `.bam` output file, which is uploaded back to the S3 work bucket. This upload/download time reduces CPU efficiency, since the worker thread is not performing alignment while it waits. Streaming the fastq-blocks directly from S3 into `bowtie2` should noticeably improve the performance of these nodes (roughly 10-15% of CPU usage).
Streaming the output bam file back to the work bucket may also be possible, but since bam is a reduced and compressed format relative to fastq, it is likely overkill. In fact, partial bam-blocks left on S3 after a termination may cause downstream problems and should be avoided.
The `serratus-align` download/get step is currently `aws s3 cp --only-show-errors $S3_FQ1 ./`. Its output should instead be piped into the `run_bowtie2.sh` script, which needs to be modified to accept piped data rather than a file name string.
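One way a script could accept piped data as well as a path is sketched here. The helper name `read_fq` is hypothetical and is not part of `run_bowtie2.sh`; the sketch relies on `aws s3 cp <uri> -` writing the object to stdout and on `bowtie2 -U -` reading unpaired reads from stdin:

```shell
#!/bin/bash
# Hypothetical helper: emit fastq reads on stdout from stdin ("-"),
# from an S3 URI (streamed, no local copy), or from a local file.
read_fq() {
  case "$1" in
    -)      cat ;;
    s3://*) aws s3 cp --only-show-errors "$1" - ;;
    *)      cat "$1" ;;
  esac
}

# Possible use for unpaired reads ($INDEX is a placeholder):
#   read_fq "$S3_FQ1" | bowtie2 -x "$INDEX" -U -
```

Only the unpaired case fits a single stdin pipe; paired-end input needs two separate streams.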
An alternative is to migrate all download functions into the `run_bowtie2` script and pass the S3 paths `$S3_FQ1` and `$S3_FQ2` as its `$FQ1` and `$FQ2` arguments. This is likely the cleaner solution.
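A rough sketch of what that could look like for paired-end reads, assuming `bowtie2` can read `-1`/`-2` input from named pipes. The function name `align_pair_from_s3` and its argument order are hypothetical, and conversion of the SAM output to bam is left to the existing script:

```shell
#!/bin/bash
# Hypothetical sketch: the alignment wrapper receives S3 URIs directly and
# streams both mates through named pipes, so no fastq is written to disk.
#   $1  bowtie2 index basename
#   $2  S3 URI of mate 1 ($S3_FQ1 in the text)
#   $3  S3 URI of mate 2 ($S3_FQ2 in the text)
# bowtie2 writes SAM to stdout.
align_pair_from_s3() {
  local index="$1" s3_fq1="$2" s3_fq2="$3"
  local tmp status pid1 pid2
  tmp=$(mktemp -d) || return 1
  mkfifo "$tmp/fq1" "$tmp/fq2" || return 1

  # Each download blocks until bowtie2 opens the matching pipe for reading.
  aws s3 cp --only-show-errors "$s3_fq1" - > "$tmp/fq1" &
  pid1=$!
  aws s3 cp --only-show-errors "$s3_fq2" - > "$tmp/fq2" &
  pid2=$!

  bowtie2 -x "$index" -1 "$tmp/fq1" -2 "$tmp/fq2"
  status=$?

  # If bowtie2 exited early, stop downloads still blocked on the pipes.
  kill "$pid1" "$pid2" 2>/dev/null
  wait
  rm -rf "$tmp"
  return "$status"
}
```

Keeping the download inside the wrapper means the caller only passes URIs, and cleanup of the pipes stays in one place.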