RWilton / Arioc

Arioc: GPU-accelerated DNA short-read alignment
BSD 3-Clause "New" or "Revised" License

Allow named pipes as input to AriocE #18

Closed · karlkashofer closed this issue 2 years ago

karlkashofer commented 2 years ago

First of all, thanks for creating Arioc and publishing it under a permissive license!

I have seen a closed issue about fastq.gz and I am in a similar situation: I would like to remap BAM files without having to go BAM > uncompressed FASTQ > very large Arioc format > uncompressed SAM > BAM. With my WGS data this is a real pain, both in disk space and in network utilization.

I figured out that I can use a named pipe to feed the output file from AriocP directly into samtools, and that works great. However, I could not figure out a way to do the same with the input for AriocE. Whenever I use a named pipe as input it balks at me with "Bad address".

My IT guy tells me that you open the input as a "file", not as a "stream", so this cannot easily work. Thus my question: is there a chance you could make AriocE accept piped FASTQ data? As far as I understand it, that is just a different way of opening the input data; I presume you don't do any random access on the input FASTQ(s).

This would let AriocE accept any input that one can convert to FASTQ and stuff into a pipe, which would be great!
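For reference, the "Bad address" message aside, the underlying constraint is an OS-level one: a pipe has no file position, so any seek on it fails. A minimal demonstration of that behavior (nothing Arioc-specific; an anonymous pipe is used here, but a named FIFO behaves identically):

```python
import errno
import os

# A pipe has no file position, so any attempt to seek on it
# fails with ESPIPE ("Illegal seek").
r, w = os.pipe()
try:
    os.lseek(r, 0, os.SEEK_SET)
    seekable = True
except OSError as e:
    seekable = False
    assert e.errno == errno.ESPIPE
finally:
    os.close(r)
    os.close(w)

print("pipe seekable:", seekable)   # pipe seekable: False
```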

RWilton commented 2 years ago

I think your suggestion about supporting piped input in AriocE is a good one. It's something we considered in the past but did not implement. Perhaps it needs to be revisited.

But first: please let me try to set out some context. It's a bit TLDR, but here goes...

1) Functionality

The primary reason why AriocE does not support piped FASTQ input has to do with the way it parses input data.

As you probably know, the FASTQ format is awful. It wastes space on redundant syntax. It fails to provide for descriptive metadata about its own contents or about the data it contains. It contains stupid syntactical elements, including a separator character that can also appear in data. In a world of XML, XSD, JSON, and so on, FASTQ is an archaism that is overdue for oblivion. Unfortunately, this is the world of bioinformatics, where software engineering is regarded as a lesser discipline, so we're stuck with FASTQ (and FASTA, and SAM, and VCF) for the duration.

The point of this rant is that AriocE deals with FASTQ by making two passes through FASTQ input. The first pass "sniffs" the data in order to determine things like how base quality values are encoded. It then "rewinds" and encodes the data sequentially. I suppose we could hack something together that would support this functionality with a serial input stream, but it's easier just to use a file handle and let the OS worry about buffering the input data.
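To be concrete about what that first-pass "sniffing" entails: one thing it has to determine is the quality-score encoding. The sketch below is illustrative only, not AriocE's actual heuristic; it infers the encoding from the ASCII range of sampled quality strings:

```python
def sniff_quality_encoding(quality_lines):
    """Guess the FASTQ quality encoding from sampled quality strings.

    Phred+33 (Sanger / Illumina 1.8+) uses ASCII 33..74 ('!'..'J');
    Phred+64 (older Illumina) uses ASCII 64..104 ('@'..'h').
    Characters below ASCII 59 can only occur in Phred+33 data.
    """
    lo = min(min(map(ord, line)) for line in quality_lines if line)
    hi = max(max(map(ord, line)) for line in quality_lines if line)
    if lo < 59:
        return "phred+33"
    if hi > 74:
        return "phred+64"
    return "ambiguous"   # sample too small or too uniform to decide

print(sniff_quality_encoding(["II:FFF,FF", "FF8FFFFF:"]))   # phred+33
print(sniff_quality_encoding(["hhhhhfgeb", "ffffdbbb`"]))   # phred+64
```

The catch, of course, is that a heuristic like this only becomes reliable after seeing enough data, which is exactly why the real implementation rewinds and starts over once it has decided.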

2) Performance

In our experience, the conventional wisdom about piped I/O being significantly faster just doesn't hold up in practice. There are two reasons for this. One is simply that hardware and OS considerations contribute more to overall speed than does sequential I/O buffering (as in a pipe). If things are running slow and disk I/O is the bottleneck... well, fast disk devices (e.g., NVMe) are widely available and inexpensive.

The other is simply that AriocE is typically used as part of a toolchain (e.g. trimming, encoding FASTQ, aligning, filtering, etc.). Any incremental speed improvement associated with using pipes is dwarfed by the overhead of data ingress and egress, compressing and decompressing data, and running the software tools themselves. Furthermore, when the toolchain fails it can be restarted at the point of failure only if intermediate data is available in files.

3) Syntax

I suppose I can appreciate the "elegance" of writing a chain of piped commands in a Linux command line. On the other hand, I recognize the downside in regard to error handling, debugging, and so on. All things considered, supporting piped input syntax hasn't been a priority.

Having said all that: maybe at some point we can take another look at implementing AriocE to handle piped input. At the very least, the program ought perhaps to fail more gracefully if you try it!

Richard Wilton

karlkashofer commented 2 years ago

Dear Richard! Thanks for your detailed reply; I wholeheartedly agree with your point about file formats in the biocomputing sphere!

However, I still think a pipe feature would be really nice, as I could then use AriocE to stream data from NFS to the local SSD rather than having to store the data on the local disk twice (input FASTQ and AriocE output, both uncompressed).

You write:

The point of this rant is that AriocE deals with FASTQ by making two passes through FASTQ input. The first pass "sniffs" the data in order to determine things like how base quality values are encoded. It then "rewinds" and encodes the data sequentially. I suppose we could hack something together that would support this functionality with a serial input stream, but it's easier just to use a file handle and let the OS worry about buffering the input data.

How much input data do you need for the sniffing? What are the parameters you derive in step 1?

A few scenarios from my simple mind:

Thanks for your time!

RWilton commented 2 years ago

Ok, thanks for clarifying how you might take advantage of an input pipe in AriocE.

In terms of disk space: we usually keep the input FASTQ files just until we are sure that AriocE has finished its work. That is, you need space for both FASTQ and AriocE's output (encoded read sequences) only until then; if you don't care about the FASTQ at that point, you can delete the FASTQ files. Similarly, when you're finished aligning, you can zap the encoded reads (AriocE's output) unless you need to re-align the same data with different parameters or to a different genome or whatever.

This might seem a bit clunky if you have one WGS sample and one computer on which to analyze it. But when you have to push multiple samples concurrently through multiple compute nodes in a cluster -- a more typical scenario for HPC applications like AriocP -- you have better control over resource allocation and job scheduling if you just drop files into your shared filesystem.

As for how to implement one-pass piped input in AriocE: I'll take a look at the code to see if there's some straightforward way to do it. Right now, however, Arioc v1.50 is just about ready for release and it's too late to think about adding this functionality, so this goes into the "good ideas" list for next time.

karlkashofer commented 2 years ago

Thanks for your reply, that's more than I hoped for! So I am eagerly awaiting Arioc v1.51; let me know if you need beta testers :)

karlkashofer commented 2 years ago

Hi Richard!

Thanks for the 1.50 release; it's running smoothly here!

I wanted to ask if there is any news on the "piped input to AriocE" front?

Cheers, KK


RWilton commented 2 years ago

I just spent a couple of hours digging into the Arioc source code. Accommodating piped input would be tricky because there are dozens of places in the AriocE, AriocU, and AriocP implementations where the code seeks to a random position in both input and output data. These applications examine the internal structure of input files in order to infer data sizes and, if possible, to partition their contents for multithreaded, concurrent processing. For example, AriocE reads an entire FASTA-formatted reference-sequence file twice. On the first pass, it counts and parses definition lines and records sequence lengths (while taking linear and circular sequence topology into account). On the second pass, it encodes and writes both forward and reverse-complement sequences at their correct offsets in its output (.sbf) files.
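To make the seek dependence concrete, here is a sketch (illustrative only, not Arioc's code) of the kind of partitioning that random access enables: splitting a FASTA file into byte ranges for concurrent workers by seeking to approximate offsets and snapping forward to the next record boundary. None of this is possible on a pipe.

```python
import os

def fasta_partitions(path, n):
    """Split a FASTA file into at most n byte ranges, each starting
    at a '>' definition line.

    Requires random access: we seek to size*i/n, then scan forward to
    the next line that begins a record.
    """
    size = os.path.getsize(path)
    starts = []
    with open(path, "rb") as f:
        for i in range(n):
            f.seek(size * i // n)
            if i > 0:
                f.readline()          # discard the partial line we landed in
            pos = f.tell()
            line = f.readline()
            while line and not line.startswith(b">"):
                pos = f.tell()
                line = f.readline()
            if pos < size and (not starts or pos > starts[-1]):
                starts.append(pos)
    return list(zip(starts, starts[1:] + [size]))
```

Each returned (start, end) range can then be handed to a separate thread, which is how a file's contents get processed concurrently without any coordination between workers.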

Your specific suggestion about making two passes over the input data is well taken, but this approach would also negate any potential performance benefit of using piped I/O. That is, the input data would in fact need to be persisted in storage so that two-pass processing could be done. (The usual name for this kind of persisted data is "file" :-)
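To see why, consider what two passes over a pipe would have to do: spool the data to storage during the first pass so that the second pass has something to re-read. A sketch (the sniff/encode callables are hypothetical placeholders, not Arioc's code):

```python
import tempfile

def two_pass_over_stream(stream, sniff, encode, chunk=1 << 20):
    """Two-pass processing of a non-seekable stream.

    Pass 1 inspects ("sniffs") the data as it streams by, while spooling
    every byte to a temporary file; pass 2 re-reads the spooled copy.
    The input ends up persisted on disk anyway, which is exactly the
    point: the pipe's supposed I/O saving evaporates.
    """
    with tempfile.TemporaryFile() as spool:
        while True:
            block = stream.read(chunk)
            if not block:
                break
            sniff(block)          # pass 1: examine the data in flight
            spool.write(block)    # ...and persist it for pass 2
        spool.seek(0)
        return encode(spool.read())   # pass 2 over a seekable copy
```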

There's another consideration. Arioc uses a structured configuration file for runtime parameterization. It does not use command-line parameters, and command-line pipe syntax is not supported. If you have to use the config file for parameterization anyway, then there's no difference between specifying a named pipe and specifying a file.

So I think for now that piped input support is going to have to be relegated to the category of "it would be nice, but..." I totally sympathize with your dislike of filesystem clutter -- I admit that I do get annoyed when I write scripts that "forget" to delete transient files when they're no longer needed -- but we're dealing with hundreds of gigabytes of data in 1960s-era formats that contain no self-descriptive metadata and that do nothing to support concurrent processing or any other methods of high performance data access. (If it's any consolation, have a look at Ben Langmead's paper about adapting Bowtie to IBM Xeon Phi processors -- they had similar problems with FASTQ input and arrived at similar conclusions.)

karlkashofer commented 2 years ago

Dear Richard! Thanks for your continued communication and interest in helping me and the community!

Just to clarify my use case: in production I would push 100 GB FASTQs to one cluster node to map with Arioc. Pushing that amount of data takes about 30 minutes on our network. After the transfer I run the AriocE step, which also takes about 30 minutes; in sum, that's an hour. If I could use (named) pipes for AriocE, I could do the two steps in parallel and thus in 30 minutes, i.e. half the time.

How about this idea: add parameters to the AriocE config file that request pipe input and include all the info you need about the FASTQ? If those parameters are present, skip step 1 and open the input as a stream.
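Purely to illustrate the shape of the idea, something like this hypothetical fragment (none of these attributes exist in AriocE today; the element names are only modeled on its XML config style):

```xml
<!-- HYPOTHETICAL: neither "pipe" nor "qualityEncoding" is a real AriocE parameter -->
<dataIn sequenceType="Q" srcId="0" pipe="true" qualityEncoding="phred+33">
  <file>/tmp/reads.fifo</file>
</dataIn>
```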

RWilton commented 2 years ago

Hello, Karl --

Before I answer your question, I think I need to explain a bit of our design "philosophy" (or "prejudices" if you prefer).

We built Arioc with the assumption that it would not be used for one-off alignment of a few reads or even of a single WGS sequencer run. Instead, we looked at how people with dozens or hundreds of such WGS samples would handle multi-terabyte or even petabyte-scale sequence-alignment tasks. What we saw almost universally was FASTQ data residing in a network filesystem shared among multiple computers, each provisioned with a variety of resources (hundreds of gigabytes of system RAM, fast local storage, multicore CPUs, and of course GPUs).

The way to get things done efficiently (minimal cost, highest throughput) in this kind of heterogeneous environment is to use available hardware resources concurrently and to allocate appropriate hardware resources to specific computational tasks. In the case of short-read alignment, that means doing the alignment computations on costlier machines provisioned with GPUs and lots of CPU threads, and reserving the remaining work for less costly, shared resources.

For Arioc in particular, this means that we parse FASTQ separately rather than allocating CPU threads for this within the aligner application. We still want FASTQ parsing and alignment to occur concurrently -- but for Arioc, it's concurrent only when two or more different samples are being aligned. This works out well in practice because FASTQ parsing uses only a few CPU threads and only a modest amount of system RAM, so it gets done on inexpensive compute nodes while alignment proceeds on costlier GPU nodes.

Here is what that looks like at scale:

[Figure: workflow pileup]

Each of the 645 horizontal line segments represents a workflow for one set of reads:

On average, only about 10% of each workflow was attributable to AriocP running on a GPU machine. Everything else was allocated to nodes with shared CPU and memory. The point is that concurrent execution on multiple nodes is where you save overall elapsed time when you're processing hundreds of sequencer runs.

So: again, your point is well taken. Indeed, the first sample will take 30 minutes to transfer before AriocE will start working on it. But while the second sample is transferring, Arioc is processing the first sample concurrently. And by the tenth sample, you'll have other performance considerations to worry about besides the first sample's 30 "wasted" minutes :-)
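The back-of-the-envelope arithmetic behind that, assuming 30 minutes each for transfer and encoding per sample:

```python
def elapsed_minutes(n_samples, t_transfer=30.0, t_encode=30.0):
    """Total elapsed time for n samples under three scheduling models."""
    # Fully serial: transfer, then encode, one sample at a time.
    serial = n_samples * (t_transfer + t_encode)
    # Sample-level overlap (files + multiple nodes): sample k+1 transfers
    # while sample k encodes; only the first transfer is exposed.
    overlapped = t_transfer + n_samples * max(t_transfer, t_encode)
    # Within-sample overlap via pipes: transfer and encoding of the
    # same sample run concurrently.
    piped = n_samples * max(t_transfer, t_encode)
    return serial, overlapped, piped

print(elapsed_minutes(1))    # (60.0, 60.0, 30.0): pipes halve a single sample
print(elapsed_minutes(10))   # (600.0, 330.0, 300.0): pipes save only the first transfer
```

At one sample, piping really does cut the time in half; at ten samples, file-based sample-level overlap already captures nearly all of the gain.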

(And by the way, if you're really planning to process a lot of full-size WGS or WGBS FASTQ-formatted data, that 55 or 60 megabytes/second transfer rate may be a bottleneck!)

· rw