Open PlatonB opened 4 years ago
@PlatonB I agree that this would be a nice feature. We're currently refactoring GATK to use a new type for tool inputs that may eventually enable us to support stdin as an input. @cmnbroad Can you comment on the feasibility of adding stdin support once the path migration is complete?
The GATKPath migration is a start, but there are other issues we'd need to address to make tools read from stdin as in the zcat example above (it tries to read from /dev/stdout, which definitely won't work, but presumably the intent was to read from stdin). For example, https://github.com/samtools/htsjdk/issues/1084.
Some tools can write to /dev/stdout now - I just tried it with PrintReads, which works fine. Of course, you have to live with the default output format (BAM in that case), and we'll never be able to read/write sibling files, such as an index.
(it tries to read from /dev/stdout, which definitely won't work, but presumably the intent was to read from stdin)
Sorry for my inattention. I fixed it.
@cmnbroad I figured I'd bump an old issue rather than create a new one, but my group would also appreciate it if more GATK tools supported stdin / stdout. I've noticed that several of the Picard versions of tools support reading from stdin, but the Spark GATK replacements do not. MarkDuplicates is a big one: it accepts /dev/stdin as input, but MarkDuplicatesSpark does not.
For the Spark tools this may be more work, because they chunk the file and split it across threads / processes, but it would be great if GATKPath / HtsPath could detect that we're operating on stdin / stdout and, instead of calling Files.newInputStream, do something like wrap System.in in a buffered stream where appropriate.
I realize that not all tools will be able to do this, because clearly you can't get random access to a file through a pipe, but there are plenty of tools that just read a single large file once through.
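To make the idea concrete, here's a minimal sketch of the detection described above. The class and method names are hypothetical, not actual GATKPath / HtsPath API: it recognizes the conventional stdin specifiers and returns a buffered wrapper around System.in rather than opening a regular file.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of GATKPath/HtsPath: detect stdin-style
// input specs and wrap System.in instead of opening a regular file.
public final class StdinAwareInput {

    // Treat the conventional specifiers for standard input as stdin.
    public static boolean isStandardInput(String spec) {
        return "-".equals(spec) || "/dev/stdin".equals(spec);
    }

    // Return a stream for the spec: a buffered wrapper around System.in
    // for stdin specs (which callers should not close), otherwise a
    // regular file stream via Files.newInputStream.
    public static InputStream open(String spec) throws IOException {
        if (isStandardInput(spec)) {
            return new BufferedInputStream(System.in);
        }
        return Files.newInputStream(Paths.get(spec));
    }

    private StdinAwareInput() {}
}
```

One detail: binary formats like BAM need a byte stream rather than a character-oriented BufferedReader, which is why this sketch wraps System.in in a BufferedInputStream.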
There's a collection of older issues around better stdin/stdout support, or at the least documentation around it: https://github.com/broadinstitute/gatk/issues/5779 https://github.com/broadinstitute/gatk/issues/2236
@pettyalex I don't think the Spark tools such as MarkDuplicatesSpark are likely candidates for stdin/stdout support. As you point out, they achieve parallelism by partitioning and then randomly accessing serialized input files. Even if they could read from stdin, the benefits would be minimal, since they can't begin processing until they've seen the entire input stream, and they can't begin assembling the output until all of the worker nodes have finished processing their individual shards. So it would still require serializing the input, and I think the coarse-grained process-parallelism you usually get from pipelining would be pretty minimal.
@pettyalex Progress on this has been slow, but if you have other candidate tools you'd like to see prioritized, let us know what they are.
Feature request
Tool(s) or class(es) involved
CNNScoreVariants and perhaps other tools.
Description
It would be nice if tools could take another tool's stdout as their input, for example when it is necessary to pipe raw VCF from a caller directly into CNNScoreVariants. In the current version, this produces an error.