Open PlatonB opened 4 years ago
@PlatonB I agree that this would be a nice feature. We're currently refactoring GATK to use a new type for tool inputs that may eventually enable us to support stdin as an input. @cmnbroad Can you comment on the feasibility of adding stdin support once the path migration is complete?
The GATKPath migration is a start, but there are other issues we'd need to address to make tools read from stdin as in the zcat example above (it tries to read from /dev/stdout, which definitely won't work, but presumably the intent was to read from stdin). For example, https://github.com/samtools/htsjdk/issues/1084.
Some tools can write to /dev/stdout now - I just tried it with PrintReads, which works fine. Of course, you have to live with the default output format (BAM in that case), and we'll never be able to read/write sibling files, such as an index.
(it tries to read from /dev/stdout, which definitely won't work, but presumably the intent was to read from stdin)
Sorry for my inattention. I fixed it.
@cmnbroad I figured I'd bump an old issue rather than create a new one, but my group would also appreciate it if more GATK tools supported stdin / stdout. I've noticed that several of the Picard versions of tools support reading from stdin, but the Spark GATK replacements do not. MarkDuplicates is a big one: it accepts /dev/stdin as input, but MarkDuplicatesSpark does not.
For the Spark tools this may be more work, because they chunk the file and split it across threads / processes, but it would be great if GATKPath / HtsPath could detect that we're operating on stdin / stdout and, instead of calling Files.newInputStream, do something like wrap System.in in a buffered stream where appropriate.
I realize that not all tools will be able to do this, because clearly you can't get random access to a file through a pipe, but there are plenty of tools that just read a single large file once through.
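To make the idea concrete, here's a minimal sketch of the detection described above. The class and method names are hypothetical, not actual GATKPath / HtsPath API: it recognizes the conventional stdin specifiers and returns a buffered wrapper around System.in rather than opening a regular file.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of GATKPath/HtsPath: detect stdin-style
// input specs and wrap System.in instead of opening a regular file.
public final class StdinAwareInput {

    // Treat the conventional specifiers for standard input as stdin.
    public static boolean isStandardInput(String spec) {
        return "-".equals(spec) || "/dev/stdin".equals(spec);
    }

    // Return a stream for the spec: a buffered wrapper around System.in
    // for stdin specs (which callers should not close), otherwise a
    // regular file stream via Files.newInputStream.
    public static InputStream open(String spec) throws IOException {
        if (isStandardInput(spec)) {
            return new BufferedInputStream(System.in);
        }
        return Files.newInputStream(Paths.get(spec));
    }

    private StdinAwareInput() {}
}
```

One detail: binary formats like BAM need a byte stream rather than a character-oriented BufferedReader, which is why this sketch wraps System.in in a BufferedInputStream.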
There's a collection of older issues around better stdin/stdout support, or at the least documentation around it: https://github.com/broadinstitute/gatk/issues/5779 https://github.com/broadinstitute/gatk/issues/2236
@pettyalex I don't think the Spark tools such as MarkDuplicatesSpark are likely candidates for stdin/stdout support. As you point out, they achieve parallelism by partitioning and then randomly accessing serialized input files. Even if they could read from stdin, the benefits would be minimal, since they can't begin processing until they've seen the entire input stream, and they can't begin assembling the output until all of the worker nodes have finished processing their individual shards. So it would still require serializing the input, and I think the coarse-grained process-parallelism you usually get from pipelining would be pretty minimal.
@pettyalex Progress on this has been slow, but if you have other candidate tools you'd like to see prioritized, let us know what they are.
Feature request
Tool(s) or class(es) involved
CNNScoreVariants and perhaps other tools.
Description
It would be nice if tools could take another tool's stdout as their input, for example when it is necessary to pipe raw VCF from a caller directly into CNNScoreVariants. In the current version, this produces an error.