caozhichongchong / QuickVariants

Fast and Accurate Variant Identification Tool for Sequencing-Based Studies

stdout / stdin #1

Closed hildebra closed 1 month ago

hildebra commented 6 months ago

Hey, very interesting program. Is there any support for writing the vcf to stdout (e.g. "--out-vcf -")? And since the .sam will mostly be stored as .bam or .cram, is it possible to read the input file from stdin (piping)? E.g. "samtools view align.bam | java -Xms10g -Xmx10g -jar quick-variants-VERSION.jar --in-sam - [other options]"? Thanks, Falk

mathjeff commented 6 months ago

Thanks!

Reading alignments from stdin via something like --in-sam - sounds like a neat idea to me.

Can you tell us more about the advantages of sending the .vcf file to stdout?

I'm thinking that if the .vcf file gets sent to stdout then status messages would be more complicated - maybe the user would have to send them to a log file via something like --log somefile.log and also decide whether to look in that file.

For the moment, it should be possible to send the .vcf file to stdout via something like java -jar quick-variants.jar --in-sam in.sam --out-vcf out.vcf > logfile && cat out.vcf

hildebra commented 6 months ago

Hey, thanks. Yes, I think reading the sam from stdin can avoid a lot of unnecessary conversion and disk IO for large files (and sometimes these files can be very large). This is also the reason for using stdout: piping the vcf into a postfilter. For my own programs, I nowadays try to write status messages to stderr (see also bwa/samtools/bowtie2 etc., which do something similar). In my experience, the .vcf for metagenomes can become too large to handle on normal IO systems, and frankly this can just be an unnecessary burden for HPC systems. But I realize that if your current status messages go to stdout, redirecting all of that can be quite a task. Best, Falk
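The stdout/stderr split described in this comment can be sketched in a few lines of Python. This is only an illustration of the general convention, not how QuickVariants (a Java program) handles its output; the function name `write_variants` and the message format are hypothetical.

```python
import sys


def write_variants(records, data_out=None, log_out=None):
    """Write tab-separated records to data_out and status notes to log_out.

    Hypothetical helper: data goes to stdout and status goes to stderr by
    default, so a downstream filter reading the pipe only ever sees records.
    """
    data_out = sys.stdout if data_out is None else data_out
    log_out = sys.stderr if log_out is None else log_out
    count = 0
    for record in records:
        data_out.write("\t".join(str(field) for field in record) + "\n")
        count += 1
    # The status message never mixes with the data stream.
    log_out.write(f"wrote {count} records\n")
    return count
```

With this layout, a caller can pipe the data stream into a filter and redirect the status stream to a log file independently.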

mathjeff commented 3 months ago

Hey thanks for the info - sorry for the delay.

What kinds of filtering would you like to apply to the output? I wonder if it would also be helpful to add those filters into QuickVariants.

mathjeff commented 1 month ago

In version 1.1.0 we've now added support for specifying a .sam file coming from stdin with a dash via something like cat mysam | java -jar quick-variants.jar --in-ordered-sam - ...

We've also now added some options for filtering the output .vcf file, similar to what's been supported by --out-mutations, to allow something like --out-vcf build/out.vcf --snp-threshold 5 0.9 --indel-threshold 1 0.8 --num-threads 10.

Let us know if you have any more thoughts.

Thanks
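For readers unfamiliar with the dash convention mentioned above, here is a minimal Python sketch of treating "-" as standard input. It is an assumption-laden illustration of the idea, not the QuickVariants implementation; `open_text_input` and `count_alignment_lines` are hypothetical names.

```python
import contextlib
import sys


@contextlib.contextmanager
def open_text_input(path):
    """Yield a readable text stream; the path "-" means standard input."""
    if path == "-":
        # stdin belongs to the process, so it is not closed here.
        yield sys.stdin
    else:
        with open(path, "r") as handle:
            yield handle


def count_alignment_lines(path):
    """Count SAM alignment lines, skipping blank lines and "@" header lines."""
    with open_text_input(path) as stream:
        return sum(
            1 for line in stream
            if line.strip() and not line.startswith("@")
        )
```

Because the rest of the program only sees a stream, the same parsing code works whether the alignments come from a file on disk or from a pipe.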

hildebra commented 1 month ago

Thanks Jeff for the reply, this looks very useful. Sorry I didn't mention this earlier, but I was thinking of a vcf format that outputs all positions (even those not calling a SNP), which could then become huge. Here piping works better, as you don't need to save such a (pretty wasteful) vcf while still reporting all positions in a gene to a follow-up script. E.g. we use this "wasteful" vcf to generate consensus sequences. I will try to integrate QuickVars into our pipeline MG-TK in the next weeks. Thanks, Falk
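As a rough illustration of the follow-up script described here, the sketch below builds a consensus sequence from an all-positions vcf read as a stream of lines. It rests on several assumptions not stated in the thread: standard tab-separated VCF columns (CHROM, POS, ID, REF, ALT, ...), exactly one single-base record per position in ascending order with no gaps, and "." in ALT meaning no variant was called. The function name `consensus_from_vcf` is hypothetical, and indels are not handled.

```python
def consensus_from_vcf(lines):
    """Build {contig: sequence} from VCF lines with one record per position.

    Assumptions (hypothetical, see lead-in): records are sorted and gap-free,
    each covers a single base, and ALT "." means the reference base is kept.
    """
    pieces = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        chrom, ref, alt = fields[0], fields[3], fields[4]
        # Take the first listed alternate allele when one is present.
        base = ref if alt == "." else alt.split(",")[0]
        pieces.setdefault(chrom, []).append(base)
    return {chrom: "".join(bases) for chrom, bases in pieces.items()}
```

Since the function accepts any iterable of lines, it can be fed `sys.stdin` directly, so the large intermediate vcf never needs to be stored.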