divonlan / genozip

A modern compressor for genomic files (FASTQ, SAM/BAM/CRAM, VCF, FASTA, GFF/GTF/GVF, 23andMe...), up to 5x better than gzip and faster too

Feature suggestion: Interleaved output for FASTQ #4

Closed ssadedin closed 3 years ago

ssadedin commented 4 years ago

Many aligners (e.g. BWA) can accept paired-end reads on standard input if they are "interleaved", that is, R1, then R2, then R1, then R2, etc.
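
To make the interleaved layout concrete, here is a minimal shell sketch that interleaves two uncompressed FASTQ files by hand. It only illustrates the format and is not how genozip works internally. The function name interleave_fastq is hypothetical, and the sketch assumes plain 4-line FASTQ records with R1 and R2 listed in the same order.

```shell
# Illustrative only: interleave two plain-text FASTQ files record by record.
# Assumes every record is exactly 4 lines and both files have the same read order.
interleave_fastq() {
    # $1 = R1 FASTQ file, $2 = R2 FASTQ file
    awk -v r2="$2" '
        { print }                          # pass through the current R1 line
        NR % 4 == 0 {                      # an R1 record has just ended
            for (i = 0; i < 4; i++) {      # emit the matching 4-line R2 record
                if ((getline line < r2) > 0) print line
            }
        }
    ' "$1"
}
```

Calling interleave_fastq mysample-R1.fastq mysample-R2.fastq writes R1 record 1, R2 record 1, R1 record 2, R2 record 2, and so on to standard output.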

Currently, I think that to run something like BWA on genozip-compressed FASTQ, I'd need to genounzip both read FASTQs to files and then run BWA on them.

However, if you supported a form of genocat that could output FASTQ in interleaved mode, it would be possible to stream directly from genozip-compressed FASTQ into BWA.

As a bonus: it would be even more useful if you could support a sharding factor that causes genocat to output only 1 in every N read pairs (this could be useful for BAM or VCF as well). This would allow us to run N copies of BWA from the same compressed FASTQ and then merge the BAMs afterward.
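
The sharding idea above can be sketched as a filter on an interleaved FASTQ stream. This is a hedged illustration done outside genocat, not a genocat feature: the function name shard_interleaved and its arguments are hypothetical, and it assumes 4-line records, so each read pair occupies exactly 8 consecutive lines.

```shell
# Illustrative only: keep the read pairs belonging to shard k out of n.
# Reads interleaved FASTQ on stdin; each pair is assumed to span 8 lines.
shard_interleaved() {
    # $1 = number of shards (n), $2 = zero-based shard index (k)
    awk -v n="$1" -v k="$2" '
        # int((NR - 1) / 8) is the zero-based index of the current read pair;
        # print the line only when that pair falls into shard k.
        int((NR - 1) / 8) % n == k
    '
}
```

For example, shard_interleaved 4 0 < interleaved.fq keeps pairs 0, 4, 8, and so on, while shard indices 1 to 3 select the remaining pairs, so four aligner processes could each take a disjoint quarter of the data.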

NB: I tried running genozip in paired mode and then using genocat on the result, but it didn't output the reads in interleaved mode.

Any chance of implementing interleaved mode? It would mean we can run BWA directly from genozip'd FASTQ for paired end reads!

divonlan commented 3 years ago

Hi Simon, thank you for your excellent suggestions. It has been implemented and released in 9.0.13.

Interleaving: For files compressed with:

    genozip --pair mysample-R1.fastq mysample-R2.fastq -o mysample.R1+2.fq.genozip

it is now possible to run:

    genocat --interleave mysample.R1+2.fq.genozip

The output can be piped into popular tools such as fastp and bwa, using each tool's own option for accepting interleaved data.
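
As a rough sketch of the streaming use case from the original request, the interleaved output could be fed straight into BWA. The wrapper name align_interleaved and the reference argument are hypothetical. The sketch assumes bwa mem's -p option (treat the input as interleaved pairs) and the "-" argument for reading from standard input behave as in current BWA releases, so check them against your installed version.

```shell
# Illustrative only: stream a paired genozip file into bwa mem without
# writing intermediate FASTQ files to disk.
align_interleaved() {
    # $1 = file created with `genozip --pair`, $2 = BWA index prefix (assumed to exist)
    # -p tells bwa mem that the single input stream holds interleaved pairs;
    # "-" makes it read that stream from standard input.
    genocat --interleave "$1" | bwa mem -p "$2" -
}
```

A call such as align_interleaved mysample.R1+2.fq.genozip ref.fa > mysample.sam would then write the alignments as SAM; ref.fa here stands in for whatever indexed reference you use.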

Downsampling: You can now use genocat --downsample with any file format. It is not yet suitable for sharding, though, because it always outputs the same subset; improvements to come.