Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings

Add zstd support #116

Open hjarnek opened 1 week ago

hjarnek commented 1 week ago

Hi,

It would be great to have support for zstd compression and decompression, especially for FASTQ files, which can get very large with modern sequencing technologies. zstd increasingly looks like the natural successor to gzip: it is much faster, has a better compression ratio, supports multithreading natively, and comes with a well-maintained C library. Hopefully the field of bioinformatics will move away from gzip in the near future, and zstd is an increasingly popular candidate. Are there any plans to implement this?

vjcitn commented 1 week ago

I just tried it out with hg19.fa and won't bother with statistics. For compressing the large single sequence, zstd with default parameters seems very performant relative to gzip. I then asked whether it is part of the samtools/htslib stack and saw https://github.com/samtools/htslib/pull/1770, so that does not seem super favorable at the moment. It does pop up in a UKBB workflow: https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/whole-exome-sequencing-oqfe-protocol/protocol-for-processing-ukb-whole-exome-sequencing-data-sets. @hjarnek please supply some links with information on uptake in bioinformatics so that we can assess the priority of such a move.
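
For anyone who wants to put numbers on a comparison like the one described above, here is a minimal sketch in Python. It is not part of the original comment or of Biostrings. It uses a small synthetic FASTA record as a stand-in for a real file such as hg19.fa, and it only benchmarks zstd if a binding is importable (the `compression.zstd` module in Python 3.14+, or the third-party `zstandard` package). The helper names `make_fasta`, `load_zstd_compress`, and `benchmark` are hypothetical.

```python
import gzip
import importlib
import random
import time


def make_fasta(n_bases: int, seed: int = 0) -> bytes:
    """Build a synthetic single-record FASTA (a stand-in for a real genome file)."""
    rng = random.Random(seed)
    seq = "".join(rng.choice("ACGT") for _ in range(n_bases))
    lines = [seq[i:i + 60] for i in range(0, len(seq), 60)]
    return (">synthetic\n" + "\n".join(lines) + "\n").encode("ascii")


def load_zstd_compress():
    """Return a zstd one-shot compress function if a binding is installed, else None."""
    for name in ("compression.zstd", "zstandard"):
        try:
            return importlib.import_module(name).compress
        except ImportError:
            continue
    return None


def benchmark(name, compress, data):
    """Time one compression call and report the ratio (original size / compressed size)."""
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    return {"codec": name, "ratio": len(data) / len(out), "seconds": elapsed}


data = make_fasta(200_000)
results = [benchmark("gzip", gzip.compress, data)]

zstd_compress = load_zstd_compress()
if zstd_compress is not None:  # zstd is optional in this sketch
    results.append(benchmark("zstd", zstd_compress, data))

for r in results:
    print(f"{r['codec']}: ratio {r['ratio']:.2f}, {r['seconds']:.3f} s")
```

Random sequence like this compresses differently from a real genome or FASTQ file, and both codecs run at their default levels, so results from this sketch say little about real-world performance. Swapping `make_fasta` for the bytes of an actual file would give a more meaningful comparison.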

hjarnek commented 1 week ago

I don't have any specific sources; it's just an observation that zstd is widely used in other contexts. Since gzip is getting old compared to more modern compression algorithms, I thought zstd could be a good successor, but who knows what the field will eventually settle on. I'm a biologist, not a computer scientist, but I think it's clear that data compression is becoming increasingly valued as the amounts of data grow, in bioinformatics as elsewhere, so I find it logical that people will try to move away from gzip in the near future. There are of course other fast, high-compression algorithms besides zstd, and maybe another one is better suited. I see the discussion was going strong for a while in the GH issue related to the PR you linked, and according to a pretty graph there, zstd seems to come out on top with bioinformatic data too. But I'm not the right person to discuss technical details with.