fulcrumgenomics / fgoxide

Quality of life improvements in Rust.
MIT License

Replace extension based compression detection by crates niffler #10

Open natir opened 1 year ago

natir commented 1 year ago

Hello,

I'm one of the niffler crate developers and I think you might be interested in this crate.

Niffler lets you open gzip, bzip2, lzma (xz), or zstd compressed files transparently just by calling niffler::get_reader. Format detection is based on the magic number at the beginning of the file, not on the extension, so there is no need to trust the file name.
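For readers unfamiliar with magic-number detection, here is a minimal std-only sketch of the idea. It is not niffler's implementation: the `Format` enum and the `sniff` and `detect` functions are hypothetical names made up for illustration. Only the byte signatures themselves are taken from the published gzip, bzip2, xz, and zstd formats.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a compression format enum (not niffler's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
    None,
}

/// Classify a stream from its leading bytes only; the file name is never consulted.
pub fn sniff(header: &[u8]) -> Format {
    const GZIP: &[u8] = &[0x1f, 0x8b];
    const BZIP2: &[u8] = b"BZh";
    const XZ: &[u8] = &[0xfd, b'7', b'z', b'X', b'Z', 0x00];
    const ZSTD: &[u8] = &[0x28, 0xb5, 0x2f, 0xfd];

    if header.starts_with(GZIP) {
        Format::Gzip
    } else if header.starts_with(BZIP2) {
        Format::Bzip2
    } else if header.starts_with(XZ) {
        Format::Xz
    } else if header.starts_with(ZSTD) {
        Format::Zstd
    } else {
        Format::None
    }
}

/// Peek at up to six bytes, classify them, and return a reader that replays
/// the peeked bytes before the rest of the input, so nothing is lost.
pub fn detect<R: Read>(mut input: R) -> io::Result<(Format, impl Read)> {
    let mut buf = [0u8; 6];
    let mut filled = 0;
    while filled < buf.len() {
        let n = input.read(&mut buf[filled..])?;
        if n == 0 {
            break; // input is shorter than the longest signature
        }
        filled += n;
    }
    let format = sniff(&buf[..filled]);
    let replay = io::Cursor::new(buf[..filled].to_vec());
    Ok((format, replay.chain(input)))
}
```

The returned reader still yields the compressed bytes; a real implementation would wrap it in the matching decoder before handing it to the caller.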

If you're interested, I can write a pull request.

nh13 commented 1 year ago

I'd welcome a pull request, but before you start, I think it makes sense to get a thumbs up from @tfenne.

tfenne commented 1 year ago

Yes please! I'm curious how things work on the writer side. Do you auto-pick compression based on the extension, or do you require users to specify?

tfenne commented 1 year ago

And thank you!

natir commented 1 year ago

We require users to specify (never trust a filename).

nh13 commented 1 year ago

> never trust a filename

I mostly agree. However, you have to trust them at some point (e.g. specifying the type of compression)? Perhaps if no compression type is given, we fall back on file extension detection? And if a compression type is given, we check the file extension against the few known ones so they don't mismatch, but continue on if the file extension is unknown?
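To make that proposal concrete, here is a rough sketch of the writer-side decision. Everything in it is hypothetical: the `Format` enum, the `from_extension` and `resolve` names, the list of known extensions, and the choice to treat a mismatch as a hard error rather than a warning are assumptions for illustration, not existing fgoxide or niffler API.

```rust
use std::path::Path;

/// Hypothetical format enum for illustration (not niffler's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
    None,
}

/// Map the few well-known extensions to a format; anything else is "unknown".
fn from_extension(path: &Path) -> Option<Format> {
    match path.extension()?.to_str()? {
        "gz" => Some(Format::Gzip),
        "bz2" => Some(Format::Bzip2),
        "xz" => Some(Format::Xz),
        "zst" => Some(Format::Zstd),
        _ => None,
    }
}

/// Decide which compression a writer should use.
pub fn resolve(path: &Path, requested: Option<Format>) -> Result<Format, String> {
    match (requested, from_extension(path)) {
        // Explicit choice that contradicts a known extension: refuse.
        (Some(want), Some(ext)) if want != ext => Err(format!(
            "requested {:?} but {} looks like {:?}",
            want,
            path.display(),
            ext
        )),
        // Explicit choice, extension agrees or is unknown: honor the request.
        (Some(want), _) => Ok(want),
        // No explicit choice: fall back on the extension.
        (None, Some(ext)) => Ok(ext),
        // Nothing to go on: write uncompressed.
        (None, None) => Ok(Format::None),
    }
}
```

With this sketch, `resolve(Path::new("out.vcf.gz"), None)` picks gzip from the extension, while asking for zstd on a path ending in `.gz` is rejected.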

This would also be a great time to solve how to specify the compression parameters for a wide variety of compression types (see: https://github.com/fulcrumgenomics/fgoxide/pull/9#discussion_r1297385179). I see that niffler has 22 levels, which are needed for zstd, but what does level 22 mean for zlib?

natir commented 1 year ago

The choice I've made in several applications is that if the input is compressed in one format, the output is compressed in the same format, leaving the user free to choose the output format via a parameter. As for the compression level, I've chosen to keep the default compression levels (niffler doesn't detect the compression level used).

If users aren't satisfied with this behavior, they can send the uncompressed output to standard output and pipe it to their preferred compression tool with the parameters they have chosen.

After all, this is a library, not an application, so we don't necessarily need to make this choice right now.

About compression levels: in niffler, if the requested level is too high for the chosen format, we fall back to the maximum level for that format.
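The fallback described here amounts to clamping the requested level to a per-format ceiling. A minimal sketch follows; the `Format` enum and the `max_level` and `clamp_level` functions are hypothetical and are not niffler's API. The ceilings are the usual upstream maxima (9 for zlib/gzip, bzip2, and xz presets; 22 for zstd), and niffler's own limits may differ.

```rust
/// Hypothetical format enum for illustration (not niffler's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
}

/// Highest level each format accepts (assumed from the upstream tools).
pub fn max_level(format: Format) -> u32 {
    match format {
        Format::Gzip | Format::Bzip2 | Format::Xz => 9,
        Format::Zstd => 22,
    }
}

/// A level that is too high for the format falls back to that format's maximum.
pub fn clamp_level(format: Format, requested: u32) -> u32 {
    requested.min(max_level(format))
}
```

Under these assumptions, asking for level 22 with gzip yields level 9, while the same request with zstd is kept as is.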

kockan commented 1 year ago

Thanks @natir for the explanation and suggestions! I will keep #9 open for the time being, primarily for reference, but I'm happy to close it and replace it with your PR that uses niffler.

My personal opinion on the reader/writer side of things would be as follows:

Finally, an interesting(?) case I could think of is something like #8 , where a VCF.gz could be read as a gzipped file but most downstream tools expect it to be written as a bgzf. I might have missed it but I couldn't see a BGZF module/support in niffler. Would it make sense to add it (assuming it actually isn't there and I didn't miss it) and have a rule like "if file format is VCF, even if original compression is gzip, writer will default to bgzf" or would that be too much?
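One way to picture such a rule, combined with the "output follows input unless the user chooses otherwise" behavior mentioned earlier in the thread, is the sketch below. All names (`Format`, `is_vcf`, `output_format`) are hypothetical, the `Bgzf` variant is assumed rather than taken from niffler, and recognizing VCF by file name is itself an assumption that a real implementation might replace with content inspection.

```rust
use std::path::Path;

/// Hypothetical format enum for illustration; `Bgzf` is an assumed variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Gzip,
    Bgzf,
    Bzip2,
    Xz,
    Zstd,
    None,
}

/// Guess whether the output is a VCF from its name (an assumption; see above).
fn is_vcf(path: &Path) -> bool {
    let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
    name.ends_with(".vcf") || name.ends_with(".vcf.gz") || name.ends_with(".vcf.bgz")
}

/// Pick the writer's format: an explicit choice wins, then the VCF rule,
/// then the format detected on the input.
pub fn output_format(output: &Path, input_format: Format, user_choice: Option<Format>) -> Format {
    if let Some(choice) = user_choice {
        return choice;
    }
    if is_vcf(output) && input_format == Format::Gzip {
        // Downstream VCF tools generally expect BGZF rather than plain gzip.
        return Format::Bgzf;
    }
    input_format
}
```

Keeping the explicit choice first means the VCF rule stays a default that callers can always override.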