101arrowz / fflate

High performance (de)compression in an 8kB package
https://101arrowz.github.io/fflate
MIT License
Support reading multi-member gzip files or providing access to remaining data #102

Closed ikreymer closed 1 year ago

ikreymer commented 2 years ago

What can't you do right now?

Gzip supports 'multi-member' files, in which complete gzip streams are concatenated one after another. This is used in certain formats, such as WARC.
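To illustrate what a multi-member file is, here is a minimal sketch using Node's built-in zlib module rather than fflate (a choice made only to keep the example dependency-free). The payload strings are made up for the example.

```javascript
const zlib = require('node:zlib');

// Compress two payloads independently; each result is a complete gzip member.
const memberA = zlib.gzipSync('first record\n');
const memberB = zlib.gzipSync('second record\n');

// Concatenating the members byte-for-byte yields a multi-member gzip file.
const multiMember = Buffer.concat([memberA, memberB]);

// Node's gunzip decodes every member and returns the concatenated contents.
const text = zlib.gunzipSync(multiMember).toString('utf8');
console.log(JSON.stringify(text));
```

The second member starts at byte offset memberA.length, which is the kind of offset this issue asks a decompressor to report.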

An optimal solution

An optimal solution would be for fflate's Inflate to provide a way to parse multi-member gzip files via an option, plus an additional callback invoked when a new member starts (passing the offset of that member in the stream).

Another option is to provide an offset into the buffer consumed by reading the gzip, allowing the developer to manually create a new Inflate object.

(How) is this done by other libraries?

pako provides an avail_in counter, which keeps track of how many bytes have not yet been consumed. One approach I've used is something like this: https://github.com/webrecorder/warcio.js/blob/main/src/readers.js#L282 (though this is with an earlier version of pako). The latest version of pako may try to read multi-member gzips as one buffer, though in my tests it doesn't always work.

A key requirement for my use case is to be able to get the offset of the beginning of each member, and to flush the data buffer at the end of each member.

Ideally, there could be a callback that indicates when a new member has been started and the offset at that new member:

onnewgzipmember: OnNewGzipMemberCallback
OnNewGzipMemberCallback = (offset: number) => void

The ondata callbacks after onnewgzipmember are assumed to carry data from that gzip member, and ondata always flushes when a member boundary is reached.

101arrowz commented 2 years ago

I'll preface by saying that this is a very niche use case, so even if it is easy to implement I'll have to weigh the bundle size costs to see if it's worth adding. That being said, this might be possible to add to the streaming API, i.e. fflate.Gunzip. I'll let you know if it seems feasible when I can.

ikreymer commented 2 years ago

Great, thanks for taking a look! If there's at least a way to get the amount of data consumed after the first member (I couldn't find that in the current state object), the rest could even be implemented on my end. Besides my use case, it looks like this has come up for other use cases, at least in pako (e.g. bgzip, https://github.com/nodeca/pako/issues/139)
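The "amount of data consumed" idea can be sketched with Node's built-in zlib instead of fflate or pako. This is an illustration under assumptions, not how either library works: the helper names gzipHeaderLength and splitMembers are hypothetical, the input must be a Buffer, and the sketch relies on zlib's `info: true` option, which returns the engine whose bytesWritten reports how many input bytes the inflater consumed.

```javascript
const zlib = require('node:zlib');

// Length of one gzip member header (RFC 1952), including optional fields.
function gzipHeaderLength(buf, start) {
  if (buf[start] !== 0x1f || buf[start + 1] !== 0x8b) {
    throw new Error(`no gzip magic bytes at offset ${start}`);
  }
  const flags = buf[start + 3];
  let pos = start + 10;                               // fixed-size part
  if (flags & 0x04) pos += 2 + buf.readUInt16LE(pos); // FEXTRA: length + payload
  if (flags & 0x08) pos = buf.indexOf(0, pos) + 1;    // FNAME: zero-terminated
  if (flags & 0x10) pos = buf.indexOf(0, pos) + 1;    // FCOMMENT: zero-terminated
  if (flags & 0x02) pos += 2;                         // FHCRC: 2-byte header CRC
  return pos - start;
}

// Walk a multi-member gzip Buffer and return each member's offset,
// compressed length, and decompressed data.
function splitMembers(buf) {
  const members = [];
  let offset = 0;
  while (offset < buf.length) {
    const headerLen = gzipHeaderLength(buf, offset);
    // The raw inflater stops at the end of the deflate stream;
    // engine.bytesWritten is the number of input bytes it consumed.
    const { buffer, engine } = zlib.inflateRawSync(
      buf.subarray(offset + headerLen), { info: true });
    // Each member ends with an 8-byte trailer (CRC32 + ISIZE).
    const length = headerLen + engine.bytesWritten + 8;
    members.push({ offset, length, data: buffer });
    offset += length;
  }
  return members;
}
```

This sketch does not verify the CRC32 in each trailer and reads the whole file into memory, so it is a starting point rather than a replacement for streaming support in the library.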

eweitz commented 2 years ago

even if it is easy to implement I'll have to weigh the bundle size costs to see if it's worth adding

That's judicious. I'd like to elaborate on another use case, to outline this feature's potential value. In addition to the original description's use case for WARC, as mentioned above, this feature would help efficiently implement bgzip in JavaScript. That's an important algorithm for biology and medicine.

Enabling bgzip via fflate would improve speed and maintainability for bioinformatics applications. For example, bgzip allows genome visualization packages like JBrowse and igv.js to quickly load segments of the human genome or other genomic data, helping scientists assess DNA samples relating to cancer and other genetic diseases.

Currently for those use cases, genomics JS packages depend on pako (e.g. via bgzf-filehandle) or write custom bgzip handlers. As someone looking to use bgzip for another genomics web application, I'd ideally like to use something that has a faster, smaller, and more reusable core -- like fflate.

Support for reading multi-member gzip files or providing access to remaining data would presumably enable packages like bgzf-filehandle or others to build atop fflate instead of pako. So, for a range of biological use cases, that'd save scientists time on every page load by having their browsers parse less JS, and speed up development by requiring bioinformatics engineers to only know fflate and not also pako for JavaScript that deals with compression and decompression.

I hope that helps explain some additional value in making fflate more versatile through this new feature!

101arrowz commented 2 years ago

I had forgotten about this issue but what you've proposed does seem compelling. I'll look into the bgzip spec and try to implement it for the next release.

101arrowz commented 2 years ago

OK, after looking at the requirements, this is actually not too difficult. Support for GZIP extra fields will need to be added on the compression side if you want to create bgzip in the browser, but that shouldn't affect bundle size too much. (Though the advertised 8kB is already a bit misleading if you don't tree shake.)

@ikreymer if you still need this for WARC, could you explain why exactly you need access to the byte offsets of each new member? At the moment I'm thinking of simply allowing you to push after the final block. Also @eweitz random access into the GZIP from a .gzi file could simply involve the user starting the stream passed to fflate at the byte offset they choose.

ikreymer commented 2 years ago

@ikreymer if you still need this for WARC, could you explain why exactly you need access to the byte offsets of each new member? At the moment I'm thinking of simply allowing you to push after the final block. Also @eweitz random access into the GZIP from a .gzi file could simply involve the user starting the stream passed to fflate at the byte offset they choose.

This is needed to be able to create an index of the records in the WARC file, which is kept in a separate file/data structure. The index is created once by reading the entire WARC, but after that, the WARC is typically accessed via random access, seeking to a single member and inflating just that (e.g. by performing an HTTP range request for the data of a single member). To create an index in the browser, I need to be able to get the offset of each member.
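The index-then-seek pattern described above can be sketched as follows, again with Node's built-in zlib. The record strings and the readRecord helper are hypothetical. Note that this sketch builds the index while writing the archive, which sidesteps the actual problem in this issue (recovering member offsets when reading an existing file).

```javascript
const zlib = require('node:zlib');

// Hypothetical payloads standing in for WARC records.
const records = ['record one', 'record two', 'record three'];

// Write each record as its own gzip member and note where it lands.
const index = [];
const parts = [];
let offset = 0;
for (const record of records) {
  const member = zlib.gzipSync(record);
  index.push({ offset, length: member.length });
  parts.push(member);
  offset += member.length;
}
const archive = Buffer.concat(parts);

// Random access: slice out one member by its index entry and inflate only that.
function readRecord(archiveBytes, entry) {
  const member = archiveBytes.subarray(entry.offset, entry.offset + entry.length);
  return zlib.gunzipSync(member).toString('utf8');
}

console.log(readRecord(archive, index[1]));
```

In a browser, the subarray call would typically be replaced by an HTTP range request for bytes entry.offset through entry.offset + entry.length - 1.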

OzySky commented 1 year ago

Bump for simple multi-member decompression support?

101arrowz commented 1 year ago

Added support for this and releasing in v0.8.0. The implementation transparently decompresses concatenated GZIP archives (as the gzip CLI tool does) and provides an onmember handler for Gunzip and AsyncGunzip.
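A minimal sketch of how the onmember handler might be used to collect member offsets. Only the names Gunzip, ondata, and onmember come from this thread; the helper collectMemberOffsets is hypothetical, and the assumption that the handler receives a single numeric byte offset should be checked against the fflate documentation, as should whether it fires for the first member (at offset 0) or only for later ones. The constructor is passed in as a parameter so the sketch does not hard-code how fflate is loaded.

```javascript
// Collect decompressed chunks and member offsets from an fflate-style
// streaming gunzip. `GunzipCtor` is expected to be fflate's Gunzip class
// (v0.8.0 or later); `chunks` is an array of Uint8Array pieces of the file.
function collectMemberOffsets(GunzipCtor, chunks) {
  const offsets = [];
  const output = [];
  const gunzip = new GunzipCtor();

  // Called whenever decompressed data is available.
  gunzip.ondata = (data, final) => {
    output.push(data);
  };

  // Assumed signature: called with the byte offset at which a member starts.
  gunzip.onmember = (offset) => {
    offsets.push(offset);
  };

  // Push every chunk, marking the last one as final.
  chunks.forEach((chunk, i) => {
    gunzip.push(chunk, i === chunks.length - 1);
  });

  return { offsets, output };
}

// Usage, assuming fflate is installed:
//   const { Gunzip } = require('fflate');
//   const { offsets } = collectMemberOffsets(Gunzip, [fileBytes]);
```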

ikreymer commented 1 year ago

Added support for this and releasing in v0.8.0. The implementation transparently decompresses concatenated GZIP archives (as the gzip CLI tool does) and provides an onmember handler for Gunzip and AsyncGunzip.

Great! I don't see the onmember handler yet; as soon as that's added, I can try it. I just need the offset for the start of each new member.

101arrowz commented 1 year ago

v0.8.0 published with these changes. Let me know if you find any issues!