Closed: ikreymer closed this issue 1 year ago
I'll preface by saying that this is a very niche use case, so even if it is easy to implement, I'll have to weigh the bundle size costs to see if it's worth adding. That being said, this might be possible to add to the streaming API, i.e. fflate.Gunzip. I'll let you know if it seems feasible when I can.
Great, thanks for taking a look! If there's at least a way to get the amount of data consumed after the first member (I couldn't find that in the current state object), the rest could even be implemented on my end. Besides my use case, this has come up elsewhere too, at least in pako (e.g. bgzip, https://github.com/nodeca/pako/issues/139).
even if it is easy to implement I'll have to weigh the bundle size costs to see if it's worth adding
That's judicious. I'd like to elaborate on another use case, to outline this feature's potential value. In addition to the original description's use case for WARC, as mentioned above, this feature would help efficiently implement bgzip in JavaScript. That's an important algorithm for biology and medicine.
Enabling bgzip via fflate would improve speed and maintainability for bioinformatics applications. For example, bgzip allows genome visualization packages like JBrowse and igv.js to quickly load segments of the human genome or other genomic data, helping scientists assess DNA samples relating to cancer and other genetic diseases.
Currently for those use cases, genomics JS packages depend on pako (e.g. via bgzf-filehandle) or write custom bgzip handlers. As someone looking to use bgzip for another genomics web application, I'd ideally like to use something that has a faster, smaller, and more reusable core -- like fflate.
Support for reading multi-member gzip files, or providing access to the remaining data, would presumably let packages like bgzf-filehandle build atop fflate instead of pako. For a range of biological use cases, that'd save scientists time on every page load by having their browsers parse less JS, and it'd speed up development: bioinformatics engineers would only need to know fflate, not pako as well, for JavaScript that deals with compression and decompression.
I hope that helps explain some additional value in making fflate more versatile through this new feature!
I had forgotten about this issue but what you've proposed does seem compelling. I'll look into the bgzip spec and try to implement it for the next release.
OK, after looking at the requirements for this, it's actually not too difficult. Support for GZIP extra fields will need to be added on the compression side if you want to create bgzip in the browser, but that shouldn't affect bundle size too much. (Though the advertised 8kB is already a bit misleading if you don't tree shake.)
@ikreymer if you still need this for WARC, could you explain why exactly you need access to the byte offsets of each new member? At the moment I'm thinking of simply allowing you to push after the final block. Also @eweitz, random access into the GZIP from a .gzi file could simply involve the user starting the stream passed to fflate at the byte offset they choose.
This is needed to be able to create an index of the records in the WARC file, which is kept in a separate file/data structure. The index is created once by reading the entire WARC, but after that, the WARC is typically accessed via random access: seeking to a single member and inflating just that (e.g. by performing an HTTP range request for the data of a single member). To be able to create such an index in the browser, one needs the offsets of each member.
Bump for simple multi-member decompression support?
Added support for this and releasing in v0.8.0. The implementation transparently decompresses concatenated GZIP archives (as the gzip CLI tool does) and provides an onmember handler for Gunzip and AsyncGunzip.
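Based on the release note above, usage might look something like the following sketch. This is an assumption about the shape of the new API (handler name from the announcement; the exact signature and constructor usage are not verified here), not a definitive example:

```js
// Sketch only: assumes Gunzip takes an ondata callback and that
// onmember receives the byte offset at which each new member starts.
import { Gunzip } from 'fflate';

const memberOffsets = []; // byte offsets of each member, for an index

const gz = new Gunzip((chunk, final) => {
  // decompressed data for the current member(s)
});
gz.onmember = (offset) => {
  memberOffsets.push(offset);
};

// Feed the multi-member gzip data in as chunks arrive
// (e.g. from fetch(); `chunks` is illustrative):
for (const chunk of chunks) gz.push(chunk);
```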
Great! I didn't see the onmember handler yet; as soon as that's added, I can try it! I just need the offset for the start of each new member.
v0.8.0 published with these changes. Let me know if you find any issues!
What can't you do right now?
The gzip format supports 'multi-member' files, where gzip streams are essentially concatenated one after the other. This is used in certain formats, such as WARC.
An optimal solution
An optimal solution would be for fflate's Inflate to provide a way to parse multi-member gzip files via an option, plus an additional callback when a new member is started (as well as the offset of the member into the stream).
Another option is to provide an offset into the buffer consumed by reading the gzip, allowing the developer to manually create a new Inflate object.
(How) is this done by other libraries?
pako provides an avail_in counter which keeps track of how many bytes have not yet been consumed. One approach I've used is something like this: https://github.com/webrecorder/warcio.js/blob/main/src/readers.js#L282 (though this is with an earlier version of pako). Pako in its latest version may try to read multi-member gzips as one buffer, though it seems it doesn't always work (in my tests). A key part of my use case is being able to get an offset to the beginning of each member, and to flush the data buffer at the end of each member.
Ideally, there could be a callback that indicates when a new member has been started, along with the offset of that new member:
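A rough sketch of the proposed API (the callback name and option are this proposal's, not an existing fflate API):

```js
// Hypothetical: `multiMember` option and `onnewgzipmember` callback
// are proposed names, not part of fflate today.
const index = [];

const inflator = new fflate.Gunzip((chunk, final) => {
  // Chunks arriving after an onnewgzipmember call belong to that member.
});

inflator.onnewgzipmember = (offset) => {
  // `offset` is the byte position of this member in the input stream;
  // record it so the member can later be fetched with a range request.
  index.push(offset);
};
```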
The ondata callbacks after onnewgzipmember are assumed to belong to the new gzip member, and ondata always flushes when a member boundary is reached.