Open AlexGustafsson opened 3 years ago
One solution is to reset the gzip reader, just as we do with the buffered reader. The problem is that we don't know how many compressed bytes we've read.
From the source (and documentation):
// Calling Multistream(false) disables this behavior; disabling the behavior
// can be useful when reading file formats that distinguish individual gzip
// data streams or mix gzip data streams with other data streams.
// In this mode, when the Reader reaches the end of the data stream,
// Read returns io.EOF. The underlying reader must implement io.ByteReader
// in order to be left positioned just after the gzip stream.
// To start the next stream, call z.Reset(r) followed by z.Multistream(false).
// If there is no next stream, z.Reset(r) will return io.EOF.
Perhaps we could read each stream separately - that way we could keep track of the offset in the file at which each record starts? That only works if the gzip reader doesn't buffer past the end of the stream, though. Also, this wouldn't help us with archives that are compressed into a single gzip stream.
Although WARCs created by Larch support streams, we're currently unable to make use of that in the server.
The issue is this:
There is a library that might help us with this, but it seems rather stale and unused. It offers a seekable gzip reader: https://pkg.go.dev/github.com/rasky/multigz.