AlexGustafsson / larch

A self-hosted service and toolset for managing, archiving, viewing and sharing bookmarks
MIT License

Payloads in compressed WARCs are not lazily readable #16

Open AlexGustafsson opened 3 years ago

AlexGustafsson commented 3 years ago

Although WARCs created by Larch support streaming, we're currently unable to make use of it in the server.

The issue is this:

  1. We use a ReadSeeker to be able to scrub in a stream
  2. If the archive is compressed, the ReadSeeker is wrapped with a gzip.Reader
  3. Using the ReadSeeker or gzip.Reader, a bufio.Reader is created - this offers the most convenience, flexibility and support
  4. When a header is read, we'd like to find the offset in the file at which it starts. That works for a raw file, since the bytes we read correspond one-to-one to bytes in the file. It does not work for a gzipped file: there, any offset we compute points into the decompressed content, not into the file (see the sketch after this list)
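For illustration, here's a minimal sketch of that setup - not the actual Larch code, and `archive.warc.gz` is just a placeholder path:

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"io"
	"os"
)

func main() {
	// "archive.warc.gz" is a placeholder path for this sketch.
	f, err := os.Open("archive.warc.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Steps 1-3: the ReadSeeker (f), wrapped in a gzip reader,
	// wrapped in a bufio.Reader.
	zr, err := gzip.NewReader(f)
	if err != nil {
		panic(err)
	}
	br := bufio.NewReader(zr)

	// Step 4: read the first header line of a record.
	line, err := br.ReadString('\n')
	if err != nil {
		panic(err)
	}

	// len(line) is an offset into the *decompressed* content. The
	// underlying file has meanwhile advanced much further: bufio.Reader
	// and gzip's internal buffering both read ahead, and compressed
	// bytes don't map one-to-one to decompressed bytes.
	filePos, _ := f.Seek(0, io.SeekCurrent)
	fmt.Printf("read %d decompressed bytes; file is at offset %d\n",
		len(line), filePos)
}
```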

There is a library that might help us with this, but it seems rather stale and unused. It offers a seekable gzip reader: https://pkg.go.dev/github.com/rasky/multigz.

AlexGustafsson commented 3 years ago

One solution is to reset the gzip reader, just as we do with the buffered reader. The issue is that we don't know how many bytes of the compressed form we've read.
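One way around that might be to wrap the underlying reader in a counter that also implements io.ByteReader. When the reader implements io.ByteReader, gzip (via flate) reads from it directly instead of through an internal bufio.Reader, so the count should match the file position exactly - the same property the gzip documentation relies on for multistream reading. A minimal sketch (countingByteReader is a made-up name for this example):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// countingByteReader wraps an io.Reader and counts every byte handed out.
// Because it implements io.ByteReader, gzip (via flate) reads from it
// directly instead of wrapping it in an internal bufio.Reader, so the
// count matches the underlying reader's position exactly.
type countingByteReader struct {
	r io.Reader
	n int64 // compressed bytes consumed so far
}

func (c *countingByteReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err
}

func (c *countingByteReader) ReadByte() (byte, error) {
	var buf [1]byte
	n, err := io.ReadFull(c.r, buf[:])
	c.n += int64(n)
	return buf[0], err
}

func main() {
	// Compress some data so we know its exact compressed size.
	var compressed bytes.Buffer
	zw := gzip.NewWriter(&compressed)
	zw.Write([]byte("WARC/1.1\r\n..."))
	zw.Close()
	total := compressed.Len()

	// Decompress through the counter: after EOF, c.n should equal the
	// number of compressed bytes, i.e. the offset just past this stream.
	c := &countingByteReader{r: &compressed}
	zr, err := gzip.NewReader(c)
	if err != nil {
		panic(err)
	}
	if _, err := io.Copy(io.Discard, zr); err != nil {
		panic(err)
	}
	fmt.Printf("consumed %d of %d compressed bytes\n", c.n, total)
}
```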

AlexGustafsson commented 3 years ago

From the source (and documentation):

// Calling Multistream(false) disables this behavior; disabling the behavior
// can be useful when reading file formats that distinguish individual gzip
// data streams or mix gzip data streams with other data streams.
// In this mode, when the Reader reaches the end of the data stream,
// Read returns io.EOF. The underlying reader must implement io.ByteReader
// in order to be left positioned just after the gzip stream.
// To start the next stream, call z.Reset(r) followed by z.Multistream(false).
// If there is no next stream, z.Reset(r) will return io.EOF.

See https://github.com/golang/go/blob/724d0720b3e110f64598bf789cbe2a6a1b3b0fd8/src/compress/gzip/gunzip.go#L125.

Perhaps we could read each stream separately - that way we could keep track of the file offset at which each record starts, provided the gzip reader doesn't buffer past the end of a stream. This wouldn't help us with archives that are compressed as a single gzip stream, though.
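Assuming the archive is compressed record-at-a-time (one gzip member per record, as the WARC specification recommends for .warc.gz), that could look something like the sketch below. byteReader repeats the counting trick from the previous comment, and archive.warc.gz is again a placeholder:

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"os"
)

// byteReader counts consumed bytes. Implementing io.ByteReader makes
// gzip leave the reader positioned exactly after each stream, and n
// tells us where that is.
type byteReader struct {
	r io.Reader
	n int64
}

func (b *byteReader) Read(p []byte) (int, error) {
	n, err := b.r.Read(p)
	b.n += int64(n)
	return n, err
}

func (b *byteReader) ReadByte() (byte, error) {
	var buf [1]byte
	n, err := io.ReadFull(b.r, buf[:])
	b.n += int64(n)
	return buf[0], err
}

func main() {
	// "archive.warc.gz" is a placeholder; this assumes one gzip
	// member per WARC record.
	f, err := os.Open("archive.warc.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	r := &byteReader{r: f}
	zr, err := gzip.NewReader(r) // reads the first member's header
	if err != nil {
		panic(err)
	}

	offset := int64(0) // the first record starts at the beginning
	for {
		zr.Multistream(false) // stop at the end of this member

		// In real code we'd parse the record here instead of discarding it.
		n, err := io.Copy(io.Discard, zr)
		if err != nil {
			panic(err)
		}
		fmt.Printf("record at file offset %d (%d decompressed bytes)\n", offset, n)

		offset = r.n // the next member starts right after this one
		if err := zr.Reset(r); err == io.EOF {
			break // no more records
		} else if err != nil {
			panic(err)
		}
	}
}
```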