facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Streaming decompress into fragmented (but stable) output without history buffer? #2723

Open dotnwat opened 3 years ago

dotnwat commented 3 years ago

Thank you for this amazingly powerful library. I think we (at https://github.com/vectorizedio/redpanda) may have a unique use case that appears to be almost covered by the API, but there is still an open question about a viable solution, and I'm hoping to get some advice on the limits of the current API.

tl;dr: is it possible to use ZSTD_c_stableOutBuffer (or rather its decompression-side counterpart, ZSTD_d_stableOutBuffer) with a stable (but fragmented) output buffer? We operate in an environment where memory fragmentation prevents on-demand allocation of contiguous regions (even 2 MB), which would otherwise be needed in our use case for the streaming decompression history buffer. We don't control the compression process, and while we do impose limits, when those limits derive from our inability to acquire contiguous memory regions larger than 1-2 MB, the restrictions are too severe.

Full details of the use case:

We are using zstd to decompress source data that is stored in a heap allocated fragmented buffer (not folly::IOBuf chains, but similar idea):

source = [buf-0, buf-1, ...]

Because the source data is not stored in contiguous memory, we cannot use the single-shot API, so we rely on streaming decompression, where the output data is also stored in a fragmented buffer.

The problem we are facing arises with the size requirements of the history buffer that streaming decompression requires. Specifically we care about limiting the size of any contiguous region of memory.

It is probably relevant to state now that we do not control the compression process, so we can't enforce specific limits on the window size. Because we can't control the window size, my understanding is that even the buffer-less API is not helpful, because the caller must still provide a contiguous memory region large enough for the history, which in turn depends on the compression process.

However, we operate in an environment where memory becomes fragmented over time, and we cannot rely on the OS to provide us with a contiguous region via virtual memory on demand. The fragmentation becomes bad enough that it can be hard to find even 1-2 MB of contiguous memory. For us this would translate into an unreasonable restriction on our clients' compression process.

The current solution we are using is to statically allocate a sufficiently large buffer (say, 4 or 8 MB) during boot, when memory is not yet fragmented, and then use the experimental static-init API (ZSTD_initStaticDCtx) to provide it to the streaming decompression API.

I recently stumbled upon the PR https://github.com/facebook/zstd/pull/2094 that introduced ZSTD_c_stableOutBuffer, and from what I can tell it sidesteps the internal history buffer by relying on a stable output.

Are there other approaches to solving the problem we are facing?

felixhandte commented 3 years ago

@dotnwat,

Thanks for providing so much context!

I'm sorry to say that in general, the answer to your question is no. And unfortunately, it's not just an API limitation: the decoder implementation is written in such a way that it cannot decompress with a fragmented history buffer.

Fundamentally, the zstd decoder performs two operations in a loop: decoding sequences and executing them.

  1. Decoding a sequence involves recovering the LZ77 literal_length, match_length, and match_offset values from the entropy-encoded representation they're stored in (ANS Encoding via FSE).
  2. Executing a sequence is straightforward LZ77 decoding, and involves copying the next literal_length bytes from the decoded literals buffer (previously recovered from their Huffman-encoded representation) onto the tail of the output and then copying match_length bytes from match_offset bytes back in the history of the stream onto the tail of the output.
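Ignoring the entropy-decoding half, step 2 can be sketched in plain C (toy code, not the actual zstd implementation):

```c
#include <stddef.h>
#include <string.h>

/* Toy sketch of "executing" one LZ77 sequence: append literal_length bytes
 * from the decoded-literals buffer, then copy match_length bytes starting
 * match_offset bytes back in the already-produced output. The byte-at-a-time
 * match copy handles overlapping matches (offset < length), which is how
 * LZ77 encodes runs. Returns the new output length. No bounds checks. */
size_t execute_sequence(unsigned char *out, size_t out_len,
                        const unsigned char *literals, size_t literal_length,
                        size_t match_length, size_t match_offset)
{
    memcpy(out + out_len, literals, literal_length);
    out_len += literal_length;
    for (size_t i = 0; i < match_length; i++) {
        out[out_len] = out[out_len - match_offset];  /* may read the bytes just written */
        out_len++;
    }
    return out_len;
}
```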

The relevant part here is the implementation of the match copy operation. Since this is part of the hot loop / core of the most performance-sensitive part of Zstd, we want the lookup of the match position from the decoded offset to be as fast as possible, basically just current_position - match_offset. The cost of this fast mapping is that it requires that the whole window is contiguous...

Except technically, we do actually implement an exception to this. The history buffer is allowed to have a single discontinuity. In order to efficiently maintain a window-sized view of an arbitrarily large stream, the internal history buffer is a circular buffer (sized to the window size), which as it wraps around will map the window into two chunks. So the decoder is implemented to handle that. That's probably not sufficient for your use case, though, even if that support were plumbed through to external history buffers.
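As a toy model (again, not zstd's actual internals): once a circular window has wrapped, resolving a match offset to a physical index needs just one branch.

```c
#include <stddef.h>

/* Toy model of a circular history buffer of capacity `window` that has
 * already wrapped (so it is full). `head` is the index the next byte will
 * be written to; physically, [head, window) holds the oldest bytes and
 * [0, head) the newest. A match `offset` bytes back in the *logical*
 * stream lands in one of those two chunks. */
size_t history_index(size_t head, size_t window, size_t offset)
{
    return (offset <= head) ? head - offset              /* newest chunk */
                            : head + window - offset;    /* crosses the wrap point */
}
```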

Someone (you? us?) could potentially write a decoder implementation that supported arbitrary fragmentation at the cost of slower execution, but from an external view, the approach you're taking now is probably your most realistic option.

I hope that helps!

felixhandte commented 3 years ago

> This isn't ideal because we need to reserve based on a worst-case scenario (up to some limit), and, more importantly, this memory is pinned forever, even though clients may never send us compressed data.

If you just malloc() it but never use it, actual memory will never get allocated to back it. So you'll just be using the address space. Although I guess you're saying you're running out of that. Which really surprises me! Are you using the stock libc allocator? You might want to try https://github.com/jemalloc/jemalloc. It's pretty great. (Disclaimer: it's maintained by my teammates.)

dotnwat commented 3 years ago

Wow, @felixhandte, thanks for the details, and I really appreciate the quick response.

> performance-sensitive part of Zstd, we want the lookup of the match position from the decoded offset to be as fast as possible, basically just current_position - match_offset. The cost of this fast mapping is that it requires that the whole window is contiguous...

this was my hunch, but without having seen all of the zstd internals, I thought I should ask about fragmented memory :)

> Someone (you? us?) could potentially write a decoder implementation that supported arbitrary fragmentation at the cost of slower execution.

Do you have a sense of how much work that might be? IIUC, this would effectively be a (new) slow path: some array / index that records the fragment boundaries, plus an update to the offset math to consult that index?
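Concretely (hypothetical names, just to illustrate the shape of the idea), such an index might be a prefix sum of fragment sizes plus a binary search:

```c
#include <stddef.h>

/* Hypothetical fragment index: ends[i] is the exclusive logical end offset
 * of fragment i (a running prefix sum of fragment sizes). Translating a
 * logical stream position into (fragment, offset-within-fragment) costs a
 * binary search -- the slow-path overhead discussed above. */
typedef struct { size_t frag; size_t off; } frag_pos;

frag_pos locate(const size_t *ends, size_t n_frags, size_t logical_pos)
{
    size_t lo = 0, hi = n_frags;        /* find first frag with ends[i] > pos */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ends[mid] > logical_pos) hi = mid;
        else                         lo = mid + 1;
    }
    frag_pos p;
    p.frag = lo;
    p.off  = logical_pos - (lo ? ends[lo - 1] : 0);
    return p;
}
```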

I think the performance impacts would be interesting, and something we do care about a lot. In general we would be able to provide large-ish fragments say 32 KB to 128 KB, but under extreme conditions we might degrade to smaller sizes and still want to make progress.

> If you just malloc() it but never use it, actual memory will never get allocated to back it. So you'll just be using the address space.

You're right. But the concern wasn't so much that it would never be used; rather, if it is used even once in the lifetime of the process, we'd not be able to reclaim the space, independent of future workload behavior.

> Although I guess you're saying you're running out of that. Which really surprises me! Are you using the stock libc allocator? You might want to try https://github.com/jemalloc/jemalloc. It's pretty great. (Disclaimer: it's maintained by my teammates.)

Yes, it is surprising! We use a framework called Seastar (https://github.com/scylladb/seastar) to achieve as much kernel bypass as possible for low latency (thread-per-core, user-space networking, etc.). One thing it does that is relevant here is that it allocates all of its memory at start-up, to avoid any latency associated with fault handling, and it uses a per-core memory allocator (not even a shared pool like jemalloc or tcmalloc). This also means that we cannot rely on the kernel to help us get a large allocation if one is not available :) That is, such a performance issue would itself be treated as a bug.


Edit:

I forgot to mention that, yes, we are using a pre-allocation at startup (before fragmentation becomes an issue) to solve this problem. The last remaining question I have about this solution is what limits on window size we should consider "reasonable". We are using something like 8 MB, since we saw that listed in the docs, but when I was reading the kernel thread it sounded like the limit was higher. Would anything above 8 MB be unlikely? How should we think about that?