facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

RFE: decompressor can work with simple malloc #3453

Open jreiser opened 1 year ago

jreiser commented 1 year ago

Is your feature request related to a problem? Please describe.

Request For Enhancement (RFE): Please document, enhance, and maintain Zstd so that the decompressor can work with a malloc+free that uses a private allocation arena with various simple allocation strategies. This request arises from the desire to use Zstd decompression in primitive or "embedded" environments where general-purpose malloc+free is expensive in space, time, and/or development time.

Describe the solution you'd like

The simple allocation strategies (truly minimal complexity and code) for decompression are:

  1. Fixed-in-advance arena size and block placement, with offsets parameterized by 3 or fewer parameters. free() is a no-op. The run-time decompressor for "upx --lzma" uses a single alloca() of 12KiB to 58KiB, with internal divisions depending on the LZMA parameters literal_context_bits and position_bits, which have been restricted at compression time.

  2. malloc() allocates each (aligned) request adjacent to the previous allocation; free() is a no-op. The compressor reports the total size of all requests that will be made during decompression by the known decompressors.

  3. Stack allocation from one end of the arena: malloc() is a Push, free() is a Pop. The compressor reports the minimum size of the stack.

  4. Two stacks, growing towards each other from the ends of the arena. The decompressor code specifies which stack should be used by each call to malloc(). This allows separating allocations with long lifetimes from those with short lifetimes, which usually can reduce the size of the arena and simplify the scheduling of calls to free().

  5. Stack, but with mark()+release() as in Pascal. This can be even more flexible because release() removes the ordering restriction on what would otherwise be a sequence of calls to free(). Strategies 1 and 2 above (where free() is a no-op) are special cases: an implied mark() at the beginning of decompression and an implied release() at the end.

The return values from calling the compressor must include the minimum size of the private decompression arena for each of the simple allocation strategies.
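For concreteness, here is a minimal sketch of strategy 4 (one fixed arena with two stacks growing toward each other); the TwoStack type and the function names are illustrative only, not part of any existing API:

#include <stddef.h>

/* One fixed arena, two stacks growing toward each other.
 * Long-lived allocations come from the bottom, short-lived ones from the top;
 * each free() is just a pop of the corresponding stack. */
typedef struct {
    unsigned char* base;   /* start of the arena */
    size_t size;           /* total arena size in bytes */
    size_t lo;             /* bytes used from the bottom (long-lived) */
    size_t hi;             /* bytes used from the top (short-lived) */
} TwoStack;

static size_t roundUp16(size_t n) { return (n + 15) & ~(size_t)15; }

static void* ts_alloc_lo(TwoStack* a, size_t n)   /* long-lived: push on bottom stack */
{
    n = roundUp16(n);
    if (n > a->size - a->hi - a->lo) return NULL; /* the two stacks would collide */
    void* p = a->base + a->lo;
    a->lo += n;
    return p;
}

static void* ts_alloc_hi(TwoStack* a, size_t n)   /* short-lived: push on top stack */
{
    n = roundUp16(n);
    if (n > a->size - a->hi - a->lo) return NULL;
    a->hi += n;
    return a->base + (a->size - a->hi);
}

static void ts_free_lo(TwoStack* a, size_t n) { a->lo -= roundUp16(n); }  /* pop bottom */
static void ts_free_hi(TwoStack* a, size_t n) { a->hi -= roundUp16(n); }  /* pop top */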

Describe alternatives you've considered

Implement a general stand-alone malloc+free for the primitive environments.

Additional context

The follow-on request will be for a compressor that takes an incoming argument specifying the size of the arena that the decompressor can support. The compressor must choose parameters and a strategy (including which variant of decompression code) that fit within that maximum, taking advantage of the properties of each simple implementation of malloc+free during decompression.

felixhandte commented 1 year ago

Hi @jreiser,

Zstd has a couple existing mechanisms to support these scenarios:

Static Contexts

These functions allow you to construct a zstd context that will only use the provided buffer and will never call malloc() or free(). You can check how much memory is required to support the (de)compression of a message with the context size estimation functions.
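For example, a one-shot decompression that touches only a caller-supplied buffer might look roughly like this (a sketch: ZSTD_initStaticDCtx() and ZSTD_estimateDCtxSize() are exposed under ZSTD_STATIC_LINKING_ONLY, and the wrapper function itself is illustrative):

#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>

/* One-shot decompression using only the caller-supplied arena:
 * the arena must be 8-byte aligned and at least ZSTD_estimateDCtxSize() bytes. */
size_t decompress_in_arena(void* arena, size_t arenaSize,
                           void* dst, size_t dstCapacity,
                           const void* src, size_t srcSize)
{
    ZSTD_DCtx* const dctx = ZSTD_initStaticDCtx(arena, arenaSize);
    if (dctx == NULL) return (size_t)-1;   /* arena too small or misaligned */
    return ZSTD_decompressDCtx(dctx, dst, dstCapacity, src, srcSize);
}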

Custom Allocators

These allow you to pass in custom malloc() and free() functions into a context, which could implement the strategies you describe. When present, zstd will use these functions instead of the libc implementations.
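For illustration, a sketch wiring a trivial bump arena (strategy 2 above: free() is a no-op) into a decompression context via ZSTD_customMem; the arena size and helper names are made up for the example:

#define ZSTD_STATIC_LINKING_ONLY   /* ZSTD_customMem / ZSTD_createDCtx_advanced() */
#include <zstd.h>
#include <stddef.h>

typedef struct { unsigned char buf[1 << 20]; size_t used; } Arena;  /* size illustrative */

static void* arenaAlloc(void* opaque, size_t size)
{
    Arena* const a = (Arena*)opaque;
    size = (size + 15) & ~(size_t)15;                  /* keep allocations aligned */
    if (size > sizeof(a->buf) - a->used) return NULL;  /* arena exhausted */
    void* const p = a->buf + a->used;
    a->used += size;
    return p;
}

static void arenaFree(void* opaque, void* address)
{
    (void)opaque; (void)address;                       /* free() is a no-op */
}

static Arena g_arena;

ZSTD_DCtx* make_dctx(void)
{
    ZSTD_customMem const mem = { arenaAlloc, arenaFree, &g_arena };
    return ZSTD_createDCtx_advanced(mem);              /* NULL if arenaAlloc fails */
}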

Some of these allocation strategies impose rather tight constraints on the allocation pattern of zstd. Fortunately, zstd has basically the simplest possible allocation pattern! A cctx or dctx will only ever have two allocations live. The first is the actual ZSTD_CCtx or ZSTD_DCtx struct, which is allocated once and lives until the context is destroyed. The second is a single allocation for the working area of the context, which zstd internally partitions into the various data structures it needs to support the operation. If/when that arena is too small to support an operation, zstd will free it and then allocate a new bigger one. (It also has a policy to downsize after enough operations that use much less space than is allocated.)

Edit to add: see for example this use to give contexts huge pages.

Redefining Macros

The zstd source code accesses malloc() and free() (and calloc()) through the macros ZSTD_malloc() and ZSTD_free() (and ZSTD_calloc()). You may need to change the definitions to something else if your platform doesn't have an implementation of malloc (see, e.g., our redefinitions for the zstd built into the kernel).
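As a rough illustration (not the exact kernel definitions), a port without any malloc could point the macros in a local copy of lib/common/zstd_deps.h at its own routines; my_arena_alloc, my_arena_zalloc, and my_arena_free are hypothetical names:

/* Hypothetical replacement fragment for lib/common/zstd_deps.h */
#define ZSTD_malloc(s)      my_arena_alloc(s)
#define ZSTD_calloc(n, s)   my_arena_zalloc((n) * (s))   /* zeroing allocator */
#define ZSTD_free(p)        my_arena_free(p)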

I think that addresses everything you mentioned! If I missed something though, let me know.

jreiser commented 1 year ago

@felixhandte Hi Felix,

Your description of Static Contexts and Custom Allocators gives me a good start. In particular, "A cctx or dctx will only ever have two allocations live" tells me that the decompressor makes only 2 calls to ZSTD_malloc. [But doesn't a dictionary make a third call?] However, I am troubled by "If/when that arena is too small to support an operation, zstd will free it and then allocate a new bigger one", because in primitive environments "too small" is fatal: a bigger space is not available. The maximum amount of space must be known (or easily computable) before decompression starts. (And in the follow-on enhancement, the maximum size must be a constraint that is fed to the compressor.)

I am also troubled by this code in lib/common/fse_decompress.c:

size_t FSE_decompress(void* dst, size_t dstCapacity, const void* cSrc, size_t cSrcSize)
{
    /* Static analyzer seems unable to understand this table will be properly initialized later */
    U32 wksp[FSE_DECOMPRESS_WKSP_SIZE_U32(FSE_MAX_TABLELOG, FSE_MAX_SYMBOL_VALUE)];
    return FSE_decompress_wksp(dst, dstCapacity, cSrc, cSrcSize, FSE_MAX_TABLELOG, wksp, sizeof(wksp));
}

The local array wksp[] occupies space on the stack, and in primitive environments the stack often is severely constrained. In effect wksp is allocated, and should be allocated by ZSTD_malloc. Perhaps there are other similar cases of large allocations on the stack?

In a different way, I am also concerned about [the] single allocation for the working area of the context, which zstd internally partitions into the various data structures. Is it really the case that each of the partitions must be present at every point of execution, or could space be reduced (shared) by allocating and de-allocating at appropriate times? I desire that my custom malloc be consulted for every large or variable-sized allocation, and I wish for the code of zstd to be enhanced to allow stack-style (or 2-stack-style) management of space within the dctx arena.

felixhandte commented 1 year ago

@jreiser,

But doesn't a dictionary make a third call?

You're right, if you load a dictionary into the context, that may or may not internally trigger the creation of a ZSTD_CDict or ZSTD_DDict, which will be a separate allocation. In general, a better pattern would be to explicitly create these objects via ZSTD_createCDict_advanced() and ZSTD_createDDict_advanced() and then link to them via ZSTD_CCtx_refCDict() and ZSTD_DCtx_refDDict(). This will avoid surprise allocations.
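A hedged sketch of that pattern on the decompression side (the wrapper function and its parameters are illustrative):

#define ZSTD_STATIC_LINKING_ONLY   /* ZSTD_createDDict_advanced() */
#include <zstd.h>

/* Create the DDict explicitly (through a custom allocator) and reference it
 * from the DCtx, so loading the dictionary never triggers a hidden allocation
 * during decompression itself. */
size_t attach_dict(ZSTD_DCtx* dctx, const void* dictBuf, size_t dictSize,
                   ZSTD_customMem customMem)
{
    ZSTD_DDict* const ddict = ZSTD_createDDict_advanced(dictBuf, dictSize,
                                                        ZSTD_dlm_byRef,   /* don't copy the dictionary */
                                                        ZSTD_dct_auto,
                                                        customMem);
    if (ddict == NULL) return (size_t)-1;       /* allocation failed */
    return ZSTD_DCtx_refDDict(dctx, ddict);     /* dctx holds a reference only */
}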

The maximum amount of space must be known (or easily computable) before decompression starts.

If you're not doing streaming decompression, the ZSTD_DCtx will actually only require a fixed, small amount of space indicated by ZSTD_estimateDCtxSize() (in addition to the user-supplied input and output buffers).

If you need to do streaming decompression, you can use e.g. ZSTD_estimateDStreamSize_fromFrame() to know how much space is required. And by the way, if you can use ZSTD_d_stableOutBuffer, this can shrink the required memory significantly.
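A rough sketch of that sizing flow for a static streaming context (ZSTD_estimateDStreamSize_fromFrame(), ZSTD_initStaticDStream(), and ZSTD_d_stableOutBuffer are under ZSTD_STATIC_LINKING_ONLY; the arena handling around them is illustrative):

#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>

/* Size the workspace from the frame header, then build a static DStream in it. */
ZSTD_DStream* make_dstream(void* arena, size_t arenaSize,
                           const void* src, size_t srcSize)
{
    size_t const need = ZSTD_estimateDStreamSize_fromFrame(src, srcSize);
    if (ZSTD_isError(need) || need > arenaSize) return NULL;
    ZSTD_DStream* const ds = ZSTD_initStaticDStream(arena, arenaSize);
    if (ds != NULL)
        ZSTD_DCtx_setParameter(ds, ZSTD_d_stableOutBuffer, 1);  /* output buffer won't move */
    return ds;
}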

Perhaps there are other similar cases of large allocations on the stack?

@terrelln looked at zstd's maximum stack usage in the context of the kernel. He may be able to comment on how much we use.

Is it really the case that each of the partitions must be present at every point of execution, or could space be reduced (shared) by allocating and de-allocating at appropriate times? I desire that my custom malloc be consulted for every large or variable-sized allocation, and I wish for the code of zstd to be enhanced to allow stack-style (or 2-stack-style) management of space within the dctx arena.

Hmmm. Perhaps in some cases.

Especially during repeated compression though, zstd exploits the fact that it has maintained custody of the working memory to avoid re-initializing various internal data structures.

Why would it be useful for your malloc to be called for these internal allocations?

jreiser commented 1 year ago

If you're not doing streaming decompression, the ZSTD_DCtx will actually only require a fixed, small amount of space indicated by ZSTD_estimateDCtxSize() (in addition to the user-supplied input and output buffers).

There's a misunderstanding somewhere. Decompressing output from compression by lzma requires re-constructing the same Markov matrix of transition probabilities that the compressor constructed originally, and in particular the same number of bytes. So a zstd compression level of 18 or more requires megabytes of space. I don't believe that the decompressor always can cram the matrix into unused portions of the output buffer, or into discarded portions of the input buffer, particularly for overlapping decompression. And anyway, the usage of "unused portion of output buffer (i.e., not used for re-constituted original input)", and if/when the entire input buffer can be considered const, must be documented explicitly. So, I desire more explanation about where the matrix lives. (Probably it resides on the stack; but a byte consumes memory regardless of where it lives.) Zstd has more compression choices than just lzma, but that means more documentation about the space required during decompression.

[upx --lzma restricts the compressor's parameters and choices in order to guarantee that decompression can be done in small space (less than 64KiB), yet still achieves noticeably better compression than other LZ* methods that remember only one active offset.]

Why would it be useful for your malloc to be called for these internal allocations?

In addition to small, simple, and fast code, primitive environments want small space usage. If dynamic sharing of the space required for the internal allocations can save even a few kilobytes, then that is a win.

Cyan4973 commented 1 year ago

Decompressing output from compression by lzma requires re-constructing the same Markov matrix of transition probabilities that the compressor constructed originally, and in particular the same number of bytes. So a zstd compression level of 18 or more requires megabytes of space. I don't believe that the decompressor always can cram the matrix into unused portions of the output buffer,

The Zstandard decompression state ZSTD_DCtx only employs a fixed amount of memory, whatever the compression level originally used to compress the data. (This holds for one-shot decompression scenarios, which I assume is the case here; they are the most frugal memory-wise, and the only ones that make sense for embedded environments.)

Unlike lzma, zstd doesn't require any variable "matrix space". Indeed, having just a very small state to update is part of the reasons why zstd decompression speed can be much faster.

terrelln commented 1 year ago

size_t FSE_decompress(void* dst, size_t dstCapacity, const void* cSrc, size_t cSrcSize)

Zstd does not use this function. Originally the FSE library was consumed from upstream, and periodically synced, so we had to keep functions that zstd does not use. We've since decided to fork it, so we're free to delete those functions. We've done that for Huffman, but haven't for FSE yet.

Zstd's stack usage is tested in CI; it varies slightly by compiler and by the function called, but comes in under 2KB in total for any compression or decompression function. IIRC decompression is smaller than that, more like 1KB, but I'd have to double check.

Upstream zstd is used as-is in the Linux kernel, and doesn't run into stack issues there. And we're committed to not expanding our stack usage, so we can continue to sync-in upstream zstd to the Linux kernel.

terrelln commented 1 year ago

Zstd does not use this function

I'll just go ahead and delete the unused FSE functions now, no reason to wait.

jreiser commented 1 year ago

I see that 423a74986f7416ca8f54b20d0fe880f2d232b213 of almost two weeks ago has removed size_t FSE_decompress(void* dst, size_t dstCapacity, const void* cSrc, size_t cSrcSize), so that's one fewer decompression subroutine with significant stack usage. However, 488f7c0ff1ea903e3beaa616fb0af311f18fea84 of today still has, in file lib/common/fse_decompress.o:

FSE_decompress: eda: 48 81 ec 20 42 00 00 sub $0x4220,%rsp

which is 16.5 KB.

Turning to Huffman decompression in file lib/decompress/huf_decompress.o:

HUF_decompress4X2: 9a89: 48 81 ec 18 48 00 00 sub $0x4818,%rsp

HUF_decompress1X2: 9c79: 48 81 ec 18 48 00 00 sub $0x4818,%rsp

which is two instances of 18 KB.

How can such large stack usage be avoided (and with what consequences?), or turned into malloc+free?

Cyan4973 commented 1 year ago

This is the same idea as before here. Both functions are part of a more generic entropy library, but are actually unused within libzstd. Since we stopped synchronizing with the entropy library, we could now specialize this file for libzstd, and remove these symbols, with no impact on libzstd. There is little urgency for this action because DCE is expected to already eliminate these bodies automatically at link time anyway, so they are not really present in the final binary. No stack was ever harmed by these symbols.