Closed by Artoria2e5 6 months ago
I am an idiot: the 1000 is the timeout for liblzma's internal buffering, in milliseconds, and that is a reasonable amount. The actual block_size is 0, i.e. "let liblzma decide". The dictionary concern is probably also not too bad, considering how much data can pass through in one second.
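For context, here is a minimal sketch (assumed typical liblzma usage, not Boost's actual code) of the lzma_mt options that lzma_stream_encoder_mt takes. The point is that timeout (milliseconds) and block_size (bytes) are separate fields, so a literal 1000 means something very different depending on where it lands:

```cpp
#include <lzma.h>

// Minimal sketch, not Boost's code: the two easily-confused fields live in
// the same options struct but have very different meanings.
bool init_mt_encoder(lzma_stream *strm) {
    lzma_mt mt = {};                     // zero-initialize, including reserved fields
    mt.threads    = 1;                   // number of worker threads
    mt.timeout    = 1000;                // internal buffering latency, in milliseconds
    mt.block_size = 0;                   // bytes per block; 0 = let liblzma pick a default
    mt.preset     = LZMA_PRESET_DEFAULT; // used because no custom filter chain is given
    mt.filters    = nullptr;
    mt.check      = LZMA_CHECK_CRC64;
    return lzma_stream_encoder_mt(strm, &mt) == LZMA_OK;
}
```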
In https://github.com/boostorg/iostreams/pull/95, Boost.Iostreams gained the ability to call lzma_stream_encoder_mt. However, Boost currently uses a chunk size of 1000 bytes. This is about 1000 times smaller than what the liblzma documentation recommends: https://github.com/frida/xz/blob/e70f5800ab5001c9509d374dbf3e7e6b866c43fe/src/liblzma/api/lzma/container.h#L82
Because of how xz multithreading works, we are basically chunking the input down to 1000-byte fragments and feeding them to separate compressor instances. I expect this to cause serious issues with the compression ratio, though I haven't written any code to confirm it yet. For a sense of scale, a 100 MiB input would be split into roughly 100,000 independent blocks, none of which can reference data from the others.
In addition, because lzma_stream_encoder_mt (and therefore the chunking) is used even with threads_ set to 1, this compression-ratio regression should affect every build that doesn't define BOOST_IOSTREAMS_LZMA_NO_MULTITHREADED.

What to do?
Can we increase the chunk size to a reasonable minimum, such as 1 MiB? What was the small chunk size trying to avoid, and is that really a big concern? Surely liblzma does its own buffering, right?
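To make that concrete, here is a hypothetical helper (pick_block_size is made up, not an existing Boost or liblzma function) that follows what the container.h comment appears to recommend, roughly three times the LZMA2 dictionary size with a floor of about 1 MiB:

```cpp
#include <lzma.h>
#include <algorithm>
#include <cstdint>

// Hypothetical helper: derive a sensible block size from the preset instead of
// hard-coding a tiny value. Falls back to 1 MiB if the preset is invalid.
std::uint64_t pick_block_size(std::uint32_t preset) {
    const std::uint64_t min_block = 1 << 20;   // 1 MiB floor, as suggested above
    lzma_options_lzma opt;
    if (lzma_lzma_preset(&opt, preset))        // nonzero return means invalid preset
        return min_block;
    return std::max<std::uint64_t>(min_block, 3ull * opt.dict_size);
}
```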
Tangential: dictionary size, passing options
At the same time, there is also no point in using a dictionary bigger than the chunk size or the source-data size, whichever is smaller. If we are chunking, the maximum reasonable dictionary size is simply the chunk size.
Right now we just let people choose their xz preset, but that causes essentially useless scaling of the dictionary size, and with it RAM usage. The only parts of the preset that matter here are the CPU-effort parameters; the dictionary SHOULD NOT be scaled past the chunk size. The xz manpage makes a suggestion for exactly this.
Boost should, ideally, do something similar: parse the preset level, keep the effort parameters, and cap the dictionary size, once we figure out the chunk-size question.
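A rough sketch of that idea, assuming we have already settled on a block size (init_capped_encoder is hypothetical, not Boost's implementation): parse the preset with lzma_lzma_preset, cap dict_size, and pass the resulting filter chain to lzma_stream_encoder_mt.

```cpp
#include <lzma.h>
#include <algorithm>
#include <cstdint>

// Hypothetical sketch: keep the CPU-effort parts of the preset, but never let
// the dictionary exceed the block size, since data outside the current block
// can never be referenced anyway.
bool init_capped_encoder(lzma_stream *strm, std::uint32_t preset,
                         std::uint64_t block_size, std::uint32_t threads) {
    lzma_options_lzma opt;
    if (lzma_lzma_preset(&opt, preset))    // nonzero return means invalid preset
        return false;

    // Cap the dictionary at the block size, but keep it above liblzma's minimum.
    opt.dict_size = static_cast<std::uint32_t>(std::max<std::uint64_t>(
        LZMA_DICT_SIZE_MIN,
        std::min<std::uint64_t>(opt.dict_size, block_size)));

    lzma_filter filters[] = {
        { LZMA_FILTER_LZMA2, &opt },
        { LZMA_VLI_UNKNOWN, nullptr },
    };

    lzma_mt mt = {};
    mt.threads    = threads;
    mt.block_size = block_size;
    mt.timeout    = 1000;      // keep the existing buffering timeout (milliseconds)
    mt.filters    = filters;   // custom chain; the mt.preset field is then unused
    mt.check      = LZMA_CHECK_CRC64;
    return lzma_stream_encoder_mt(strm, &mt) == LZMA_OK;
}
```

As far as I can tell, the preset field of lzma_mt is ignored once a filters chain is supplied, so all the effort knobs come from the (capped) lzma_options_lzma.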