facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net
Other
23.82k stars 2.11k forks source link

Block splitter control parameter #4180

Closed Cyan4973 closed 3 weeks ago

Cyan4973 commented 4 weeks ago

Make it possible to explicit select a block splitter level, via a new CCtx parameter ZSTD_c_blockSplitter_level.

This capability is then exploited in a test, ensuring that incompressible data is not overly split, even in presence of an adversarial input (with full knowledge of the sampling pattern).

Note: possible follow-up: This PR just adds a new parameter, to control the behavior of the new block splitter. It doesn't modify the existing parameters.

But as a consequence, there are now 2 parameters for block splitters, one (legacy) that is controlling the post block-splitter (after sequences are determined) and a new one that is controlling the new pre block-splitter (before sequences are produced). Already, it's debatable if it's useful for a user to be exposed to these concepts. More importantly, the distinction between pre and post block splitter is not clear from the current parameters' names (ZSTD_c_useBlockSplitter vs ZSTD_c_blockSplitter_level).

So it opens the question of a refactoring of these parameters. For example, maybe both parameters could be fused into a single one, the new ZSTD_c_blockSplitter_level, that would be charged to enable both when level is high enough. Or maybe there is still value in keeping both these parameters separated, for example for an optimizer tool which could more naturally influence both code paths and maybe find a better combination for some specific use case. In which case, it's probably still useful to debate about meaningful parameter names.

Cyan4973 commented 3 weeks ago

I think I'll keep both parameters, because both splitting methods (before and after sequences) may evolve independently, and the way they combine or compete could change over time.

But it's necessary to change the names, so that it's less misleading.

In particular, ZSTD_c_useBlockSplitter implies a full on/off control over anything block-splitter related, but that's not what this parameter is doing: it only controls the second splitter (which used to be the only splitter), which is triggered after sequences determination. So the parameter name should reflect that scope.

Current name in mind: ZSTD_c_sequenceSplitter, which emphasizes the fact that it's related to sequences. Importantly, it makes it clear that ZSTD_c_blockSplitterLevel and ZSTD_c_sequenceSplitter are 2 separate decisions. However, the name could also be understood as splitting sequences, as opposed to splitting blocks according to sequences already found.

Cyan4973 commented 3 weeks ago

Finally settled on ZSTD_c_splitAfterSequences.