facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Idea: adaptive compression #75

Closed annulen closed 6 years ago

annulen commented 8 years ago

I need to send large core dumps from an embedded device to a server, and do it in minimal time. I've found zstd to be the optimal solution because it is very fast and has a good compression ratio. I'm using it like this:

zstd  -c | curl -T - $url

where the kernel fills zstd's stdin.

However, if the user has narrow upload bandwidth, it could be beneficial to switch from fast compression to high compression, which in my case is about 30% more effective. For example, if in the first 10 seconds zstd detects that compression throughput is N times larger than the write speed (in this case, to stdout), it would automatically switch to high compression and use it until the end of the input stream.
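
Concretely, the decision could look something like this (a rough sketch with made-up counters, not anything zstd provides today):

    /* Hypothetical probe: bytes_in are bytes produced by the compressor so far,
     * bytes_out are bytes the output (stdout/curl) has actually accepted.
     * After probe_seconds, switch to the high level if compression outruns
     * the upload by more than a factor N; otherwise stay on the fast level. */
    static int pick_level(double elapsed_seconds, double probe_seconds,
                          unsigned long long bytes_in, unsigned long long bytes_out,
                          int fast_level, int high_level, double N)
    {
        if (elapsed_seconds < probe_seconds || bytes_out == 0)
            return fast_level;                      /* still measuring */
        double ratio = (double)bytes_in / (double)bytes_out;
        return (ratio > N) ? high_level : fast_level;
    }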

Would such a feature make sense?

Cyan4973 commented 8 years ago

Yes. I have had this idea in mind for quite some time.

I believe it is first necessary to implement multi-threading, in order to perform I/O and compression in parallel. Then it becomes possible to measure and compare both speeds.

That's not an easy feature, so it will take some time to get it right.

stati64 commented 8 years ago

We do this with:

  1. a monotone list of Pareto convex frontier compressor+parameter pairs, e.g. raw, lz4+256, lz4+32, lz4+2, zstd, lzlib+0, lzlib+3
  2. an idle-time goal, e.g. 5%
  3. 16-step pulse-width modulation between adjacent members of 1 (blending sketched below), e.g.:

     typedef union {
         u32 level;
         struct {
             u08 smooth2 : 4;   // smoothing
             u08 blend2  : 4;   // blending adjacent compressors
             u08 basec2  : 8;   // base compressor
             u16 fill2   : 16;  // unused
         } bit2;
     } blender;
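
Roughly, the blending picks a compressor per block like this (a simplified illustration of the idea, not our production code; it reuses the blender type above):

    /* 16-step PWM blend: for block number n, blend2 out of every 16 blocks are
     * promoted to the next (stronger) compressor on the frontier; the rest use
     * the base compressor selected by basec2. */
    static int compressor_for_block(unsigned n, blender b)
    {
        int promote = (int)(n % 16) < (int)b.bit2.blend2;
        return (int)b.bit2.basec2 + (promote ? 1 : 0);
    }
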
renevdzee commented 8 years ago

I've been thinking about the same thing for about a year and came up with the following algorithm to optimize for the slowest resource in the chain: create n+2 threads: n compression threads, 1 input thread, and 1 output thread.

When the input buffer is below a certain level or the output buffer is near-full, a higher compression level is selected; with a full input buffer or an empty output buffer, a faster compression level is chosen, and so on.

When used over a network connection, whenever there is congestion in the network or on the receiving side (CPU or I/O), the output buffer fills up and the compression threads switch to a higher compression ratio, thus optimizing for the highest end-to-end throughput.
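
A minimal sketch of that decision (illustrative only: in_fill and out_fill are hypothetical fill ratios of the shared buffers, and the thresholds are arbitrary):

    /* Called before compressing the next block. in_fill / out_fill are the
     * current fill ratios (0.0 .. 1.0) of the input and output buffers. */
    static int adjust_level(int level, double in_fill, double out_fill)
    {
        if (in_fill < 0.25 || out_fill > 0.75) {
            if (level < 19) level++;   /* compression is the fast side: spend more CPU */
        } else if (in_fill > 0.90 || out_fill < 0.10) {
            if (level > 1) level--;    /* compression is the bottleneck: back off */
        }
        return level;
    }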

regards, René

Cyan4973 commented 8 years ago

@renevdzee's algorithm is very good, and is basically what I had in mind.

That being said, in a streaming scenario, the number of compression parameters that can be changed while preserving the compression context is very small. Currently, among the multiple parameters accessible through the _advanced() functions, only one, searchLog, can be safely modified between blocks while preserving the frame's compression context. Another one, strategy, can sometimes be changed, but not always, which makes it more difficult to adapt. This still gives some room for adapting speed, but it cannot range from level 1 to 20.

Once these restrictions are accepted, it becomes possible to adapt @renevdzee's methodology.

annulen commented 8 years ago

@Cyan4973: What do you mean by "strategy can sometimes be changed, but not always"? Are there any specific conditions?

Cyan4973 commented 8 years ago

Oh yes, it's a bit complex though.

"fast" and "btlazy2" have their own specific table layout, incompatible with other strategies. So it's not possible to swap them "on the fly". The 3 remaining strategies have compatible table layouts, so could be swapped.

Now that I'm thinking again about it, "btlazy2" has additional requirements : it's possible to reduce "searchLog" on the fly without breaking the tables, but increasing this value is less clear. I believe it's possible, because the search ends by cleaning the tail of the tree, effectively reducing search length for future searches and inserts, so it should be okay, but without a test, there are remaining risks...

annulen commented 8 years ago

I don't care much about btlazy2, but I'm going to start from "fast" by default.

annulen commented 8 years ago

In my particular case I think it may be possible to proceed without preserving the compression context: just choose the "best" params for the next ~~100MB chunk~~ 5MB of output and start from scratch.

Cyan4973 commented 8 years ago

Yes, 100 MB is a fairly large size; it's pretty reasonable to use that value to cut the input stream into independent chunks.
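
With independent chunks, each one is just a one-shot compression at whatever level was chosen for it; a sketch of the per-chunk call (the surrounding chunking and level-selection logic is left to the caller):

    #include <zstd.h>

    /* Compress one independent chunk at the level chosen for it.
     * dst must hold at least ZSTD_compressBound(srcSize) bytes.
     * Each call produces a complete, independent frame, so the level
     * can change freely from one chunk to the next. */
    static size_t compress_chunk(ZSTD_CCtx* cctx,
                                 void* dst, size_t dstCapacity,
                                 const void* src, size_t srcSize,
                                 int level)
    {
        return ZSTD_compressCCtx(cctx, dst, dstCapacity, src, srcSize, level);
    }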

annulen commented 8 years ago

I wrote 100 MB at first, but then realized that my data (a core dump) is heterogeneous, and it's more reasonable to split by output size instead.

annulen commented 8 years ago

Is it possible to reuse only the dictionary instead of the full compression context?

Cyan4973 commented 8 years ago

@annulen: in theory, yes, it is.

The problem is that loading a dictionary is, in itself, a costly operation, and the higher the compression ratio, the more it costs.

In the end, most of the time might be spent just loading dictionaries, so it does not seem interesting as a speed optimisation.

chadnickbok commented 8 years ago

I'd argue that dynamic compression should remain outside the scope of this library.

Instead, I'd argue that an approach similar to that of real-time video compression libraries (like x264) would be more appropriate: focus on providing an API with calls for changing things like compression complexity on the fly, and let others build tools that implement algorithms for specific scenarios.

X-Ryl669 commented 7 years ago

Another use case would be a backup project where some statistics are computed on a buffer to be compressed and, depending on those statistics, either the low or the high compression ratio is used (maybe 2 instances?). In that case, you don't select based on time spent or FIFO pressure but rather on the data itself. If the entropy is high, it's very likely the input data is already compressed (or not compressible), and it's worthless to burn CPU with the high-compression algorithm. Similarly, if the data is redundant, then it makes sense to try to compress it harder.

Said differently, spend your effort where it matters most.
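
For illustration, the selection could be as simple as an order-0 entropy estimate on a sample of the buffer (a sketch of mine; the thresholds are arbitrary):

    #include <math.h>
    #include <stddef.h>

    /* Order-0 entropy of a sample, in bits per byte (0 = constant, 8 = random). */
    static double sample_entropy(const unsigned char* buf, size_t len)
    {
        size_t counts[256] = {0};
        double h = 0.0;
        if (len == 0) return 0.0;
        for (size_t i = 0; i < len; i++) counts[buf[i]]++;
        for (int s = 0; s < 256; s++) {
            if (counts[s] == 0) continue;
            double p = (double)counts[s] / (double)len;
            h -= p * log2(p);
        }
        return h;
    }

    /* Pick a level from the estimate: near-random data gets the cheap path. */
    static int level_for_buffer(const unsigned char* buf, size_t len)
    {
        double h = sample_entropy(buf, len);
        if (h > 7.5) return 1;    /* likely already compressed */
        if (h > 6.0) return 3;
        return 19;                /* redundant data: worth compressing hard */
    }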

paulcruz74 commented 7 years ago

I've recently been working on a new utility that acts as a wrapper for ZSTD and provides an algorithm for adapting compression level based on data flow. The utility is currently included in the repository under the contrib directory as adaptive-compression.

The algorithm works by breaking the data into 40MB chunks and creating three threads (reading, compressing, writing). Each 40MB chunk becomes a compression "job". For each job, a set of two buffers is used: the reader thread fills the first buffer with 40MB worth of data, the compression thread compresses it into the second buffer, and finally the writer thread writes from the second buffer. Because the reader and writer threads operate on separate buffers, they run concurrently and only have to wait on the compression thread; conversely, the compression thread must wait on both the reader and the writer threads.

The compression level is determined in the compression thread right before actually compressing the data. At a high level, if the compression thread was previously waiting, compression is too fast and we can increase the compression level. Similarly, if the writer or reader thread is waiting, we are compressing too slowly and should decrease the compression level. I have also implemented a few mechanisms to determine how much of its job each thread had completed before another had to wait. This can be helpful in situations such as when the compression thread was waiting on the writer thread, but the writer thread was already 99% finished.
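
Simplified, the per-job level update looks roughly like this (an illustration of the idea, not the exact code shipped in contrib/adaptive-compression):

    #include <zstd.h>

    /* waiter_completion reports how far the waited-on thread had progressed
     * through its 40MB job (0.0 .. 1.0) when the wait started. */
    typedef struct {
        int compression_waited;   /* compression thread had to wait on I/O threads */
        int io_waited;            /* reader or writer had to wait on compression */
        double waiter_completion; /* progress of the thread that was waited on */
    } jobStats;

    static int nextCompressionLevel(int level, const jobStats* s)
    {
        if (s->compression_waited && s->waiter_completion < 0.95) {
            if (level < ZSTD_maxCLevel()) level++;  /* compressing too fast */
        } else if (s->io_waited) {
            if (level > 1) level--;                 /* compressing too slowly */
        }
        return level;
    }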

So far, I have tested the tool with artificial tests (limited pipe speeds), as well as with some ssh tests that copy files over a network. In scenarios where the static compression level is too low (compression faster than the pipeline speed), the algorithm improves the compression ratio. Furthermore, the tool does not bottleneck the pipeline (which would occur when the static compression level is too high).

Since this is a long-standing request, it would be interesting to see if anyone has any feedback, questions, or thoughts on the tool.

annulen commented 7 years ago

Do you have any source code publicly available?

In my use case (compression of core dumps) the data is read from a pipe. I wonder if it's still a good idea to use a separate reading thread in that case. Also, the chunk size needs to be tunable.

paulcruz74 commented 7 years ago

The code is available in contrib/adaptive-compression with an included Makefile for building it.

data-man commented 6 years ago

> The code is available in contrib/adaptive-compression with an included Makefile for building it.

It does not compile; there are several errors (with the latest dev branch).

Cyan4973 commented 6 years ago

Fair enough. We haven't integrated /contrib projects into CI, and as a consequence, do not regularly check if they still compile. That could be improved.

annulen commented 6 years ago

It would be nice to have the contrib/adaptive-compression program combined with contrib/pzstd.

Cyan4973 commented 6 years ago

The feature will be integrated into the regular zstd command-line utility, via the --adapt flag. See #1327.

Cyan4973 commented 6 years ago

The --adapt feature is now present in zstd v1.3.6.