facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net
Other
23.74k stars 2.11k forks source link

seekable_format/examples/parallel_compression.c is not parallel #3980

Open EbiArnie opened 8 months ago

EbiArnie commented 8 months ago

Describe the bug The parallel compression example does not seem to be working in parallel. It runs on a single thread only, regardless of how many threads are requested.

To Reproduce

  1. Build the example parallel_compression
  2. Run: ./parallel_compression data.txt $((2*1024*1024)) 4 data.txt can be any sufficiently large file so that compression takes a handful of seconds. In my case, this is ASCII data, 2.8 GB.
  3. While the compression is running, look at e.g. top
  4. Just a single thread is used. This is usually slower than the single threaded seekable_compression,

Expected behavior Parallel execution should use multiple cores and be faster than single threaded. The full zstd executable works as expected; given the -T option it is faster than without and saturates all cores.

Also, I'd love for --seekable to become a standard option for zstd, but that should probably be a separate issue.

Desktop:

vasi commented 1 month ago

This is because the parallel_compression example builds against the static library, which builds by default with multi-threading off.

You can get it working like so, from the examples directory:

make -j12 -C ../../../lib clean libzstd.a CPPFLAGS="-DZSTD_MULTITHREAD"
make parallel_compression
./parallel_compression some_large_file 100000 12

There's honestly a lot of problems with this example:

  1. We're using POOL_* functions in such a way that the entire output is stored in memory!
  2. If the POOL_* functions were working as the author seems to intend, it would be racy, since there's no barrier between writing job->done and checking jobs[i].done.
  3. You can't use a non-seekable input file (eg: stdin), since we allocate an array of jobs based on the file size.
  4. The CLI for parallel_compression isn't great. You can't specify an output filename, or a compression level, for example. And you have to understand what is a reasonable frame size.
  5. You can't build by just passsing -j12 or CPPFLAGS to make parallel_compression, because the makefile builds the library using an entirely different make invocation.
  6. You can't link against the dynamic library (which does have multi-threading enabled by default), because we use the internal POOL_* functions.

To fix these:

@Cyan4973 I'd be happy to look into this further, but I'd like guidance on what is/isn't welcome in this repo before I start if you don't mind. (I've written one of the most performant parallel-xz implementations, if you need proof I can do this.) Is there willingness to investigate adding --seekable to the standard zstd binary? If not, what sorts of changes to the example would feel ok to you, and which wouldn't?

Thanks!

EbiArnie commented 1 month ago

Hey, thanks for the detailed analysis.

Having --seekable be part of default zstd would be very nice indeed. Meanwhile, I'm using exactly the tool you linked to, t2sz.

As far as this example goes - it would be good to have working code of course.

Cyan4973 commented 1 month ago

Yes, I think it's a good idea to start working on this extension. --seekable seems like a nice way to express it.

I'm slightly concerned about the general complexity of the complete solution, but as long as it can be broken down into smaller incremental steps of tractable complexity, I'm sure it will be possible to progress regularly towards the goal.

vasi commented 1 month ago

Yes, I'm a little concerned about the complexity as well. A few reasons:

The options I see for expressing seekable mode are:

  1. Add seekable mode internal to libzstd, as a parameter. We add a paramter like ZSTD_c_frameSize=BYTES. Then make ZSTD_compressStream2 and friends just write frame headers/footers at the right places. We just re-use our existing concurrency model. The docs change to make it clear that compressStream2 can output multiple frames. a. There are a few potential variants of this, eg: making the parameter "number of blocks in a frame", or reusing jobSize as the frame size. b. getFrameProgression() becomes very poorly named now, ugh. Maybe we add an alias getProgression()?
  2. Add seekable mode internal to libzstd, but with new API functions to call. This could prevent changes to the implied single-frame behavior of functions like compressStream2. But it may complicate the API needlessly.
  3. Add seekable mode only in the zstd program, with no changes to libzstd. I really don't like this, it doesn't seem fun to add concurrency to zstd in a way that doesn't interfere with the concurrency already present in libzstd.

I'm partial to option 1.

Cyan4973 commented 1 month ago

The current seekable format is designed around the idea that each independent chunk is a frame. So yes, by design, it's a multi-frames topic.

Changing ZSTDMT_* to output multiple frames instead of a single one is technically possible. It's not even that difficult to imagine: just close and restart a frame at decided breakpoints. It introduces a few more questions about the nature of the interface, because now additional information (boundaries between frame, or eventually content size within each frame) becomes useful to transmit (to write the jump table at the end for example).

But it's also a high risk operation, because this code is used to compress everything from the zstd CLI. So a subtle vulnerability introduced here would have a pretty huge blast radius.

Option 3 has the advantage to contain the risks to the newly added code only, with less chances to impact normal operations.

A possible intermediate could be to duplicate zstdmt_compress in order to make all the modifications there, eventually taking over the original one if it manages to become a clear and safe superset. Maybe it's what you called Option 2.

vasi commented 1 month ago

Ok, I will try that approach. I'm going on vacation soon, so maybe in 6 weeks I'll back with more :)

EbiArnie commented 1 month ago

Sounds promising. I would still be happy with --seekable even without parallelism, but it sounds like you have it mostly figured out already. Thanks!