Automatically exported from code.google.com/p/lz4

performance issues when piping data to lz4 #129


GoogleCodeExporter commented 9 years ago
Hi,

I'm trying to compress a large chunk of data, produced on the fly by another 
program, with the lz4 command line tool. The program writes the data (about 
8GB) to stdout or to a FIFO (like the ones created with mkfifo). Then I either 
let lz4 read from stdin or from the FIFO. As you may know, the buffer size of 
a pipe is about 64KB (this applies both to the | operator as implemented by 
your shell and to FIFOs created with mkfifo). I have no idea how to increase 
this buffer size.

Examples:
1) write8GB /dev/stdout | lz4 -1 - output.lz4
2) mkfifo myFifo; write8GB myFifo & lz4 -1 myFifo output.lz4

Now with default settings (-B7), it takes 22 seconds until the whole process 
is complete. With -B4, on the other hand, it takes only 13 seconds. The time 
is almost cut in half.

The issue here is that the program producing the data has to wait when the pipe 
is full and lz4 is compressing. Also, if the pipe is suddenly empty because lz4 
reads the next block, then lz4 has to wait for the program to produce the data.

Actually, the two programs never really run in parallel, wasting precious 
time. This is typically not an issue with compressors that read data in 
smaller chunks.

It would be nice, IMHO, if lz4 had a buffer of two or three blocks, and if an 
internal thread of lz4 read ahead while the current block is being compressed. 
A similar problem might exist while decompressing.

There is a program called buffer (i.e. you execute cmd1|buffer|cmd2 instead of 
cmd1|cmd2), but for some reason it wastes a LOT of CPU time (almost as much as 
lz4 itself!), maybe because it copies the data around.

I realize that I could also blame the program that's producing the data (it 
does not produce the data ahead of time), but you will find that most tools 
don't (tar, for example).

Original issue reported on code.google.com by sven.koe...@gmail.com on 13 May 2014 at 1:03

GoogleCodeExporter commented 9 years ago
Hi Sven,
Yes, you are correct.
Since the POSIX lz4 utility is not multi-threaded, compression has to wait 
while the internal buffer is being loaded, and loading cannot happen while 
compressing.
The best way to mitigate that effect is to use small buffers.
This is what is achieved using -B4.

A potentially interesting setting would be to use -B4D instead.
This creates chained 64KB blocks (instead of independent ones),
significantly increasing compression ratio.
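For example, adapting the first pipeline above (write8GB stands for the
hypothetical producer from the original report):
write8GB /dev/stdout | lz4 -B4D - output.lz4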

Make sure to use lz4 release r117+ though, since older versions had a bug in 
the chained-blocks mode. See:
http://code.google.com/p/lz4/issues/detail?id=127

Regards

Original comment by yann.col...@gmail.com on 13 May 2014 at 1:53

GoogleCodeExporter commented 9 years ago
I'm currently trying to reproduce your issue, but without success so far.

The following command line:
cat filename | lz4 > /dev/null
doesn't produce any significant difference between -B7 and -B4.

Is there a way I could reproduce your issue, to better study it? 
(Assuming I'm not installing Xen and an 8GB RAM VM to save its state; I need 
something more lightweight...)

Original comment by yann.col...@gmail.com on 14 May 2014 at 7:37

GoogleCodeExporter commented 9 years ago
I think the issue is that "cat filename" is simply too lightweight. The kernel 
performs read-ahead on files, I believe, so if cat becomes blocked at some 
point, it can produce new data very fast afterwards. With Xen, dumping a 
domU's memory, this does not seem to be the case.

Original comment by sven.koe...@gmail.com on 14 May 2014 at 10:47

GoogleCodeExporter commented 9 years ago
Just to be clear: is this performance issue basically requesting to implement 
multi-threading within the LZ4 command line utility?

Original comment by yann.col...@gmail.com on 11 Jun 2014 at 9:17

GoogleCodeExporter commented 9 years ago
The idea would be to have a thread that fills a ring buffer with data. The 
main thread would get its data from this ring buffer instead of reading from 
stdin or a file directly. It's easy to implement that; I'm an expert in 
pthread mutexes and conditions, but I haven't had time to implement and test 
it yet.
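
A minimal sketch of that idea, assuming POSIX threads; ring_t, reader_main 
and ring_read are hypothetical names for illustration, not part of the lz4 
sources:

```c
#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define SLOTS     4            /* a few blocks of look-ahead */
#define SLOT_SIZE (64 * 1024)  /* matches the -B4 block size */

typedef struct {
    char            buf[SLOTS][SLOT_SIZE];
    ssize_t         len[SLOTS];        /* bytes in each slot; 0 marks EOF */
    int             head, tail, count; /* ring indices and fill level */
    int             fd;                /* input file descriptor */
    pthread_mutex_t lock;
    pthread_cond_t  not_full, not_empty;
} ring_t;

/* Reader thread: keeps the ring as full as the producer allows. */
static void *reader_main(void *arg)
{
    ring_t *r = arg;
    for (;;) {
        pthread_mutex_lock(&r->lock);
        while (r->count == SLOTS)      /* wait for a free slot */
            pthread_cond_wait(&r->not_full, &r->lock);
        pthread_mutex_unlock(&r->lock);

        /* read() outside the lock so the consumer can drain meanwhile;
           only this thread ever advances r->head, so the slot is safe */
        ssize_t n = read(r->fd, r->buf[r->head], SLOT_SIZE);

        pthread_mutex_lock(&r->lock);
        r->len[r->head] = (n > 0) ? n : 0;
        r->head = (r->head + 1) % SLOTS;
        r->count++;
        pthread_cond_signal(&r->not_empty);
        pthread_mutex_unlock(&r->lock);
        if (n <= 0)
            return NULL;               /* EOF (or error) has been queued */
    }
}

/* Called by the compressing thread instead of read(); returns the number
   of bytes copied into dst, or 0 at end of input. */
static ssize_t ring_read(ring_t *r, char *dst)
{
    pthread_mutex_lock(&r->lock);
    while (r->count == 0)
        pthread_cond_wait(&r->not_empty, &r->lock);
    ssize_t n = r->len[r->tail];
    if (n > 0)
        memcpy(dst, r->buf[r->tail], (size_t)n);
    r->tail = (r->tail + 1) % SLOTS;
    r->count--;
    pthread_cond_signal(&r->not_full);
    pthread_mutex_unlock(&r->lock);
    return n;
}
```

The ring would be initialized with PTHREAD_MUTEX_INITIALIZER and 
PTHREAD_COND_INITIALIZER, the reader started with pthread_create(), and the 
compression loop would then call ring_read() wherever it previously called 
read().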

Unfortunately, it is not so easy to increase the size of the pipe that is 
hidden behind stdin or a FIFO, which would have been many times easier.
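
As a hedged aside: Linux 2.6.35 and later do expose fcntl() with 
F_SETPIPE_SZ, which can enlarge an existing pipe, though it is Linux-only and 
the kernel caps the size at /proc/sys/fs/pipe-max-size for unprivileged 
processes. A minimal sketch, assuming stdin is a pipe:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Ask for 1 MB; the kernel rounds the request up to a page multiple
       and returns the pipe's actual new capacity. */
    int newsz = fcntl(STDIN_FILENO, F_SETPIPE_SZ, 1024 * 1024);
    if (newsz < 0)
        perror("F_SETPIPE_SZ");    /* e.g. stdin is not a pipe */
    else
        printf("pipe capacity is now %d bytes\n", newsz);
    return 0;
}
```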

Original comment by sven.koe...@gmail.com on 15 Jun 2014 at 11:40

GoogleCodeExporter commented 9 years ago
Clear enough.
Unfortunately, my current multi-threading code is Windows-specific, not 
portable.
I have not spent time learning how to write portable multi-threading code.
I guess pthreads is likely the way to go, while also keeping the ability to 
generate single-threaded code for platforms unable to support pthreads.
Without external support, this objective will have to wait a bit.

Original comment by yann.col...@gmail.com on 16 Jun 2014 at 1:06

GoogleCodeExporter commented 9 years ago
Another potential way to answer such a request would be to default to a 64 KB 
block size when the lz4 utility is used in pure-pipe mode.
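
A hedged sketch of how detecting pure-pipe mode might look; is_pipe and 
blockSizeId are hypothetical names, not lz4's actual code:

```c
#include <sys/stat.h>
#include <unistd.h>

/* True when fd refers to a pipe or FIFO rather than a regular file. */
static int is_pipe(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 && S_ISFIFO(st.st_mode);
}

/* e.g.:
   if (is_pipe(STDIN_FILENO) && is_pipe(STDOUT_FILENO))
       blockSizeId = 4;   // default to 64 KB blocks in pure-pipe mode
*/
```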

Original comment by yann.col...@gmail.com on 1 Jul 2014 at 6:58

GoogleCodeExporter commented 9 years ago
In that case exactly one block fits into the pipe's buffer. I believe that will 
harm performance.

Original comment by sven.koe...@gmail.com on 1 Jul 2014 at 7:06

GoogleCodeExporter commented 9 years ago
harm?

Original comment by yann.col...@gmail.com on 1 Jul 2014 at 7:07

GoogleCodeExporter commented 9 years ago
"harm" if compared to having a larger buffer, I mean.
Certainly, as my benchmarks showed, using 64kB block size improved performance. 
But having a buffer that can hold multiple blocks can improve performance even 
further, IMHO.

Original comment by sven.koe...@gmail.com on 1 Jul 2014 at 7:10

GoogleCodeExporter commented 9 years ago

Original comment by yann.col...@gmail.com on 6 Jul 2014 at 8:19