lz4 cli should support sparse file

GoogleCodeExporter commented 9 years ago

Idea :
when decompressed stream contains lots of zeroes, 
the lz4 cli could generate a sparse file (instead of a full sized one) on File 
Systems which do support this feature.

This is a minor but interesting optimization.

Original discussion :
https://groups.google.com/forum/#!topic/lz4c/ExYLfHCY-UY

Original issue reported on code.google.com by yann.col...@gmail.com on 10 Feb 2015 at 2:33

GoogleCodeExporter commented 9 years ago

> I'm just a bit curious : is fseek() guaranteed to always produce zero-filled 
space, no risk of "garbage-filled" space instead ?

Strictly, no. For major platforms, yes. Practically, even if we'll support 
sparse-file by fseek(), we should remain current conservative code.

# C Standard
I think standard (C89,99,11) says nothing about it. Standard only guarantees 
ftell()/fseek() pair (See N1570 / 7.21.9.2 The fseek function).
So maybe it's platform/filesystem dependent.

# Major platforms
But the following major platforms/libraries supports 'zero-filled' behaviour.

POSIX : http://pubs.opengroup.org/onlinepubs/009695399/functions/fseek.html
> The fseek() function shall allow the file-position indicator to be set beyond 
the end of existing data in the file. If data is later written at this point, 
subsequent reads of data in the gap shall return bytes with the value 0 until 
data is actually written into the gap.

glibc : 
http://www.gnu.org/software/libc/manual/html_node/File-Position-Primitive.html
> If you set the position past the current end, and actually write data, you 
will extend the file with zeros up to that position.
I belive many embedded OS/platforms use gcc/glibc pair, or mimic glibc's 
behaviour.

FreeBSD : http://www.freebsd.org/cgi/man.cgi?query=fseek
> The fgetpos(), fsetpos(), fseek(), ftell(), and rewind() functions con-form 
to ISO/IEC 9899:1990 (``ISO C90'').
> The fseeko() and ftello() functions conform to IEEE Std 1003.1-2001 
(``POSIX.1'').
FreeBSD's fseek() internally uses fseeko(). So it also conforms POSIX.
http://svnweb.freebsd.org/base/head/lib/libc/stdio/fseek.c?view=markup

MacOSX / iOS : 
https://developer.apple.com/library/ios/documentation/System/Conceptual/ManPages
_iPhoneOS/man3/fseek.3.html
> The fseeko() and ftello() functions conform to IEEE Std 1003.1-2001 
(``POSIX.1'').
BSD family.

Windows / MSVC : https://msdn.microsoft.com/en-us/library/75yw9bf3.aspx
> The pointer can also be positioned beyond the end of the file.
It says nothing about what will happen when we set it beyond EOF. But since 
MSVC-crt just uses Win32 APIs, NTFS sparse file (or zero-filled file for old 
filesystems) will be created.
(See %ProgramFiles(x86)%\Microsoft Visual Studio 12.0\VC\crt\src\fseek.c : 
fseek() -> _lseek() -> SetFilePointerEx())

# Solution
Since C standard say nothing about fseek() beyond EOF, we should say it's 
platform dependent. But major platforms support zero-filled behaviour and 
sparse file.
So it would be nice to implement sparse file support as default but 
configurable part.

Original comment by takayuki...@gmail.com on 11 Feb 2015 at 2:01

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

Neil's benchmarks
https://groups.google.com/d/msg/lz4c/ExYLfHCY-UY/uXCG0R98IR8J

Based on Neil's benchmarks, we can estimate effect of native sparse file 
support.

(A) 215 sec : lz4 native
(B) 314 sec : lz4 -cd | cp
(C) 202 sec : lz4 -cd | cp --sparse=always

(B)-(A) = 99 sec : native vs pipe cost
(B)-(C) = 112 sec : --sparse=always gain
(A)-((B)-(C)) = 103 sec : (Estimated) lz4 native with sparse-file support

My impressions :

 - Neil was trying to compress 80GiB files in few minutes. Aside from using sparse file or not, I believe it's a good problem domain for lz4 cli.
 - Pipe is too heavy :(  But its 80GiB !
 - Almost 50% gain is huge. It's worth trying.

Original comment by takayuki...@gmail.com on 11 Feb 2015 at 2:07

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

@Yann, aside from this issue.
In MSVC crt, fseek(x, SEEK_CUR) seems quite problematic to use with large 
(>2GiB) file.
So we should change fseek() to _fseeki64() in lz4io.c

----
Notes for Windows / MSVC

(1) We must use DeviceIoControl() to set FSCTL_SET_SPARSE attributes explicitly.
(2) We must use _fseeki64(). MSVC's fseek() has 2GiB barrier.

in fseek.c : fseek() -> _fseek_nolock(), offset is converted SEEK_CUR to 
SEET_SET.
But during this calculation, offset is long (32bit signed int). It is 2GiB 
barrier.

int _fseek_nolock (FILE *str, long offset, int whence) {
    ...
        if (whence == SEEK_CUR) {
                offset += _ftell_nolock(stream); // 2GiB barrier
                whence = SEEK_SET;
        }
    ...
}

Same function in fseek64.c : _fseeki64() -> _fseeki64_nolock(). In this case, 
offset is __int64 (64bit signed int).

_fseeki64() is supported since Visual Studio 2005
https://msdn.microsoft.com/en-us/library/75yw9bf3%28v=vs.80%29.aspx

In Visual Studio 2005, _MSC_VER == 1400
http://stackoverflow.com/a/70630/2132223

Original comment by takayuki...@gmail.com on 11 Feb 2015 at 3:46

Added labels: ****
Removed labels: ****

Attachments:

win32-sparse-file.cpp

GoogleCodeExporter commented 9 years ago

This issue is greatly documented.
Many thanks Takayuki !

Original comment by yann.col...@gmail.com on 11 Feb 2015 at 5:59

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

It's worth noting from the xz code that the fiddly bit is dealing with zeros at 
the end of a file. You generally have to write at least the last byte to make 
sure that the length is correct. 

Additionally the trick only works for regular seek able files, so you have to 
check that the output is the right sort first (but can include stdout which may 
just be a regular file redirect rather than a pipe).

However I would like a switch that enables the behaviour on any seek able 
device - since my use case targets block devices directly and I can easily 
guarantee they are zero filled before hand.

Original comment by n...@aldur.co.uk on 11 Feb 2015 at 7:31

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

Useful discussion here on testing a buffer for zero: 
http://stackoverflow.com/questions/1493936/faster-approach-to-checking-for-an-al
l-zero-buffer-in-c

Fastest is to use the memcmp trick with 64 bit integers - assuming you can get 
around the alignment and non-integer buffer size issues.

Original comment by n...@aldur.co.uk on 11 Feb 2015 at 7:33

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

I've implemented experimental version :
https://github.com/t-mat/lz4/tree/gc-issue/155

# Build

To build this experimental version, define 'LZ4IO_ENABLE_SPARSE_FILE' like the 
following command :

    make lz4programs 'CFLAGS=-O3 -DLZ4IO_ENABLE_SPARSE_FILE=1'
    ./programs/lz4 -h

You will see "EXPERIMENTAL_SPARSE_FILE" as lz4 revision :

    "*** LZ4 command line interface 64-bits EXPERIMENTAL_SPARSE_FILE, by Yann Collet (...) ***"

# Experiment

This experimental version adds option "-x" for sparse file for decompression.
You can use this option like this :

    ./programs/lz4 -9 -f my-file
    ./programs/lz4 -d -f -x my-file.lz4 my-file.lz4.out
    cmp my-file my-file.lz4.out

Original comment by takayuki...@gmail.com on 12 Feb 2015 at 6:49

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

@Neil
Since my implementation is just a proof-of-concept, I'm glad to see yours :)

Original comment by takayuki...@gmail.com on 12 Feb 2015 at 7:36

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

If I may I will take yours and enhance it. I prefer to stand on the
shoulders of giants.

Original comment by n...@aldur.co.uk on 12 Feb 2015 at 8:25

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

Great work Takayuki !

Original comment by yann.col...@gmail.com on 12 Feb 2015 at 8:26

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

Benchmark results : https://gist.github.com/t-mat/6babbbc9edf726fbaf9c
Perhaps it's compiler dependent, but GNU coreutil's is_nul() is surprisingly 
good !

Original comment by takayuki...@gmail.com on 12 Feb 2015 at 12:07

Added labels: ****
Removed labels: ****

Attachments:

benchmarks.txt

GoogleCodeExporter commented 9 years ago

is_nul needs a sentinel after the buffer to work properly. Not that
that is difficult to create.

Original comment by n...@aldur.co.uk on 12 Feb 2015 at 1:03

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

@Neil
Thanks. I've fixed it : 
https://github.com/t-mat/lz4/commit/0fc157798abca49f5bcdc2c34fdaab013038e474
Could you confirm my benchmark ?  I'm still not sure about my result is my 
envionment dependent or not.

Original comment by takayuki...@gmail.com on 12 Feb 2015 at 1:47

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

@Yann,

First of all, thanks for the effort. I've read recent "sparseFile" branch.
Maybe the following issues are known. I wrote this as future check list.

(1) For Windows platform, additional DeviceIoControl() is needed.
https://github.com/t-mat/lz4/blob/gc-issue/155/programs/lz4io.c#L102
https://github.com/t-mat/lz4/blob/gc-issue/155/programs/lz4io.c#L112
https://github.com/t-mat/lz4/blob/gc-issue/155/programs/lz4io.c#L789

(2) Currently, we can't "concatenate" sparse area larger than block size.
https://github.com/Cyan4973/lz4/compare/ceec6fa849...e38c268b5a#diff-abb30b1425d
c9e3e5441d294b2f7e909R611

(3) It seems slower than GNU coreutil's is_nul(). Some benchmarks are needed.
https://github.com/Cyan4973/lz4/compare/ceec6fa849...e38c268b5a#diff-abb30b1425d
c9e3e5441d294b2f7e909R609

Original comment by takayuki...@gmail.com on 11 Mar 2015 at 9:26

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

Great review, thanks Takayuki !

I plan to work on point 2, so it's likely going to be improved in a next 
release.

Thanks for pointing DeviceIoControl() requirement on Windows, it totally evaded 
my reading.

Point 3 looks a bit difficult to measure. I currently have no idea on how to 
bench such impact reliably.
There is also a small difference in scope :
is_nul() expects the whole input buffer to be null,
but the little loop inside current work version can stop anytime within the 
block, meaning we are still going to skip whatever amount of zero data has been 
detected, even if the buffer is not entirely null...

Original comment by yann.col...@gmail.com on 11 Mar 2015 at 10:25

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

The new version at https://github.com/Cyan4973/lz4/tree/sparseFile has improved 
sparse support, featuring better performance, by eliminating restriction 2. 
Using 64KB blocks, the new version generates files more than twice smaller than 
the previous one.

The funny thing is that, the new version is now much better than cp 
--sparse=always

Here is an example.
The current chain of test uses datagen to generate sparse data.
So when doing :
./datagen -g50M -P100 > sparse.bin
cp --sparse=always sparse.bin sparsecp
./lz4 -B4 sparse.bin | ./lz4 -dX > sparselz4
ls -ls sparse*

51200 -rw-r--r-- 1 yann yann 52428800 mars  13 14:53 sparse.bin
 5636 -rw-r--r-- 1 yann yann 52428800 mars  13 14:53 sparsecp
 1796 -rw-r--r-- 1 yann yann 52428800 mars  13 14:54 sparselz4

As can be seen, the number of allocated KB is 3x smaller using lz4 -dX as 
compared with cp --sparse=always. I'm not sure why, but it's good news.

(yes, I checked with diff, and there is no problem with sparselz4 :
diff -s sparse.bin sparselz4
Files sparse.bin and sparselz4 are identical
)

The current algorithm limitation is that it only tests zero-filled segment at 
the beginning of each block. Consequently, larger block produce larger files, 
because later zero-filled segments are not tested :

ls -ls sparse*

51200 -rw-r--r-- 1 yann yann 52428800 mars  13 14:53 sparse.bin
 1796 -rw-r--r-- 1 yann yann 52428800 mars  13 14:54 sparselz4B4
 9640 -rw-r--r-- 1 yann yann 52428800 mars  13 15:05 sparselz4B5
27988 -rw-r--r-- 1 yann yann 52428800 mars  13 15:05 sparselz4B6
46104 -rw-r--r-- 1 yann yann 52428800 mars  13 15:05 sparselz4B7

The difference is fairly noticeable.
It pushes in favor of using B4 (or B4D) with sparse support enabled.
Note however that current default mode is B7.

Which means, there are 2 solutions :
- B4D will become default mode (which I plan to do in a later version, so 
basically in conjunction with making sparse support enabled by default), and 
the inefficiency at B7 setting is not considered important anymore, because 
basically no one will continue using B7
- The algorithm needs to be refined to address this weakness. There is a risk 
of increased CPU load in this process, so a fine balance must be found.

For your comments

Original comment by yann.col...@gmail.com on 13 Mar 2015 at 2:13

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

I finally decided to tackle the issue with difference in performance depending 
on block size, by simply defining sub-blocks.
The result is now pretty good (uploaded at sparseFile branch) :

ls -ls sparse*
51200 -rw-r--r-- 1 yann yann 52428800 mars  13 20:09 sparse.bin
  552 -rw-r--r-- 1 yann yann 52428800 mars  13 20:28 sparselz4B4
  552 -rw-r--r-- 1 yann yann 52428800 mars  13 20:29 sparselz4B5
  552 -rw-r--r-- 1 yann yann 52428800 mars  13 20:29 sparselz4B6
  552 -rw-r--r-- 1 yann yann 52428800 mars  13 20:29 sparselz4B7

Identical result with all block sizes, and improved performance.
Obviously, the performance is excellent here because the generated file is 
*very* sparse.
Nonetheless, at 10x the performance of cp --sparse=always, I figure it's good 
enough.

It's even able to get some small parse segments from larger archives by 
accident, such as, for example, silesia.tar :

206984 -r--r--r-- 1 yann yann 211948032 juin  13  2013 silesia.tar
206932 -rw-r--r-- 1 yann yann 211948032 mars  13 20:38 silesia.tar.sparse

It's quite minor, but worth noting, especially as cp --sparse=always doesn't 
get same success.

Anyway, it still leaves on the table point 1 and 3 :

1) Windows support
Should be straightforward, it's just necessary to follow Takayuki's 
instructions.

2) Performance measurement
This one is much more tricky.
It's correct to say I've not been able to get something meaningful so far.
Everything is so fast.
At best, I can merely tell that I'm unable to witness any slowdown introduced 
by enabling sparse support, even on non-sparse files. But that's not a great 
statement.

Original comment by yann.col...@gmail.com on 13 Mar 2015 at 7:49

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

minor detail, but I note that tar is using -S for sparse file support.
Problem being : I'm using -Sx for stream checksum disabling...

Original comment by yann.col...@gmail.com on 13 Mar 2015 at 8:15

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

Windows sparse file support has just been added, following your instructions 
Takayuki.

It works.

For some reason though, it's less efficient than Linux.
The 50MB sparse file I'm using for test goes down to 6.18MB,
which is respectably less, but still a lot more than the 550KB of Linux.

I was surprised because NTFS block size is 4 KB, so it should be able to use 
that kind of granularity.

It seems not. My understanding is that Windows will only "not allocated" clean, 
aligned blocks of 64 KB zeros, leaving a ton of smaller opportunities on the 
table. Hence inflated results compared to Linux ext4.

But well, on the plus side, that's nonetheless an improvement.

Original comment by yann.col...@gmail.com on 13 Mar 2015 at 9:20

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

# Performance measurement

Revamped my benchmark : https://gist.github.com/t-mat/47890e581bc624a4fa3f
New dev branch (8a87769af8) is fastest :)

# cp --sparseAlways

I've investigated it. And I learned :
GNU cp's is_nul() returns boolean value.
Since it just checks "Buffer is entirely zero or not", spase area is always 
quantized to buffer size.

Call tree is :
 coreutils/src/cp.c : main() -> do_copy()
 -> copy.c : copy() -> copy_internal() -> extent_copy() or copy_reg() -> sparse_copy()
 -> system.h : is_nul()

# Option symbols

Perhaps, it is good time to introduce long options like "--sparse" for advanced 
or minor switches.
Currently lz4cli has 18 options and it is still increasing, we'll soon see 
uniqueness problem.
So define only long options for now, and wait for increasing of demand to short 
option.

Original comment by takayuki...@gmail.com on 15 Mar 2015 at 3:44

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

Thanks for testings Takayuki.
It gives some good reference to compare lz4 implementation against.

Regarding option symbols :
I feel you are right.
Maybe it's time to introduce long symbols.
I'll open an entry for this objective.

Original comment by yann.col...@gmail.com on 15 Mar 2015 at 7:11

Added labels: ****
Removed labels: ****

boludoz / lz4

lz4 cli should support sparse file #155