Closed: KBentley57 closed this issue 5 years ago
> Can anyone comment on why I may be seeing a significant increase in decompression speed?

Did you mean an increase in decompression time (i.e., slower)?
Can you include the compilation flags you're using and what CPU you're seeing this on?
> Did you mean an increase in decompression time (i.e., slower)?

Yes, sorry for the mixed wording. It takes longer to decompress the same file with 1.4.3 than it does with 1.4.0. The times listed in that table are measured in seconds.

> Can you include the compilation flags you're using and what CPU you're seeing this on?

I'm building zstd with zlib (1.2.11) and lzma (5.2.4), in combination with many other parts of code, via a CMake super-build. The output I'm seeing is the normal "Release" build flags in the logs:
```
# compile C with /opt/rh/devtoolset-8/root/usr/bin/gcc
C_FLAGS = -std=c99 -Wall -Wextra -Wundef -Wshadow -Wcast-align -Wcast-qual -Wstrict-prototypes -O2 -DNDEBUG
C_DEFINES = -DXXH_NAMESPACE=ZSTD_ -DZSTD_GZCOMPRESS -DZSTD_GZDECOMPRESS -DZSTD_LEGACY_SUPPORT=0 -DZSTD_LZMACOMPRESS -DZSTD_LZMADECOMPRESS -DZSTD_MULTITHREAD
```
Concerning the system, it's an older Xeon, lacking AVX or AVX2. This is the output of $ cat /proc/cpuinfo:
```
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
microcode : 0x1f
cpu MHz : 2394.248
cache size : 12288 KB
physical id : 0
siblings : 8
core id : 10
cpu cores : 4
apicid : 21
initial apicid : 21
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid dtherm ida arat spec_ctrl intel_stibp flush_l1d
bogomips : 4788.49
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
```
I'd be happy to provide more info, just let me know what's needed.
Thanks,
Kyle
A little-known feature of zstd's internal benchmark is that it can benchmark decompression speed alone. For that, you'll need to load a `*.zst` compressed file and use the command `-b -d`. This is useful when measuring decompression speed on files compressed at high compression levels, as compression times can be punishing, especially on large files.
The advantage is that the in-memory benchmark is free of I/O side effects, which can dominate results at high speed, and it uses a very precise timer.
I made a comparison of v1.4.0 vs v1.4.3 decompression speeds using this technique on a desktop system with a Core i7-9700K, compiling with gcc v8.3.0 and using the `-O2` optimization flag (trying to reproduce @KBentley57's scenario; the default is `-O3`). The files are individual components of the silesia corpus, compressed at level 17.
file | v1.4.0 | v1.4.3 | diff |
---|---|---|---|
dickens | 792 | 795 | 0.38% |
mozilla | 837 | 823 | -1.67% |
mr | 753 | 758 | 0.66% |
nci | 2183 | 2228 | 2.06% |
ooffice | 619 | 614 | -0.81% |
osdb | 1057 | 1050 | -0.66% |
reymont | 982 | 990 | 0.81% |
samba | 1314 | 1308 | -0.46% |
sao | 626 | 628 | 0.32% |
webster | 925 | 940 | 1.62% |
xml | 1848 | 1845 | -0.16% |
x-ray | 465 | 466 | 0.22% |
silesia.tar | 950 | 954 | 0.42% |
The results are so-so, aka the speed benefits are not clearly present. It's roughly the same, maybe very slightly faster (mostly for nci).
This surprised me, so I re-ran the test using `-O3` (the default):
file | v1.4.0 | v1.4.3 | diff |
---|---|---|---|
dickens | 729 | 775 | 6.31% |
mozilla | 795 | 827 | 4.03% |
mr | 698 | 741 | 6.16% |
nci | 2172 | 2236 | 2.95% |
ooffice | 583 | 611 | 4.80% |
osdb | 991 | 1046 | 5.55% |
reymont | 918 | 971 | 5.77% |
samba | 1242 | 1300 | 4.67% |
sao | 589 | 631 | 7.13% |
webster | 861 | 923 | 7.20% |
xml | 1774 | 1841 | 3.78% |
x-ray | 429 | 456 | 6.29% |
silesia.tar | 908 | 948 | 4.41% |
Now we are talking. The gains are more visible. It's a bit short of the advertised 7%, but the gain is definitely there.
But there is a bit more to it: compare both tables. Moving from `-O2` to `-O3` is not a gain. Here is a comparison for v1.4.3:
file | -O2 | -O3 | diff |
---|---|---|---|
dickens | 795 | 775 | -2.52% |
mozilla | 823 | 827 | 0.49% |
mr | 758 | 741 | -2.24% |
nci | 2228 | 2236 | 0.36% |
ooffice | 614 | 611 | -0.49% |
osdb | 1050 | 1046 | -0.38% |
reymont | 990 | 971 | -1.92% |
samba | 1308 | 1300 | -0.61% |
sao | 628 | 631 | 0.48% |
webster | 940 | 923 | -1.81% |
xml | 1845 | 1841 | -0.22% |
x-ray | 466 | 456 | -2.15% |
silesia.tar | 954 | 948 | -0.63% |
It's actually rather a loss! Which means, transitively, that it must have been worse for v1.4.0:
file | -O2 | -O3 | diff |
---|---|---|---|
dickens | 792 | 729 | -7.95% |
mozilla | 837 | 795 | -5.02% |
mr | 753 | 698 | -7.30% |
nci | 2183 | 2172 | -0.50% |
ooffice | 619 | 583 | -5.82% |
osdb | 1057 | 991 | -6.24% |
reymont | 982 | 918 | -6.52% |
samba | 1314 | 1242 | -5.48% |
sao | 626 | 589 | -5.91% |
webster | 925 | 861 | -6.92% |
xml | 1848 | 1774 | -4.00% |
x-ray | 465 | 429 | -7.74% |
silesia.tar | 950 | 908 | -4.42% |
Yes, it was worse.
Conclusions:

- `-O3` is actually a bad setting (for decompression speed and gcc v8.3.0).
- It could be different for other compilers (such as `clang`), or for other compiler versions. That's a mess. Not being able to rely on `-O3` > `-O2` makes the situation a lot more complex.
- v1.4.3 decompression speed is roughly the same between `-O2` and `-O3`. That means that the most important contribution was to prevent the compiler from doing too much harm by trying too hard to be clever.
These experiments explain why, at the `-O2` setting, there is no perceived benefit between v1.4.0 and v1.4.3, but they don't explain why @KBentley57's experiment shows a sizable loss of performance.
I would suggest trying `-b -d` on your platform, to see if it reproduces the issue.
If it does, we will have to look into the library and find a sample which reproduces the issue.
If it doesn't, then the issue could be in the CLI instead, or in I/O conditions.
FYI, I tried to time the CLI decompression performance on my test platform, but could not reproduce any sensible difference so far (measurement noise was higher than any potential difference between v1.4.0 and v1.4.3).
@Cyan4973
That is some great insight! Thank you for looking into it so thoroughly.
After I had given it a little more thought, I was questioning why I wasn't compiling it at O3, instead of O2, to take advantage of vectorization. I'm glad you tested it as well. I will try that tomorrow, alongside gcc-{6,7,8} with O{2,3} and see if I can't help pin down the issue. I'm glad you reminded me of the benchmark mode. I knew it was in there, but it completely slipped my mind when I was testing this morning. I'll post a representative file too.
Thanks,
Kyle
I think it's probable that for GCC and clang, vectorization is almost entirely bad for zstd performance. It certainly was in every instance I looked at, but I stopped short of disabling it completely. Compiler vectorization introduces high startup costs for the loop (checking for length, overlap) that have to be amortized against an assumed high trip count for the loop. In the case of zstd decoding, that average trip count is actually very low, most likely 1.
I turned off auto-vectorization for decoding in PR1668 and replaced it with a hand-vectorized version that I wrote with processors >= Sandy Bridge (2012) in mind, for which 16-byte operations are not meaningfully more expensive than 8-byte ones. It is possible that for an E5620, which is a bit older, that assumption is not valid.
I had some time to run a few tests on my laptop at home; I'm afraid I haven't made the time for it yet at work. The results are interesting, and are similar to what I observed with different data.
I used a representative sample of data that is about 18 MB (it's an OpenVDB grid, for the curious), and put it in /dev/shm again. I compiled 1.4.0 and 1.4.3 under the Release build, which adds O3, and under RelWithDebInfo, which adds O2. My tests show that 1.4.0 compiled under O3 beats 1.4.3 under O3 by about 2-3% across the board, and that 1.4.3 O2 loses to 1.4.0 O2 at levels 1-6 but beats it at levels >= 7.
Specs: Debian 10, Intel Core i7-5600U, GCC 8.3.0-6
First up are the compression tests. These don't really affect me, but here are the results for the sake of completeness. It's worth noting that, as suspected, O2 beats O3 in some cases.
The decompression test is next. Note that here I'm not timing the unzstd binary as in the original post, but using the results of the internal benchmark, as suggested. Levels 1-19 were tested with the command shown on the plot. This is a single run, not a statistical analysis. Note that the Y-axis doesn't start at 0; don't be misled by the heights of the bars.
To highlight the differences, here's a plot of the percent difference between 1.4.0 and 1.4.3 when compiled under O3. The mean value is 2% in favor of 1.4.0, but for the most part, the difference is in the 3-4% range.
I'll do my best to carve out a few minutes at work to try this again, but I think the results here make the case that at least some reconsideration ought to be given to the hand-rolled vectorized loop in the 1.4.x patch, as this is a pretty modern CPU.
Here's the data in tabular form.
Compression
--
Level | 1.4.0 O3 | 1.4.0 O2 | 1.4.3 O3 | 1.4.3 O2
---|---|---|---|---
1 | 336.200 | 313.400 | 329.200 | 332.100
2 | 255.600 | 242.200 | 255.500 | 255.900
3 | 139.800 | 140.900 | 140.000 | 144.600
4 | 103.700 | 103.400 | 114.900 | 120.400
5 | 42.400 | 37.900 | 42.600 | 38.100
6 | 29.200 | 26.400 | 28.500 | 25.600
7 | 27.700 | 25.400 | 27.300 | 24.400
8 | 26.700 | 24.400 | 25.500 | 23.400
9 | 25.100 | 22.700 | 23.600 | 22.000
10 | 20.100 | 18.200 | 18.800 | 17.700
11 | 19.700 | 18.300 | 18.600 | 17.500
12 | 18.900 | 17.800 | 18.100 | 17.100
13 | 13.500 | 13.100 | 13.400 | 13.000
14 | 13.500 | 13.100 | 13.300 | 13.000
15 | 11.800 | 11.400 | 11.400 | 11.200
16 | 10.110 | 9.900 | 9.970 | 9.230
17 | 7.310 | 7.090 | 7.210 | 6.990
18 | 5.950 | 5.850 | 5.890 | 5.720
19 | 5.000 | 4.860 | 4.920 | 4.840
Decompression
--
Level | 1.4.0 O3 | 1.4.0 O2 | 1.4.3 O3 | 1.4.3 O2
---|---|---|---|---
1 | 775.200 | 755.600 | 777.600 | 754.200
2 | 753.100 | 733.400 | 752.400 | 713.600
3 | 743.300 | 710.200 | 719.900 | 693.500
4 | 735.000 | 690.100 | 710.200 | 684.000
5 | 719.200 | 683.400 | 693.600 | 676.900
6 | 720.000 | 685.900 | 699.900 | 685.100
7 | 735.600 | 696.000 | 711.600 | 701.000
8 | 733.900 | 693.600 | 710.300 | 703.600
9 | 734.100 | 695.300 | 708.300 | 708.900
10 | 723.100 | 689.000 | 702.100 | 705.400
11 | 714.600 | 685.800 | 698.400 | 703.700
12 | 717.300 | 687.100 | 698.600 | 705.500
13 | 718.500 | 691.400 | 704.300 | 709.800
14 | 715.000 | 689.400 | 702.600 | 709.200
15 | 721.700 | 692.000 | 709.200 | 717.200
16 | 712.300 | 683.600 | 697.700 | 705.800
17 | 610.700 | 609.600 | 615.200 | 625.300
18 | 455.100 | 464.500 | 453.900 | 463.300
19 | 417.200 | 428.500 | 427.400 | 439.100
Comparison
--
Level | O3 Delta
---|---
1 | -0.31%
2 | 0.09%
3 | 3.25%
4 | 3.49%
5 | 3.69%
6 | 2.87%
7 | 3.37%
8 | 3.32%
9 | 3.64%
10 | 2.99%
11 | 2.32%
12 | 2.68%
13 | 2.02%
14 | 1.76%
15 | 1.76%
16 | 2.09%
17 | -0.73%
18 | 0.26%
19 | -2.39%
The complete CPU specs. I'd also be interested to see how the recent Intel bug fixes affect this.
```
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
stepping : 4
microcode : 0x2d
cpu MHz : 1335.493
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap intel_pt xsaveopt dtherm ida arat pln pts md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips : 5188.22
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
```
I ran the same test as above on my work PC. The results are a little different than I was expecting. The data is identical to the case above, but different from that in the original post. The results here show 1.4.0 and 1.4.3 being closer in performance than the comparison on the newer PC. Here are the decompression results for discussion.
1.4.0 O3 beats or ties 1.4.3 O3 across a large range of levels, but not by a lot. Again, check the y-axis. Besides providing good evidence for an upgrade of my development rig, there's not much going on. It does run contrary to the reported gains, though.
Shown here is the percent difference. The lead is small or none in more cases than not.
The specs of the PC are the same as in the first few posts.
Thanks @KBentley57, that's interesting indeed; it shows the picture is not that clear.
I'm also impressed by the very large drop in decompression speed between levels 16 and 19. That's something I'm not used to, though it could be sample-specific.
edit: or maybe it's related to the 4 MB cache size of the target CPU, since higher levels increase the window size up to 8 MB, resulting in more cache misses.
@Cyan4973 I want to stress that the axes could be misleading at first glance: the decompression speed doesn't approach zero, but in both cases drops to just a little over half the maximum speed of the other levels.
I'm a little perplexed, however. On one hand it doesn't really matter, since they're so close; but on my laptop there's clearly a performance difference between the two versions, at least for my type of data. On the other hand, the accumulation of small time/energy savings is significant over the course of a large Monte Carlo run consisting of > 1 million trials. Am I looking too hard at what is likely an unpredictable quantity?
I saw that there were a few commits on the PR page. Have you done any investigations into this? Not pushing, just asking if anything obvious poked its head out.
Thanks,
Kyle
No, unfortunately, no easy conclusion here. It's likely worth an investigation, so we'll start one.
@KBentley57 what is the compression ratio of your file?
I'm going to start the following benchmark on my machine, which produces very stable benchmark results, and I want to make sure to include a representative file.
```sh
#!/usr/bin/env sh
PROG="$0"
USAGE="$PROG 0zstd 1zstd FILES..."
PREFIX="taskset --cpu-list 0"

ZSTD0="$1"
shift
ZSTD1="$1"
shift

if [ "x$ZSTD0" == "x" ]; then
    echo $USAGE
    exit 1
fi
if [ "x$ZSTD1" == "x" ]; then
    echo $USAGE
    echi 1
fi

levels=$(seq 1 19)

echo "Compressing each file with each level"
for file in $@; do
    for level in $levels; do
        ofile="$file.zst.$level"
        if [ ! -f "$ofile" ]; then
            $ZSTD1 "$file" -$level -o "$file.zst.$level"
        fi
    done
done

echo "Disabling Turbo"
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

for file in $@; do
    echo "Benchmarking on $file"
    for ZSTD in $ZSTD0 $ZSTD1; do
        echo $ZSTD
        for level in $levels; do
            $ZSTD -b -d "$file.zst.$level"
        done
    done
done

echo "Enabling Turbo"
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
```
The difference in decompression speed between levels 16, 17, 18, and 19 can be explained by the minimum match length.
There must be somewhat more 4-byte matches than 5-byte matches, and a lot more 3-byte matches.
@KBentley57 would it be possible to generate an OpenVDB file containing data that you can share, and that shows the same performance regression? We suspect it is either the file or the CPU specs, but we're not sure which. Having the file would really help us narrow down the issue.
@Cyan4973 Thanks, I'll do what I can to help out, if it's needed.
@terrelln I am working on getting the OK for that right now. I can't promise it'll be tomorrow, but by Friday afternoon I should have a few test cases that you can try. Generally, we see compression ratios of anywhere from 1.4 and up. That depends on a few parameters, but I'd put it in the range (1.4, 2.0).
All,
Sorry for the delay. While I'm afraid I can't provide the actual data file yet, OpenVDB has many sample voxel files that are similar enough to my use case that I think they should be sufficient. Here is one that is nearly identical in file size and roughly the same density, etc.
There are a few differences, notably that I'm using HalfFloat (an Industrial Light & Magic (ILM) component of OpenVDB), so the leaf nodes in this structure would be twice as large. I'm not certain if that matters. They are storing two grids; their second grid is a vec3.
Here are a few of my numbers on the old work machine. It remains the case that 1.4.0 O2 beats 1.4.3 O3 in a significant number of cases, though not by much. Under O2, 1.4.3 wins a few, loses a few, etc.
Here's a look at the differences between the two version when compiled under O2 vs O3 (1.4.0 - 1.4.3).
@terrelln I effectively ran your script and experienced similar results. BTW, I'm not sure if it was a copy/paste error, but there's a typo in the second `if [ ... ] fi` block, where `echi` should be `exit`.
And finally, the tabular data in case it reveals anything more -
Level | 1.4.0 O3 | 1.4.0 O2 | 1.4.3 O3 | 1.4.3 O2 |
---|---|---|---|---|
1 | 2592.3 | 2649.3 | 2578.2 | 2561.8 |
2 | 2348.3 | 2396.2 | 2355.3 | 2355.3 |
3 | 2204.3 | 2226.3 | 2218 | 2116.2 |
4 | 2174.1 | 2174.1 | 2152.8 | 2187.9 |
5 | 2085 | 2106.5 | 2053.3 | 1980.8 |
6 | 2044.3 | 2085 | 2064 | 2104.3 |
7 | 2209.5 | 2263.7 | 2218 | 2236.5 |
8 | 2150.9 | 2302.3 | 2256.8 | 2294.9 |
9 | 2302.9 | 2325.1 | 2287.3 | 2341.2 |
10 | 2271.8 | 2294.9 | 2264.7 | 2287.3 |
11 | 2248.8 | 2264.7 | 2264.7 | 2240.6 |
12 | 2294.9 | 2302.3 | 2279.6 | 2295.5 |
13 | 2317.8 | 2325.1 | 2348.3 | 2372 |
14 | 2302.3 | 2336 | 2355.3 | 2372 |
15 | 2325.1 | 2355.3 | 2378.8 | 2409.3 |
16 | 2226.3 | 2240.6 | 2287.3 | 2256.8 |
17 | 2054.6 | 2073.8 | 2125.6 | 2052.2 |
18 | 1358.8 | 1405.3 | 1402.9 | 1280.9 |
19 | 1465.1 | 1495.6 | 1421.3 | 1433.4 |
We have a project to look more into these `-O2` / `-O3` differences.
It should start soon, and will give us better visibility on what's going on and what the best settings are.
On the specific issue of the performance comparison between 1.4.0 and 1.4.3, the topic is becoming obsolete, because the next version, v1.4.4, features substantial changes to the decompression algorithm, resulting in dramatically better speed. As a consequence, the code already merged in the `dev` branch is no longer comparable.
I would suggest testing the current code in `dev` to ensure that it does indeed provide better performance for your setup too.
@Cyan4973
I'll give that a look sometime soon. If that's the case, should this issue be marked as closed? I don't want it to run on forever now that it has fulfilled its purpose.
Thanks,
Kyle
All,
I was excited to test the new release (1.4.3) in our company's code, expecting the gains touted in the release notes (7% average, if that is correct) compared to the version I'm using, 1.4.0. I made a sample file and compressed it with a few standard options, using level 17 for compression, as I've found the best ratios at that level for my data.
The file was compressed with zstd built from source at version 1.4.0, with GCC 8 on CentOS 7, using the same flags for both versions. The data size is about 365 MB uncompressed; compressed, the file is around 215 MB. I put the file in /dev/shm in an attempt to isolate the I/O, and ran a simple script to time the decompression of the file, delete the uncompressed output, and repeat. The time was reported as the `real` output of the bash `time` command. The descriptive statistics for the two experiments are summarized in the table below.
Can anyone comment on why I may be seeing a significant increase in decompression time? The difference is on the order of 10%. I'm afraid I cannot share the file being compressed, but it seems somewhat immaterial.