facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Performance regression (decompress) from 1.4.0 -> 1.4.3 #1758

Closed KBentley57 closed 5 years ago

KBentley57 commented 5 years ago

All,

I was excited to test the new release (1.4.3) in our company's code, expecting the gains touted in the release notes (7% average, if that is correct) compared to the version I'm using, 1.4.0. I made a sample file and compressed it with a few standard options, using level 17 for compression, as I've found the best ratios at that level for my data.

The file was compressed with zstd built from source using version 1.4.0, with GCC 8 on CentOS 7, using the same flags for both versions. The data is about 365 MB uncompressed; compressed, the file is around 215 MB. I put the file in /dev/shm in an attempt to isolate the I/O, and ran a simple script to time the decompression of the file, delete the uncompressed output, and repeat. The time was taken from the "real" output of the bash time command. The descriptive statistics for the two experiments are summarized in the table below.

Can anyone comment on why I may be seeing a significant increase in decompression time? The increase is on the order of 10%. I am afraid I cannot share the file that was being compressed, but it seems somewhat immaterial.
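A minimal sketch of the kind of timing loop described above (the original script was not posted; the zstd invocation, paths, and run count here are assumptions):

```shell
# bench: run a command N times and report total wall-clock time in seconds
bench() {
    cmd="$1"
    runs="$2"
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$runs" ]; do
        sh -c "$cmd" > /dev/null 2>&1
        i=$((i + 1))
    done
    end=$(date +%s)
    echo "total: $((end - start))s over $runs runs"
}

# e.g., time 1000 decompressions of a file staged in /dev/shm:
# bench 'zstd -d -f /dev/shm/sample.zst -o /dev/shm/out && rm -f /dev/shm/out' 1000
```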

Statistic | 1.4.0 | 1.4.3
Mean | 2.153 | 2.358
Standard Error | 0.003 | 0.002
Mode | 2.143 | 2.365
Median | 2.151 | 2.369
First Quartile | 2.136 | 2.364
Third Quartile | 2.176 | 2.375
Variance | 0.007 | 0.003
Standard Deviation | 0.081 | 0.052
Kurtosis | 13.845 | 29.457
Skewness | 2.017 | 1.228
Range | 0.934 | 0.807
Minimum | 1.977 | 2.206
Maximum | 2.911 | 3.013
Sum | 2153.070 | 2358.416
Count | 1000 | 1000
Cyan4973 commented 5 years ago

Can anyone comment on why I may be seeing a significant increase in decompression speed?

Did you mean an increase in decompression time (aka, slower) ?

felixhandte commented 5 years ago

Can you include those compilation flags you're using and what CPU you're seeing this on?

KBentley57 commented 5 years ago

Can anyone comment on why I may be seeing a significant increase in decompression speed?

Did you mean an increase in decompression time (aka, slower) ?

Yes, sorry for the mixed wording. It takes longer to decompress the same file with 1.4.3, than it does with 1.4.0. The times listed in that table are measured in seconds.

Can you include those compilation flags you're using and what CPU you're seeing this on?

I'm building zstd with zlib (1.2.11) and lzma (5.2.4), in combination with many other pieces of code, via a cmake super-build. The output I'm seeing shows the normal "Release" build flags in the logs,

# compile C with /opt/rh/devtoolset-8/root/usr/bin/gcc
C_FLAGS =  -std=c99 -Wall -Wextra -Wundef -Wshadow -Wcast-align -Wcast-qual -Wstrict-prototypes -O2 -DNDEBUG  

C_DEFINES = -DXXH_NAMESPACE=ZSTD_ -DZSTD_GZCOMPRESS -DZSTD_GZDECOMPRESS -DZSTD_LEGACY_SUPPORT=0 -DZSTD_LZMACOMPRESS -DZSTD_LZMADECOMPRESS -DZSTD_MULTITHREAD

Concerning the system, it's an older Xeon, lacking AVX or AVX2. This is the output of $ cat /proc/cpuinfo:

vendor_id   : GenuineIntel
cpu family  : 6
model       : 44
model name  : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping    : 2
microcode   : 0x1f
cpu MHz     : 2394.248
cache size  : 12288 KB
physical id : 0
siblings    : 8
core id     : 10
cpu cores   : 4
apicid      : 21
initial apicid  : 21
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid dtherm ida arat spec_ctrl intel_stibp flush_l1d
bogomips    : 4788.49
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

I'd be happy to provide more info, just let me know what's needed.

Thanks,

Kyle

Cyan4973 commented 5 years ago

A little-known feature of zstd's internal benchmark is that it can benchmark decompression speed only. For that, you'll need to load a *.zst compressed file, and use the command -b -d. This is useful when measuring decompression speed on files compressed with high compression levels, as the compression times can be punishing, especially on large files.

The advantage is that the in-memory benchmark is free of I/O side-effects, which can dominate results at high speed, and uses a very precise timer.
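Concretely, the workflow looks like this (the file name is a placeholder):

```shell
zstd -17 sample.bin -o sample.bin.zst   # compress once at level 17
zstd -b -d sample.bin.zst               # in-memory decompression benchmark
```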

I compared v1.4.0 and v1.4.3 decompression speeds using this technique, on a desktop system with a Core i7-9700K, compiling with gcc v8.3.0 and the -O2 optimization flag (trying to reproduce @KBentley57's scenario; the default is -O3). The files are the individual components of the Silesia corpus, compressed at level 17.

file | v1.4.0 | v1.4.3 | diff
dickens | 792 | 795 | 0.38%
mozilla | 837 | 823 | -1.67%
mr | 753 | 758 | 0.66%
nci | 2183 | 2228 | 2.06%
ooffice | 619 | 614 | -0.81%
osdb | 1057 | 1050 | -0.66%
reymont | 982 | 990 | 0.81%
samba | 1314 | 1308 | -0.46%
sao | 626 | 628 | 0.32%
webster | 925 | 940 | 1.62%
xml | 1848 | 1845 | -0.16%
x-ray | 465 | 466 | 0.22%
silesia.tar | 950 | 954 | 0.42%

The results are mixed; the advertised speed benefits are not clearly present. It's roughly the same, maybe very slightly faster (mostly for nci).

This surprised me, so I re-ran the test using -O3 (the default):

file | v1.4.0 | v1.4.3 | diff
dickens | 729 | 775 | 6.31%
mozilla | 795 | 827 | 4.03%
mr | 698 | 741 | 6.16%
nci | 2172 | 2236 | 2.95%
ooffice | 583 | 611 | 4.80%
osdb | 991 | 1046 | 5.55%
reymont | 918 | 971 | 5.77%
samba | 1242 | 1300 | 4.67%
sao | 589 | 631 | 7.13%
webster | 861 | 923 | 7.20%
xml | 1774 | 1841 | 3.78%
x-ray | 429 | 456 | 6.29%
silesia.tar | 908 | 948 | 4.41%

Now we are talking: the gains are clearly visible. They fall a bit short of the advertised 7%, but they're definitely there.

But there is a bit more to it. Compare the two tables: moving from -O2 to -O3 is not a gain. Here is the comparison for v1.4.3:

file | -O2 | -O3 | diff
dickens | 795 | 775 | -2.52%
mozilla | 823 | 827 | 0.49%
mr | 758 | 741 | -2.24%
nci | 2228 | 2236 | 0.36%
ooffice | 614 | 611 | -0.49%
osdb | 1050 | 1046 | -0.38%
reymont | 990 | 971 | -1.92%
samba | 1308 | 1300 | -0.61%
sao | 628 | 631 | 0.48%
webster | 940 | 923 | -1.81%
xml | 1845 | 1841 | -0.22%
x-ray | 466 | 456 | -2.15%
silesia.tar | 954 | 948 | -0.63%

It's actually rather a loss! Which means, transitively, that it must have been even worse for v1.4.0:

file | -O2 | -O3 | diff
dickens | 792 | 729 | -7.95%
mozilla | 837 | 795 | -5.02%
mr | 753 | 698 | -7.30%
nci | 2183 | 2172 | -0.50%
ooffice | 619 | 583 | -5.82%
osdb | 1057 | 991 | -6.24%
reymont | 982 | 918 | -6.52%
samba | 1314 | 1242 | -5.48%
sao | 626 | 589 | -5.91%
webster | 925 | 861 | -6.92%
xml | 1848 | 1774 | -4.00%
x-ray | 465 | 429 | -7.74%
silesia.tar | 950 | 908 | -4.42%

Yes, it was worse.

Conclusions:

These experiments explain why, at the -O2 setting, there is no perceived benefit between v1.4.0 and v1.4.3, but they don't explain why @KBentley57's experiment shows a sizable loss of performance.

I would suggest trying -b -d on your platform, to see if it reproduces the issue. If it does, we will have to look into the library and find a sample which reproduces the issue. If it doesn't, then the issue could instead be in the CLI, or in I/O conditions.

FYI, I tried to time the CLI decompression on my test platform, but could not reproduce any meaningful difference so far (the measurement noise was higher than any potential difference between v1.4.0 and v1.4.3).

KBentley57 commented 5 years ago

@Cyan4973

That is some great insight! Thank you for looking into it so thoroughly.

After I had given it a little more thought, I was questioning why I wasn't compiling at O3 instead of O2, to take advantage of vectorization. I'm glad you tested that as well. I will try it tomorrow, alongside gcc-{6,7,8} with O{2,3}, and see if I can't help pin down the issue. I'm glad you reminded me of the benchmark mode; I knew it was in there, but it completely slipped my mind when I was testing this morning. I'll post a representative file too.

Thanks,

Kyle

mgrice commented 5 years ago

I think it's probable that, for GCC and clang, vectorization is almost entirely bad for zstd performance. It certainly was in every instance I looked at, but I stopped short of disabling it completely. Compiler vectorization introduces high startup costs for a loop (checking length and overlap) that have to be amortized against an assumed high trip count. In the case of zstd decoding, that average trip count is actually very low, most likely 1.

I turned off auto-vectorization for decoding in PR1668 and replaced it with a hand-vectorized version that I wrote with processors >= Sandy Bridge (2012) in mind, for which 16-byte operations are not meaningfully more expensive than 8-byte ones. It is possible that for an E5620, which is a bit older, that assumption does not hold.

KBentley57 commented 5 years ago

I had some time to run a few tests on my laptop at home; I'm afraid I haven't made the time for it yet at work. The results are interesting, and similar to what I observed with different data.

I used a representative sample of data of about ~18 MB (it's an OpenVDB grid, for the curious) and put it in /dev/shm again. I compiled 1.4.0 and 1.4.3 under the Release build, which adds O3, and under RelWithDebInfo, which adds O2. My tests show that 1.4.0 compiled under O3 beats 1.4.3 under O3 by about 2-3% across the board, and that 1.4.3 under O2 loses to 1.4.0 under O2 at levels 1-6 but beats it at levels >= 7.

Specs: Debian 10, Intel Core i7-5600U, GCC 8.3.0-6

First up are the compression tests. These don't really affect me, but here are the results for the sake of completeness. It's worth noting that, as suspected, O2 beats O3 in some cases. image

The decompression test is next. Note that here I'm not timing the unzstd binary as in the original post, but using the internal benchmark as suggested. Levels 1-19 were tested with the command shown on the plot. This is a single run, not a statistical analysis. Note that the Y-axis doesn't start at 0; don't be misled by the heights of the bars. image

To highlight the differences, here's a plot of the percent difference between 1.4.0 and 1.4.3 when compiled under O3. The mean value is 2% in favor of 1.4.0, but for the most part, the difference is in the 3-4% range. image

I'll do my best to carve out a few minutes at work to try this again, but I think the results here make the case that at least some reconsideration ought to be given to the hand-rolled vectorized loop in the 1.4.x patch, as this is a fairly modern CPU.

Here's the data in tabular form.

Compression
--
Level | 1.4.0 O3 | 1.4.0 O2 | 1.4.3 O3 | 1.4.3 O2
1 | 336.200 | 313.400 | 329.200 | 332.100
2 | 255.600 | 242.200 | 255.500 | 255.900
3 | 139.800 | 140.900 | 140.000 | 144.600
4 | 103.700 | 103.400 | 114.900 | 120.400
5 | 42.400 | 37.900 | 42.600 | 38.100
6 | 29.200 | 26.400 | 28.500 | 25.600
7 | 27.700 | 25.400 | 27.300 | 24.400
8 | 26.700 | 24.400 | 25.500 | 23.400
9 | 25.100 | 22.700 | 23.600 | 22.000
10 | 20.100 | 18.200 | 18.800 | 17.700
11 | 19.700 | 18.300 | 18.600 | 17.500
12 | 18.900 | 17.800 | 18.100 | 17.100
13 | 13.500 | 13.100 | 13.400 | 13.000
14 | 13.500 | 13.100 | 13.300 | 13.000
15 | 11.800 | 11.400 | 11.400 | 11.200
16 | 10.110 | 9.900 | 9.970 | 9.230
17 | 7.310 | 7.090 | 7.210 | 6.990
18 | 5.950 | 5.850 | 5.890 | 5.720
19 | 5.000 | 4.860 | 4.920 | 4.840

Decompression
--
Level | 1.4.0 O3 | 1.4.0 O2 | 1.4.3 O3 | 1.4.3 O2
1 | 775.200 | 755.600 | 777.600 | 754.200
2 | 753.100 | 733.400 | 752.400 | 713.600
3 | 743.300 | 710.200 | 719.900 | 693.500
4 | 735.000 | 690.100 | 710.200 | 684.000
5 | 719.200 | 683.400 | 693.600 | 676.900
6 | 720.000 | 685.900 | 699.900 | 685.100
7 | 735.600 | 696.000 | 711.600 | 701.000
8 | 733.900 | 693.600 | 710.300 | 703.600
9 | 734.100 | 695.300 | 708.300 | 708.900
10 | 723.100 | 689.000 | 702.100 | 705.400
11 | 714.600 | 685.800 | 698.400 | 703.700
12 | 717.300 | 687.100 | 698.600 | 705.500
13 | 718.500 | 691.400 | 704.300 | 709.800
14 | 715.000 | 689.400 | 702.600 | 709.200
15 | 721.700 | 692.000 | 709.200 | 717.200
16 | 712.300 | 683.600 | 697.700 | 705.800
17 | 610.700 | 609.600 | 615.200 | 625.300
18 | 455.100 | 464.500 | 453.900 | 463.300
19 | 417.200 | 428.500 | 427.400 | 439.100

Comparison
--
Level | O3 Delta
1 | -0.31%
2 | 0.09%
3 | 3.25%
4 | 3.49%
5 | 3.69%
6 | 2.87%
7 | 3.37%
8 | 3.32%
9 | 3.64%
10 | 2.99%
11 | 2.32%
12 | 2.68%
13 | 2.02%
14 | 1.76%
15 | 1.76%
16 | 2.09%
17 | -0.73%
18 | 0.26%
19 | -2.39%

The complete CPU specs - I'd also be interested to see how the recent Intel bug fixes affect this.

vendor_id   : GenuineIntel
cpu family  : 6
model       : 61
model name  : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
stepping    : 4
microcode   : 0x2d
cpu MHz     : 1335.493
cache size  : 4096 KB
physical id : 0
siblings    : 4
core id     : 1
cpu cores   : 2
apicid      : 3
initial apicid  : 3
fpu     : yes
fpu_exception   : yes
cpuid level : 20
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap intel_pt xsaveopt dtherm ida arat pln pts md_clear flush_l1d
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips    : 5188.22
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
KBentley57 commented 5 years ago

I ran the same test as above on my work PC. The results are a little different here than I was expecting. The data is identical to the case above, but different from that in the original post. The results show 1.4.0 and 1.4.3 being closer in performance than they were on the newer PC. Here are the decompression results for discussion.

1.4.0 O3 beats or ties 1.4.3 O3 across a large range of levels, but not by a lot. Again, check the y-axis. Besides providing good evidence for an upgrade of my development rig, there's not much going on. It does run contrary to the reported gains, though. image

Shown here is the percent difference. The lead is small or nonexistent in more cases than not. image

The specs of the PC are the same as in the first few posts.

Cyan4973 commented 5 years ago

Thanks @KBentley57 , that's interesting indeed, it shows the picture is not that clear.

I'm also impressed by the very large drop in decompression speed between levels 16 and 19. That's something I'm not used to seeing, though it could be sample-specific.

edit: or maybe it's related to the 4 MB cache of the target CPU, since higher levels increase the window size up to 8 MB, resulting in more cache misses.
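If the window size is the suspect, the CLI's advanced parameters allow a quick check: recompress at a high level but cap the window log so the working set fits in the 4 MB cache. This is a hypothetical experiment; the file name is a placeholder:

```shell
zstd -19 --zstd=wlog=21 sample.bin -o sample_w21.zst   # cap window at 2^21 = 2 MB
zstd -b -d sample_w21.zst                              # then benchmark decompression
```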

KBentley57 commented 5 years ago

@Cyan4973 I want to stress that the axes could be misleading at first glance: the decompression speed doesn't approach zero, but in both cases drops to just a little over half the maximum speed of the other levels.

I'm a little perplexed, however. On one hand it doesn't really matter, since they're so close, but on my laptop there's clearly a performance difference between the two versions, at least for my type of data. On the other hand, the accumulation of small time / energy savings is significant over the course of a large Monte Carlo run consisting of > 1 million trials. Am I looking too hard at what is likely an unpredictable quantity?

I saw that there were a few commits on the PR page. Have you done any investigation into this? Not pushing, just asking if anything obvious has poked its head out.

Thanks,

Kyle

Cyan4973 commented 5 years ago

No, unfortunately, no easy conclusion here. It's likely worth an investigation, so we'll start one.

terrelln commented 5 years ago

@KBentley57 what is the compression ratio of your file?

I'm going to start the following benchmark on my machine, which produces very stable benchmark results, and I want to make sure to include a representative file.

#!/usr/bin/env sh

PROG="$0"
USAGE="$PROG 0zstd 1zstd FILES..."

PREFIX="taskset --cpu-list 0"

ZSTD0="$1"
shift
ZSTD1="$1"
shift

if [ "x$ZSTD0" == "x" ]; then
        echo $USAGE
        exit 1
fi

if [ "x$ZSTD1" == "x" ]; then
        echo $USAGE
        echi 1
fi

levels=$(seq 1 19)

echo "Compressing each file with each level"
for file in $@; do
        for level in $levels; do
                ofile="$file.zst.$level"
                if [ ! -f "$ofile" ]; then
                        $ZSTD1 "$file" -$level -o "$file.zst.$level"
                fi
        done
done

echo "Disabling Turbo"
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

for file in $@; do
        echo "Benchmarking on $file"
        for ZSTD in $ZSTD0 $ZSTD1; do
                echo $ZSTD
                for level in $levels; do
                        $ZSTD -b -d "$file.zst.$level"
                done
        done
done

echo "Enabling Turbo"
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
terrelln commented 5 years ago

The difference in decompression speed between levels 16, 17, 18, and 19 can be explained by the minimum match length.

There must be somewhat more 4-byte matches than 5-byte matches, and a lot more 3-byte matches.
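This hypothesis can be probed with the CLI's advanced parameters: keep the level fixed but force a larger minimum match length, and compare decompression speeds. A hypothetical experiment; the file name is a placeholder, and valid mml values are 3-7:

```shell
zstd -19 --zstd=mml=5 sample.bin -o sample_mml5.zst   # level 19, but 5-byte minimum matches
zstd -b -d sample_mml5.zst
```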

terrelln commented 5 years ago

@KBentley57 would it be possible to regenerate an OpenVDB file containing data that you can share and that exhibits the same performance regression? We suspect it is either the file or the CPU specs, but we're not sure which. Having the file would really help us narrow down the issue.

KBentley57 commented 5 years ago

@Cyan4973 Thanks, I'll do what I can to help out, if it's needed.

@terrelln I am working on getting the OK for that right now. I can't promise it'll be tomorrow, but by Friday afternoon I should have a few test cases that you can try. Generally, we see compression ratios anywhere from 1.4 and up. That depends on a few parameters, but I'd put it in the range (1.4, 2.0).

KBentley57 commented 5 years ago

All,

Sorry for the delay. While I'm afraid I can't provide the actual data file yet, OpenVDB has many sample voxel files that are similar enough to my use case that I think they should be sufficient. Here is one that is nearly identical in file size and roughly the same density, etc.:

https://nexus.aswf.io/content/repositories/releases/io/aswf/openvdb/models/smoke2.vdb/1.0.0/smoke2.vdb-1.0.0.zip

There are a few differences, notably that I'm using HalfFloat (an Industrial Light and Magic (ILM) component of OpenVDB), so the leaf nodes in this structure would be twice as large. I'm not certain whether that matters. They are storing two grids; their second grid is a vec3, whereas I'm storing a second grid of scalar values, again HalfFloat. The compression ratios for this file aren't nearly as high as for my data, achieving only around 1.1.

Here are a few of my numbers on the old work machine. The case still remains that 1.4.0 O2 beats 1.4.3 O3 in a significant number of cases, though not by much. Under O2, 1.4.3 wins a few, loses a few, etc.

image

Here's a look at the differences between the two versions when compiled under O2 vs O3 (1.4.0 - 1.4.3).

image

@terrelln I effectively ran your script and experienced similar results. BTW, I'm not sure if it was a copy/paste error, but there's a typo in the second if [ ... ] fi block, where echi should be exit.

And finally, the tabular data in case it reveals anything more -

Level | 1.4.0 O3 | 1.4.0 O2 | 1.4.3 O3 | 1.4.3 O2
1 | 2592.3 | 2649.3 | 2578.2 | 2561.8
2 | 2348.3 | 2396.2 | 2355.3 | 2355.3
3 | 2204.3 | 2226.3 | 2218 | 2116.2
4 | 2174.1 | 2174.1 | 2152.8 | 2187.9
5 | 2085 | 2106.5 | 2053.3 | 1980.8
6 | 2044.3 | 2085 | 2064 | 2104.3
7 | 2209.5 | 2263.7 | 2218 | 2236.5
8 | 2150.9 | 2302.3 | 2256.8 | 2294.9
9 | 2302.9 | 2325.1 | 2287.3 | 2341.2
10 | 2271.8 | 2294.9 | 2264.7 | 2287.3
11 | 2248.8 | 2264.7 | 2264.7 | 2240.6
12 | 2294.9 | 2302.3 | 2279.6 | 2295.5
13 | 2317.8 | 2325.1 | 2348.3 | 2372
14 | 2302.3 | 2336 | 2355.3 | 2372
15 | 2325.1 | 2355.3 | 2378.8 | 2409.3
16 | 2226.3 | 2240.6 | 2287.3 | 2256.8
17 | 2054.6 | 2073.8 | 2125.6 | 2052.2
18 | 1358.8 | 1405.3 | 1402.9 | 1280.9
19 | 1465.1 | 1495.6 | 1421.3 | 1433.4
Cyan4973 commented 5 years ago

We have a project to look further into these -O2 / -O3 differences. It should start soon, and will give us better visibility into what's going on and what the best settings are.

On the specific issue of the performance comparison between 1.4.0 and 1.4.3, the topic is becoming obsolete: the next version, v1.4.4, features substantial changes to the decompression algorithm, resulting in dramatically better speed. As a consequence, the code already merged in the dev branch is no longer comparable.

I would suggest testing the current code in dev to ensure that it does indeed provide better performance for your setup too.
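A sketch of how such a test might look (the repository URL and branch name are the public ones; the benchmark file is a placeholder):

```shell
git clone --branch dev https://github.com/facebook/zstd.git
cd zstd
make zstd MOREFLAGS="-O2"        # match the super-build's optimization level
./zstd -b -d sample.bin.zst      # in-memory decompression benchmark
```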

KBentley57 commented 5 years ago

@Cyan4973

I'll give that a look sometime soon. If that's the case, should this issue be marked as closed? I don't want it to run on forever now that it has fulfilled its purpose.

Thanks,

Kyle