Closed by g1mv 6 years ago
Comment by lesnake from Wednesday Aug 14, 2013 at 09:12 GMT
Perhaps these pages will contain some clues:

- Core i5-2400 spec: cache size 6 MB
- Atom D525 spec: cache size 1 MB
- AMD Opteron 4170 HE spec: cache size 6 MB
The performance drop occurs on the processor with the smallest cache.
SHARC may be more sensitive to processor cache size than its Squash competitors, since until a recent release SHARC used big memory buffers (2x4 MB buffers with -c0, 3x4 MB buffers with -c1).
AFAIK, the release that uses only 1 MB buffers does much better, though still not enough.
I am trying to build squash on my Core 2 Duo T8100 (3 MB cache)... virtualized Linux.
Pierre
Comment by gpnuma from Wednesday Aug 14, 2013 at 10:59 GMT
Yes, that is interesting, as the buffers are now even smaller: 256 KB! As @nemequ states, the benchmarks are not I/O limited:
> Both systems have 4 GB RAM. The rest is described at http://quixdb.github.io/squash/#benchmarks
> The benchmarks use CPU time, not wall-clock, so I/O isn't a factor. Furthermore, all the algorithms use the same setup (see https://github.com/quixdb/squash/blob/master/benchmark/benchmark.c for the source code), and the interesting part is the drop in performance relative to other algorithms on the same system (using the same rotational media, RAM, etc.). Compressing with SHARC goes from being twice as fast as its closest competitor (Snappy) on the i5-2400 to 12% slower than QuickLZ, and roughly the same speed as LZO and Snappy. Obviously some algorithms perform differently on different architectures, but the change in SHARC is extreme.
So it could indeed possibly be a processor cache related problem.
Comment by nemequ from Wednesday Aug 14, 2013 at 11:22 GMT
Oh, you found a better link for the Opteron 4170 HE, thanks! I've adjusted the Squash page.
I'm running the benchmarks on some ARM boards right now and SHARC is not faring very well. BeagleBoard-xM is already up, Raspberry Pi and PandaBoard ES should be up tomorrow.
Comment by gpnuma from Wednesday Aug 14, 2013 at 14:07 GMT
From @akdjka's comment #7:

> Core2Duo, Debian Wheezy, gcc 4.7 -O3, custom test data (not very special, a mix of different files). sharc 1cade84 tested with its own executable. LZ4 r97 tested with fsbench 0.14.2.
>
> ```
> codec     c.ratio  speed (MB/s)
> LZ4       1.75     233
> sharc c0  1.47     245
> sharc c1  1.70     161
> ```
>
> Not sure if I'd call it an 'issue'.
It's not really an issue, it's rather a possible optimization: sharc -c1 is 30% faster on i5/i7 with an equivalent compression ratio.
Comment by gpnuma from Wednesday Aug 14, 2013 at 23:11 GMT
You need the following JavaScript to view the file properly: http://quixdb.github.io/squash/benchmarks/benchmark.js. These results are really impressive though... SHARC 3 times faster than anything else!? Sounds almost unreal! @akdjka: could you run squash on your platform for a comparison? Thanks
Comment by nemequ from Wednesday Aug 14, 2013 at 23:29 GMT
> You need the following JavaScript to view the file properly: http://quixdb.github.io/squash/benchmarks/benchmark.js.
I just modified Squash. For future benchmarks the JS is now included in the HTML. Also, the JSON is now readable.
Comment by akdjka from Friday Aug 16, 2013 at 07:40 GMT
@gpnuma: I'm having trouble building it. Don't have time to debug now. OT: @nemequ: It's bad that the script depends on Google to do such a basic task... aside from the obvious issues with lack of an internet connection, when I'm benchmarking something it's not Google's business to know what I'm up to. Something like a CSV would be much better IMHO.
Comment by nemequ from Friday Aug 16, 2013 at 08:07 GMT
> I'm having trouble building it. Don't have time to debug now.
If it's squash that is giving you trouble (not the dependencies, which are basically just libltdl right now), please file an issue. I've only really tried compiling on Linux with GCC, so I would love to hear about issues on other platforms so I can try to eliminate them.
> It's bad that the script depends on Google to do such a basic task...
As someone who has written a JavaScript charting library with many of the same features (multiple interactive views of the same data, live updates, animations, etc.), I can assure you it's not a basic task.
> aside from the obvious issues with lack of an internet connection, when I'm benchmarking something it's not Google's business to know what I'm up to. Something like a CSV would be much better IMHO.
Just look at the JSON, it's no worse than CSV. You don't even have to make the HTML if you don't want to--just do `make data.json` instead of `make benchmark.html`. Even if you do make the HTML, data.json is still generated as an intermediate file.
Comment by akdjka from Friday Aug 16, 2013 at 10:23 GMT
The major advantage of CSV is that you can load it into your spreadsheet software and suddenly have all the data processing and presentation tools you need available with a few clicks. As someone who's done it many times, I can assure you that it is a basic task.
Comment by gpnuma from Friday Aug 16, 2013 at 12:42 GMT
The thing is, there is a major discrepancy between the enwik8 fileset (176253.82 KB/s) and the iliad fileset (70413.33 KB/s): 2.5 times slower, which is very difficult to explain, as the file type cannot have such an impact on performance... This problem makes the results hard to take into account. Maybe there is an inconsistency in the benchmark? I didn't look at the code, but maybe @nemequ you can explain this?
Comment by nemequ from Friday Aug 16, 2013 at 18:27 GMT
> The thing is, there is a major discrepancy between the enwik8 fileset (176253.82 KB/s) and the iliad fileset (70413.33 KB/s): 2.5 times slower, which is very difficult to explain, as the file type cannot have such an impact on performance... This problem makes the results hard to take into account. Maybe there is an inconsistency in the benchmark? I didn't look at the code, but maybe @nemequ you can explain this?
Just looked into this a bit. Obviously, with a smaller dataset, setup/teardown is more of a factor. With that in mind, what I came up with is that sharc_resetDictionary is slow. I wrote a quick program to benchmark setup and teardown. With the calls to sharc_resetDictionary in place (squash-sharc-stream.c:102-103), 1000 setup/teardown iterations take 2.11802 seconds. If I comment those two lines out, 0.0538754 seconds. FWIW, gzip takes 0.153816 seconds.
Comment by nemequ from Friday Aug 16, 2013 at 19:13 GMT
> The major advantage of CSV is that you can load it into your spreadsheet software and suddenly have all the data processing and presentation tools you need available with a few clicks.
Okay, I just added support for CSV. You'll have to run the benchmark manually, something like `./benchmark -f csv -o data.csv iliad enwik8`, but it's there. If you have other issues with squash, please use the squash issue tracker so we don't hijack this thread.
Comment by gpnuma from Friday Aug 16, 2013 at 21:16 GMT
@nemequ great stuff thanks for that, I'll have a closer look and study this reset of dictionaries. On what platform were you running your program to test setup/teardown ?
Furthermore, what's surprising is that the compression throughput is the same on the Core i5 (around 480 MB/s) and roughly equivalent on the AMD Opteron (220-180 MB/s) for both iliad and enwik8, although the dictionary reset also runs on those two platforms!!?
Comment by nemequ from Friday Aug 16, 2013 at 22:35 GMT
> @nemequ great stuff thanks for that, I'll have a closer look and study this reset of dictionaries. On what platform were you running your program to test setup/teardown ?
The Atom D525 which I gave you a shell account on, so you should be able to reproduce if you want.
> Furthermore, what's surprising is that the compression throughput is the same on the Core i5 (around 480 MB/s) and roughly equivalent on the AMD Opteron (220-180 MB/s) for both iliad and enwik8, although the dictionary reset also runs on those two platforms!!?
Yes, it's the same code. I just re-ran the benchmarks on the Opteron and Core i5-2400 to be sure (and added a Core i3-2105). I guess the relevant instructions generated are just implemented more efficiently there.
Comment by gpnuma from Saturday Aug 17, 2013 at 17:10 GMT
Ok, for anyone interested, I tried the following program: https://gist.github.com/gpnuma/6257717. The goal is to zero a block of memory containing 65536 8-byte entries, repeated 100 000 times. The z value is there to make sure that, despite maximum optimizations, the compiler will not eliminate the tested code on the grounds that its result is never used. On a Core i7 with OS X, here is the output:
```
$ gcc -O3 -std=gnu99 test.c
$ time ./a.out
NEUTRAL ADD z = 11841950658684580795

real    0m0.004s
user    0m0.001s
sys     0m0.002s

$ gcc -O3 -std=gnu99 -DMEMSET test.c
$ time ./a.out
MEMSET z = 0

real    0m2.265s
user    0m2.262s
sys     0m0.002s

$ gcc -O3 -std=gnu99 -DRESET test.c
$ time ./a.out
RESET z = 0

real    0m2.259s
user    0m2.256s
sys     0m0.002s
```
And on the Atom D525 (thanks @nemequ for the ssh):
```
$ gcc -O3 -std=gnu99 test.c
$ time ./a.out
NEUTRAL ADD z = 263705969750378474

real    0m0.011s
user    0m0.010s
sys     0m0.001s

$ gcc -O3 -std=gnu99 -DRESET test.c
$ time ./a.out
RESET z = 0

real    0m14.724s
user    0m14.697s
sys     0m0.005s

$ gcc -O3 -std=gnu99 -DMEMSET test.c
$ time ./a.out
MEMSET z = 0

real    0m14.668s
user    0m14.655s
sys     0m0.004s
```
Two conclusions can be drawn: the explicit reset loop and memset perform identically (the compiler presumably generates equivalent code for both), and the Atom is roughly 6.5 times slower than the Core i7 at this operation.
But overall, a dictionary reset only takes 0.00014724 seconds on the Atom and 0.00002259 seconds (!) on the Core i7, which is very good news for me: there is absolutely no optimization to make here, as only a few resets happen during compression. That still made me wonder why there is such a discrepancy between the iliad fileset and the enwik8 one... So I tried them both on the Atom, and although there is no ramdisk, the results are very consistent:
```
$ ./sharc -v
Centaurean Sharc 0.9.8
Built for GNU/Linux (Little endian system, 64 bits) using GCC 4.8.1, Aug 16 2013 15:59:23

$ ./sharc -n enwik8 enwik8 enwik8 enwik8 enwik8
Compressed enwik8 to enwik8.sharc, 100000000 bytes in, 61841352 bytes out, Ratio out / in = 61.8%, Time = 1.671 s, Speed = 57 MB/s
Compressed enwik8 to enwik8.sharc, 100000000 bytes in, 61841352 bytes out, Ratio out / in = 61.8%, Time = 1.637 s, Speed = 58 MB/s
Compressed enwik8 to enwik8.sharc, 100000000 bytes in, 61841352 bytes out, Ratio out / in = 61.8%, Time = 1.645 s, Speed = 58 MB/s
Compressed enwik8 to enwik8.sharc, 100000000 bytes in, 61841352 bytes out, Ratio out / in = 61.8%, Time = 1.659 s, Speed = 57 MB/s
Compressed enwik8 to enwik8.sharc, 100000000 bytes in, 61841352 bytes out, Ratio out / in = 61.8%, Time = 1.655 s, Speed = 58 MB/s

$ ./sharc -n iliad iliad iliad iliad iliad
Compressed iliad to iliad.sharc, 1308638 bytes in, 793152 bytes out, Ratio out / in = 60.6%, Time = 0.027 s, Speed = 47 MB/s
Compressed iliad to iliad.sharc, 1308638 bytes in, 793152 bytes out, Ratio out / in = 60.6%, Time = 0.024 s, Speed = 52 MB/s
Compressed iliad to iliad.sharc, 1308638 bytes in, 793152 bytes out, Ratio out / in = 60.6%, Time = 0.021 s, Speed = 60 MB/s
Compressed iliad to iliad.sharc, 1308638 bytes in, 793152 bytes out, Ratio out / in = 60.6%, Time = 0.021 s, Speed = 61 MB/s
Compressed iliad to iliad.sharc, 1308638 bytes in, 793152 bytes out, Ratio out / in = 60.6%, Time = 0.021 s, Speed = 60 MB/s

$ ./sharc -n -c1 enwik8 enwik8 enwik8 enwik8 enwik8
Compressed enwik8 to enwik8.sharc, 100000000 bytes in, 57421352 bytes out, Ratio out / in = 57.4%, Time = 2.757 s, Speed = 35 MB/s
Compressed enwik8 to enwik8.sharc, 100000000 bytes in, 57421352 bytes out, Ratio out / in = 57.4%, Time = 2.571 s, Speed = 37 MB/s
Compressed enwik8 to enwik8.sharc, 100000000 bytes in, 57421352 bytes out, Ratio out / in = 57.4%, Time = 2.636 s, Speed = 36 MB/s
Compressed enwik8 to enwik8.sharc, 100000000 bytes in, 57421352 bytes out, Ratio out / in = 57.4%, Time = 2.577 s, Speed = 37 MB/s
Compressed enwik8 to enwik8.sharc, 100000000 bytes in, 57421352 bytes out, Ratio out / in = 57.4%, Time = 2.577 s, Speed = 37 MB/s

$ ./sharc -n -c1 iliad iliad iliad iliad iliad
Compressed iliad to iliad.sharc, 1308638 bytes in, 704280 bytes out, Ratio out / in = 53.8%, Time = 0.035 s, Speed = 36 MB/s
Compressed iliad to iliad.sharc, 1308638 bytes in, 704280 bytes out, Ratio out / in = 53.8%, Time = 0.033 s, Speed = 38 MB/s
Compressed iliad to iliad.sharc, 1308638 bytes in, 704280 bytes out, Ratio out / in = 53.8%, Time = 0.032 s, Speed = 38 MB/s
Compressed iliad to iliad.sharc, 1308638 bytes in, 704280 bytes out, Ratio out / in = 53.8%, Time = 0.034 s, Speed = 37 MB/s
Compressed iliad to iliad.sharc, 1308638 bytes in, 704280 bytes out, Ratio out / in = 53.8%, Time = 0.038 s, Speed = 33 MB/s
```
So clearly the file size really doesn't change anything. I also checked how the timing is done in SHARC, and it does include the dictionary resets... so maybe there's something to fix in the squash benchmark then? It is also very clear that compression speed is not very good on the Intel Atom :-)
Comment by nemequ from Saturday Aug 17, 2013 at 19:43 GMT
> That still made me wonder why there is such a discrepancy between the iliad fileset and the enwik8 one... So I tried them both on the Atom, and although there is no ramdisk, the results are very consistent:
They are pretty consistent in the squash benchmark on the Atom D525, too. The problem is the Core 2 Duo T8100 VM (with a 32-bit Arch guest OS), which I don't have access to. All I can say is that the sharc_resetDictionary calls dominate the time it takes to set up the stream, at least on an Atom D525.
> It is also very clear that compression speed is not very good on the Intel Atom :-)
Yes, and I'd love to see results on a newer Atom (e.g., Clover Trail+), but TBH I'm more disappointed by the performance on ARM. A lot of stuff these days is moving towards low-power architectures like ARM (and, Intel hopes, Atom), but it seems like research on compression algorithms has pretty much ignored that area—the best speeds seem to come from LZO, which was developed in (IIRC) the late nineties.
Comment by gpnuma from Saturday Aug 17, 2013 at 23:06 GMT
> That still made me wonder why there is such a discrepancy between the iliad fileset and the enwik8 one... So I tried them both on the Atom, and although there is no ramdisk, the results are very consistent:
> They are pretty consistent in the squash benchmark on the Atom D525, too. The problem is the Core 2 Duo T8100 VM (with a 32-bit Arch guest OS), which I don't have access to. All I can say is that the sharc_resetDictionary calls dominate the time it takes to set up the stream, at least on an Atom D525.
You're absolutely right on that ... @lesnake would it be possible for you to relaunch the bench to confirm the numbers again ?
> Yes, and I'd love to see results on a newer Atom (e.g., Clover Trail+), but TBH I'm more disappointed by the performance on ARM. A lot of stuff these days is moving towards low-power architectures like ARM (and, Intel hopes, Atom), but it seems like research on compression algorithms has pretty much ignored that area—the best speeds seem to come from LZO, which was developed in (IIRC) the late nineties.
That's right, but maybe LZO is in fact better on ARM precisely because it was developed for 1990s processors! Anyway, SHARC aims to be fast on as many platforms as possible, so we're not giving up on that; it's actually the main reason for this thread :smiley:
Comment by lesnake from Sunday Aug 18, 2013 at 17:14 GMT
I just ran squash using live ubuntu session on the same machine. Results are quite disappointing...
The effects of virtualization are unequal from one program to another!!
Here you can see the ratio (virtualized speed / native speed) for each program.
As an example: SHARC is 3 times slower on iliad on a virtualized platform than on a native machine, but so is LZ4 decompression (also on enwik8), which we hadn't paid attention to.
Fun fact: with this testing technique, LZO decompression on enwik8 appears faster when virtualized.
So it seems worthwhile to take the impact of virtualization into account in these tests.
lesnake
Comment by nemequ from Sunday Aug 18, 2013 at 18:28 GMT
@lesnake, what virtualization technology are you using? The Core 2 Duo T8100 has VT-x, do you have it enabled? Sometimes you have to go into BIOS and flip a switch...
The Opteron benchmark on the Squash page is also virtualized, although I don't have access to the hardware so I can't compare to the non-virtualized version.
Comment by lesnake from Sunday Aug 18, 2013 at 18:33 GMT
I am using latest virtualbox, w/o guest additions. VT-x is enabled, both in bios and virtualbox.
Comment by gpnuma from Thursday Aug 22, 2013 at 11:23 GMT
A quick update on this thread, as significant progress has been made on low-end processors like the Atom with the latest commit 1bdb1aba97e259d8c1b5015e1d349bf90d066682. By reducing the size of some critical memory objects, they now fit better in the low-level (L1/L2) processor caches, and can even be completely contained by smaller caches like the Atom's (1 MB). Since the objects are smaller, one could expect cache line invalidations to happen more often on memory writes - that is the case, but overall performance is nevertheless much better: compression is about 10% faster on all platforms tested, with a high of 30% faster on the Atom.
These modifications do not affect - or probably not noticeably - the mobile ARM platforms, for which further work is required.
Comment by nemequ from Friday Aug 23, 2013 at 08:13 GMT
> These modifications do not affect - or probably not noticeably - the mobile ARM platforms, for which further work is required.
Actually, I'm seeing performance improvements comparable to Atom on the ARM platforms I'm testing (BeagleBoard-xM, PandaBoard ES, Raspberry Pi). Here is the diff where I updated the benchmarks: https://github.com/quixdb/squash/commit/a73cc4013ce2877b0d32fb4d26820e4d0c070f8a (there is a bit of noise since I also added a couple of codecs, but it's easy to sort through).
Comment by gpnuma from Friday Aug 23, 2013 at 11:20 GMT
Interesting! It's most likely because, although the caches are much smaller on ARM, a greater proportion of the distinct data can now fit in them. The cache mechanism must be quite efficient as well, juggling the extremely frequent memory-to-cache and cache-to-memory transfers caused by the data not fitting entirely in the cache; but since cache line invalidations are extremely frequent in the algorithm, this behavior occurs on every platform, probably at a slightly lower rate. Overall, I find the Atom performance is now more or less acceptable compared to the other x86_64 processors, but in my opinion ARM and other relatively-low-cache processors will need further work to get the same performance edge.
Comment by lesnake from Saturday Aug 24, 2013 at 15:54 GMT
x86:

- Core i7: cache size 6 MB
- Core i5-2400: cache size 6 MB
- AMD Opteron 4170 HE: cache size 6 MB
- Core i3-2105: cache size 3 MB
- Core 2 Duo T8100: cache size 3 MB
- Atom D525: cache size 1 MB

ARM:

- BeagleBoard-xM: 32 kB code/data cache, 256 kB L2
- BeagleBoard: 32 kB code, 80 kB data, 64 kB L2
- Raspberry Pi: 256 kB L2 cache
Embedded graphics accelerators seem to be able to access the cache directly, which may lower the cache available to the CPU. `cat /proc/cpuinfo` may give the proper values (?)
Has to be tested again with 0.14.0
Issue by gpnuma from Tuesday Aug 13, 2013 at 22:33 GMT Originally opened as https://github.com/centaurean/sharc/issues/10
According to these benchmarks from @nemequ (https://github.com/quixdb/squash) : http://quixdb.github.io/squash/benchmarks/core-i5-2400.html#enwik8 http://quixdb.github.io/squash/benchmarks/atom-d525.html#enwik8 SHARC seems to have a significant performance drop in comparison with other algorithms on the Atom platform, while being way ahead in compression speed on an Intel Core i5.