dcwatson / deflate

Python extension wrapper for libdeflate.
MIT License

benchmark deflate.crc32 against zlib.crc32 #21

Closed ThomasWaldmann closed 2 years ago

ThomasWaldmann commented 2 years ago

Note: I put the OPS (operations per second, as output by pytest-benchmark) here to have absolute numbers.

now it gets interesting:

so, on linux and x64, performance is as expected (libdeflate highly optimized, zlib not optimized).

on macOS Intel, it seems that CPU acceleration is not used and not much sw optimization either (by both libdeflate and zlib).

on macOS M1 (Apple Silicon), performance isn't as good as expected. here it seems zlib is optimized well and libdeflate only a little.

is there something wrong with how the deflate code is built?
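For readers who want to reproduce the comparison without pytest-benchmark, here is a minimal timing sketch. The helper name `ops_per_second`, the payload size, and the round count are illustrative choices, not taken from the PR; `deflate` is imported only if it is installed.

```python
import time
import zlib

try:
    import deflate  # the wrapper discussed in this thread; optional here
except ImportError:
    deflate = None


def ops_per_second(func, data, rounds=20):
    """Call func(data) `rounds` times and return calls per second."""
    start = time.perf_counter()
    for _ in range(rounds):
        func(data)
    elapsed = time.perf_counter() - start
    return rounds / elapsed


if __name__ == "__main__":
    # 10 MiB of zeros; the size is an arbitrary choice for illustration.
    payload = bytes(10 * 1024 * 1024)
    print("zlib.crc32   ", round(ops_per_second(zlib.crc32, payload), 1), "ops/s")
    if deflate is not None:
        print("deflate.crc32", round(ops_per_second(deflate.crc32, payload), 1), "ops/s")
```

This is much cruder than pytest-benchmark (no warmup, no statistics), so treat the numbers as a rough indication only.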

ThomasWaldmann commented 2 years ago

btw, i also tried libdeflate v1.10 (instead of v1.8), but it didn't speed up M1.

ThomasWaldmann commented 2 years ago

@dcwatson @braewoods can you have a look?

ghost commented 2 years ago

What are you using to build libdeflate? The x64 speedups should work on macOS x64 at least, but are probably getting disabled by the compiler.

ghost commented 2 years ago

Looks like the code will disable the enhanced versions at runtime if the CPU does not support pclmul, or pclmul AND avx. The other possibility is that the code isn't getting compiled into the library because the macros are not satisfied by the platform and/or compiler. I can't really look further since I lack access to macOS.
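As a rough way to see which variant a machine could use, the sketch below parses `/proc/cpuinfo`-style text on x86 Linux. It does not reflect how libdeflate itself detects features (that happens in C). The function names are hypothetical, the variant labels are borrowed from the forced-variant benchmark later in this thread, and preferring the AVX variant when both flags are present is an assumption.

```python
def cpu_flags(cpuinfo_text):
    """Return the flag names from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def crc32_variant(flags):
    """Pick a variant label; the preference order is an assumption."""
    if {"pclmulqdq", "avx"} <= flags:
        return "pclmul_avx"
    if "pclmulqdq" in flags:
        return "pclmul"
    return "default"


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:  # x86 Linux only
            print(crc32_variant(cpu_flags(fh.read())))
    except OSError:
        print("no /proc/cpuinfo on this platform")
```

On macOS there is no `/proc/cpuinfo`, so the script only reports that the file is missing.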

ghost commented 2 years ago

In any case it may imply there's a bug in libdeflate that is not enabling this enhancement on platforms that can use it.

ThomasWaldmann commented 2 years ago

i just did pip install -e . in the toplevel dir of this project. guess it uses clang to compile. macOS with M1 cpu, not x64.

ghost commented 2 years ago

Oh, GitHub CI. I missed that. Why it's acting like this on ARM I can't say right now. It may be the ARM optimizations are no good.

ghost commented 2 years ago

How can I access the GitHub CI program to get some environment data? I need to see what macros are getting defined to trace the source code.

ThomasWaldmann commented 2 years ago

https://github.com/dcwatson/deflate/runs/5365821999?check_suite_focus=true there is some stuff in the logs (you can expand the misc. steps), but guess if you need more, you could only open a PR and let it execute some custom code you put into the project.

ghost commented 2 years ago

Wait, I just thought of something. How do we know the macOS zlib isn't patched with the same optimizations that are used by Chromium's bundled zlib? That might explain the odd behavior. I know for sure that Ubuntu's zlib does not have the enhanced crc32 algorithms.

ThomasWaldmann commented 2 years ago

I'll add the OPS (operations per second) in my post above.

ghost commented 2 years ago

I'm going to try using a custom version of libdeflate that will spit out warnings when the CI compiles the optimized x64 code. If those do not appear it would tell me they're not even getting compiled in.

ThomasWaldmann commented 2 years ago

libdeflate/arm/cpu_features.h works only for linux. :-(

this seems better (assuming that it works):

https://github.com/zlib-ng/zlib-ng/blob/develop/arch/arm/arm_features.c

ghost commented 2 years ago

We may be able to fix that. Right now I'm investigating the x86 MacOS situation.

ghost commented 2 years ago

Well, the first tests have come back. The MacOS x64 code is getting compiled in. The next test is to find out whether the optimized version is being used at runtime.

ThomasWaldmann commented 2 years ago

Guess I'll merge this PR, so it is more easily available in master.

Guess most of what we found out / find out needs to go into libdeflate issue tracker.

ghost commented 2 years ago

Ok I think I have an explanation @ThomasWaldmann for what's going on now. On MacOS, the python zlib module is already optimized for crc32. This seems to be the case for both x64 and ARM (M1). libdeflate performs the same on x64 and worse on ARM due to broken or absent ARM optimizations. So it seems libdeflate has some issues on MacOS but works as expected on Linux which only uses vanilla zlib typically so it lacks the optimized crc32. As a workaround we could switch to the zlib crc32 for MacOS and use libdeflate everywhere else, at least until the crc32 function is fixed for libdeflate.
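The workaround described above could look roughly like this on the caller's side. This is a sketch, not code from Borg or from this wrapper; `pick_crc32` is a hypothetical helper, and the fallback to `zlib.crc32` when `deflate` is not installed is an added assumption.

```python
import sys
import zlib


def pick_crc32(platform=sys.platform):
    """Return zlib.crc32 on macOS, otherwise deflate.crc32 when the wrapper is importable."""
    if platform == "darwin":
        return zlib.crc32
    try:
        import deflate  # optional dependency; fall back to zlib if missing
    except ImportError:
        return zlib.crc32
    return deflate.crc32


crc32 = pick_crc32()
checksum = crc32(b"hello world")
```

Both functions compute the same standard CRC-32, so callers should get identical checksums whichever one is selected.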

ThomasWaldmann commented 2 years ago

Would be cool to have macOS fixed, for both Intel (which is getting phased out by Apple) and Apple Silicon / M1.

See what I found elsewhere: https://github.com/dcwatson/deflate/pull/21#issuecomment-1054758030

ghost commented 2 years ago

I don't believe it needs fixing on Intel. My tests show the optimized version is being included on MacOS Intel. I assume the python in use is using a zlib with optimized crc32 which would explain why it is roughly equivalent in performance.

ThomasWaldmann commented 2 years ago

Well, if macOS Intel would use CPU level hw optimizations, shouldn't it then get similar performance as Linux?

ghost commented 2 years ago

Not necessarily. The GitHub MacOS runners use different x64 hardware than the Linux or Windows runners. See here: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners

ghost commented 2 years ago

Since they run off different cloud infrastructure I don't think we can really compare the results from different operating systems even if the CPU architecture is the same. The underlying hardware could be totally different performance wise.

ghost commented 2 years ago

Well one way to settle this would be to test vanilla zlib against libdeflate and remove python zlib from the equation since it may be using an optimized version, skewing the results.
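A quick first check before building a vanilla zlib is to ask Python which zlib it was compiled against and which one it loaded at runtime. Both constants are part of the standard `zlib` module. Note that a version string will not reveal vendor patches, so this only narrows things down.

```python
import zlib

# Version of the zlib headers Python was built against.
print("compiled against zlib", zlib.ZLIB_VERSION)
# Version of the zlib shared library actually loaded by the interpreter.
print("running with zlib    ", zlib.ZLIB_RUNTIME_VERSION)
```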

ghost commented 2 years ago

I'll take a closer look. Now that I look at the raw numbers, something doesn't add up. Even though my tests show the optimized version is in use, something seems fishy.

ghost commented 2 years ago

My hypothesis still seems plausible. I tried forcing which variant of the crc32 function was in use and these are the results from the MacOS workers. The data is suggesting that the zlib implementation in use is more optimized than the libdeflate implementation. I don't know how else to explain the odd results.

Force default:
test_zlib_crc32        3.0582 (1.0)      0.0668 (1.0)      3.0461 (1.0)      326.9900 (1.0)    
test_deflate_crc32     6.0706 (1.99)     0.1012 (1.51)     6.0545 (1.99)     164.7278 (0.50)   

Force pclmul:
test_zlib_crc32        3.0957 (1.0)      0.0938 (1.62)     3.0807 (1.0)      323.0259 (1.0)    
test_deflate_crc32     3.2347 (1.04)     0.0578 (1.0)      3.2303 (1.05)     309.1519 (0.96)   

Force pclmul_avx:
test_zlib_crc32        3.1103 (1.0)      0.2151 (1.34)     3.0530 (1.0)      321.5145 (1.0)    
test_deflate_crc32     3.3284 (1.07)     0.1608 (1.0)      3.2771 (1.07)     300.4435 (0.93) 
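The column headers did not survive in the paste above. One sanity check is that the last column equals 1000 divided by the first in every row. That fits a reading where the first column is a time per call in milliseconds and the last is OPS, but the reading is an inference from the arithmetic, not something stated in the thread. The sketch below just redoes that division on the pasted numbers.

```python
# (variant, implementation, first column, last column), copied from the tables above.
rows = [
    ("default", "zlib", 3.0582, 326.9900),
    ("default", "deflate", 6.0706, 164.7278),
    ("pclmul", "zlib", 3.0957, 323.0259),
    ("pclmul", "deflate", 3.2347, 309.1519),
    ("pclmul_avx", "zlib", 3.1103, 321.5145),
    ("pclmul_avx", "deflate", 3.3284, 300.4435),
]


def ops_from_ms(milliseconds):
    """Convert a per-call time in milliseconds to calls per second."""
    return 1000.0 / milliseconds


for variant, impl, first_col, last_col in rows:
    print(f"{variant:11s} {impl:8s} {ops_from_ms(first_col):9.2f} {last_col:9.2f}")
```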

ghost commented 2 years ago

For Borg perhaps all we can do in the short term is to use the zlib crc32 implementation on all MacOS targets since it seems to be superior in all cases. libdeflate crc32 could be used everywhere else.

dcwatson commented 2 years ago

I fiddled with libdeflate a bit to try the different ARM CRC32 implementations on my M1. It seems like crc32_pmull is faster than crc32_arm, but both are slower than the zlib.crc32 implementation anyway. I also tried linking to a Homebrew-installed libdeflate, which zlib still beat handily. So I would agree with @braewoods that your best bet is to use zlib's implementation on macOS. I definitely don't think this wrapper should be in the business of trying to determine the fastest CRC32 implementation per machine.

ebiggers commented 2 years ago

Please try libdeflate v1.12. The performance of libdeflate_crc32() on Apple M1 has improved by about 8x and is significantly better than the Apple-provided zlib now.

ThomasWaldmann commented 2 years ago

@ebiggers cool, will re-benchmark it as soon as it lands in homebrew.

ThomasWaldmann commented 2 years ago

@ebiggers can confirm, great work! this benchmark is on macOS M1:

(borg-env) tw@mba2020 borg % borg benchmark cpu
Non-cryptographic checksums / hashes ===========================
crc32 (zlib, used)       1GB        0.055s
crc32 (libdeflate)       1GB        0.027s
xxh64                    1GB        0.122s
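
To put the timings above in throughput terms, this short calculation uses only the three numbers printed by `borg benchmark cpu`. It keeps the "GB" unit exactly as Borg labels it, without assuming a byte count.

```python
# Seconds per 1GB, copied from the benchmark output above.
timings = {
    "crc32 (zlib)": 0.055,
    "crc32 (libdeflate)": 0.027,
    "xxh64": 0.122,
}

for name, seconds in timings.items():
    print(f"{name:20s} {1.0 / seconds:6.1f} GB/s")

# How many times faster libdeflate's crc32 is than zlib's in this run.
speedup = timings["crc32 (zlib)"] / timings["crc32 (libdeflate)"]
print(f"libdeflate vs zlib: {speedup:.2f}x")
```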