ThomasWaldmann closed this issue 2 years ago.
btw, i also tried libdeflate v1.10 (instead of v1.8), but it didn't speed up M1.
@dcwatson @braewoods can you have a look?
What are you using to build libdeflate with? The x64 speed ups should work on macOS x64 at least but are probably getting disabled by the compiler.
Looks like the code will disable the enhanced versions at runtime if the CPU does not support pclmul (or, for the AVX variant, pclmul AND avx). The other possibility is that the code isn't getting compiled into the library because the macros are not satisfied by the platform and/or compiler. I can't really look further since I lack access to macOS.
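The runtime-dispatch pattern described here can be sketched in Python. This is a toy illustration only, not libdeflate's actual code; the feature check and variant names are placeholders:

```python
import zlib

def crc32_generic(data, crc=0):
    # portable fallback; zlib stands in for the scalar C implementation
    return zlib.crc32(data, crc)

def crc32_pclmul(data, crc=0):
    # stand-in for the PCLMULQDQ-accelerated variant
    return zlib.crc32(data, crc)

def cpu_has(feature):
    # placeholder: libdeflate does the real check via CPUID / hwcaps
    return False

# Resolve once, like libdeflate's function-pointer dispatch: if the CPU
# lacks pclmul, the enhanced version is silently disabled at runtime.
_impl = crc32_pclmul if cpu_has("pclmul") else crc32_generic

def crc32(data, crc=0):
    return _impl(data, crc)
```

If the macro guards fail at build time, the enhanced variants never exist at all, which is the second failure mode described above.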
In any case it may imply there's a bug in libdeflate that is not enabling this enhancement on platforms that can use it.
i just did pip install -e . in the toplevel dir of this project. guess it uses clang to compile. macOS with M1 cpu, not x64.
Oh, GitHub CI. I missed that. Why it's acting like this on ARM I can't say right now. It may be the ARM optimizations are no good.
How can I access the GitHub CI program to get some environment data? I need to see what macros are getting defined to trace the source code.
https://github.com/dcwatson/deflate/runs/5365821999?check_suite_focus=true there is some stuff in the logs (you can expand the misc. steps), but guess if you need more, you could only open a PR and let it execute some custom code you put into the project.
Wait, I just thought of something. How do we know the macOS zlib isn't patched with the same optimizations that are being used by Chromium's bundled zlib? That might explain the odd behavior. I know Ubuntu's zlib does not have the enhanced crc32 algorithms for sure.
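One quick way to probe this is to measure the raw throughput of the Python/system zlib. As a rough rule of thumb (the thresholds here are assumptions, not measured facts): a plain C zlib crc32 tends to stay well under ~1 GB/s, while SIMD- or CRC-instruction-accelerated builds (Chromium-style patches, Apple's zlib) reach several GB/s:

```python
# Sketch: measure zlib.crc32 throughput to guess whether the linked
# zlib carries an accelerated crc32. Block size and rep count are
# arbitrary choices for this sketch.
import timeit
import zlib

data = bytes(64 * 1024 * 1024)  # 64 MiB of zeros is fine for timing crc32
reps = 5
t = timeit.timeit(lambda: zlib.crc32(data), number=reps)
gbps = reps * len(data) / t / 1e9
print(f"zlib.crc32 throughput: {gbps:.2f} GB/s")
```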
I'll add the OPS (operations per second) in my post above.
I'm going to try using a custom version of libdeflate that will spit out warnings when the CI compiles the optimized x64 code. If those do not appear it would tell me they're not even getting compiled in.
libdeflate/arm/cpu_features.h
works only for linux. :-(
this seems better (assuming that it works):
https://github.com/zlib-ng/zlib-ng/blob/develop/arch/arm/arm_features.c
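On macOS/ARM the Linux hwcaps approach can't work, but the same information is exposed through sysctl. A ctypes sketch (hw.optional.armv8_crc32 is a real macOS sysctl key; error handling is minimal and the function just reports False on other platforms):

```python
# Sketch: detect the ARMv8 CRC32 instructions on macOS via sysctlbyname,
# the route zlib-ng's arm_features.c would need instead of Linux hwcaps.
import ctypes
import ctypes.util
import platform

def darwin_has_arm_crc32():
    if platform.system() != "Darwin":
        return False  # this probe only makes sense on macOS
    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    val = ctypes.c_int(0)
    size = ctypes.c_size_t(ctypes.sizeof(val))
    rc = libc.sysctlbyname(b"hw.optional.armv8_crc32",
                           ctypes.byref(val), ctypes.byref(size), None, 0)
    return rc == 0 and val.value == 1
```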
We may be able to fix that. Right now I'm investigating the x86 MacOS situation.
Well first tests come back. The MacOS x64 code is getting compiled in. Next test is to find out if the optimized version is being used at runtime.
Guess I'll merge this PR, so it is more easily available in master.
Guess most of what we found out / find out needs to go into libdeflate issue tracker.
Ok I think I have an explanation @ThomasWaldmann for what's going on now. On MacOS, the python zlib module is already optimized for crc32. This seems to be the case for both x64 and ARM (M1). libdeflate performs the same on x64 and worse on ARM due to broken or absent ARM optimizations. So it seems libdeflate has some issues on MacOS but works as expected on Linux which only uses vanilla zlib typically so it lacks the optimized crc32. As a workaround we could switch to the zlib crc32 for MacOS and use libdeflate everywhere else, at least until the crc32 function is fixed for libdeflate.
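The workaround could look like this (a sketch only; `deflate` is the wrapper module this project provides, and the platform check is the whole idea being proposed):

```python
# Sketch of the proposed workaround: prefer the system zlib crc32 on
# macOS, where it already ships optimized, and libdeflate elsewhere.
import platform
import zlib

try:
    import deflate  # this project's libdeflate wrapper; may be absent
except ImportError:
    deflate = None

def pick_crc32():
    """Return the crc32 callable expected to be fastest on this platform."""
    if platform.system() == "Darwin" or deflate is None:
        return zlib.crc32
    return deflate.crc32
```

Both implementations compute the same standard CRC-32, so callers can swap them freely.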
Would be cool to have macOS fixed, for both Intel (which is getting phased out by Apple) and Apple Silicon / M1.
See there what i found elsewhere: https://github.com/dcwatson/deflate/pull/21#issuecomment-1054758030
I don't believe it needs fixing on Intel. My tests show the optimized version is being included on MacOS Intel. I assume the python in use is using a zlib with optimized crc32 which would explain why it is roughly equivalent in performance.
Well, if macOS Intel were using CPU-level hw optimizations, shouldn't it then get performance similar to Linux?
Not necessarily. The GitHub MacOS runners use different x64 hardware than the Linux or Windows runners. See here: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners
Since they run off different cloud infrastructure I don't think we can really compare the results from different operating systems even if the CPU architecture is the same. The underlying hardware could be totally different performance wise.
Well one way to settle this would be to test vanilla zlib against libdeflate and remove python zlib from the equation since it may be using an optimized version, skewing the results.
I'll take a closer look. Now that I look at the raw numbers, something doesn't add up. Even though my tests show the optimized version is in use, something seems fishy.
My hypothesis still seems plausible. I tried forcing which variant of the crc32 function was in use and these are the results from the MacOS workers. The data is suggesting that the zlib implementation in use is more optimized than the libdeflate implementation. I don't know how else to explain the odd results.
Force default:

```
                    (mean ms)       (stddev)       (median ms)    (OPS)
test_zlib_crc32     3.0582 (1.0)    0.0668 (1.0)   3.0461 (1.0)   326.9900 (1.0)
test_deflate_crc32  6.0706 (1.99)   0.1012 (1.51)  6.0545 (1.99)  164.7278 (0.50)
```

Force pclmul:

```
test_zlib_crc32     3.0957 (1.0)    0.0938 (1.62)  3.0807 (1.0)   323.0259 (1.0)
test_deflate_crc32  3.2347 (1.04)   0.0578 (1.0)   3.2303 (1.05)  309.1519 (0.96)
```

Force pclmul_avx:

```
test_zlib_crc32     3.1103 (1.0)    0.2151 (1.34)  3.0530 (1.0)   321.5145 (1.0)
test_deflate_crc32  3.3284 (1.07)   0.1608 (1.0)   3.2771 (1.07)  300.4435 (0.93)
```

(Column labels are inferred from the pytest-benchmark output format; OPS is the inverse of the mean time.)
For Borg perhaps all we can do in the short term is to use the zlib crc32 implementation on all MacOS targets since it seems to be superior in all cases. libdeflate crc32 could be used everywhere else.
I fiddled with libdeflate a bit to try the different ARM CRC32 implementations on my M1. It seems like crc32_pmull is faster than crc32_arm, but both are slower than the zlib.crc32 implementation anyway. I also tried linking to a Homebrew-installed libdeflate, which zlib still beat handily. So I would agree with @braewoods that your best bet is to use zlib's implementation on macOS. I definitely don't think this wrapper should be in the business of trying to determine the fastest CRC32 implementation per machine.
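To take both Python wrappers out of the loop, one could also call an installed libdeflate directly via ctypes and cross-check it against zlib. A sketch: `libdeflate_crc32(crc, buffer, len)` is the real C API, but the library may simply not be installed, so the snippet degrades gracefully:

```python
# Sketch: call libdeflate_crc32 from a Homebrew/system libdeflate via
# ctypes and verify it matches zlib.crc32 on the same data.
import ctypes
import ctypes.util
import zlib

data = b"hello world"
libpath = ctypes.util.find_library("deflate")  # None if libdeflate isn't installed

if libpath:
    lib = ctypes.CDLL(libpath)
    # uint32_t libdeflate_crc32(uint32_t crc, const void *buffer, size_t len);
    lib.libdeflate_crc32.restype = ctypes.c_uint32
    lib.libdeflate_crc32.argtypes = [ctypes.c_uint32, ctypes.c_char_p, ctypes.c_size_t]
    assert lib.libdeflate_crc32(0, data, len(data)) == zlib.crc32(data)
    print("libdeflate and zlib agree")
else:
    print("libdeflate not found; skipping")
```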
Please try libdeflate v1.12. The performance of libdeflate_crc32() on Apple M1 has improved by about 8x and is significantly better than the Apple-provided zlib now.
@ebiggers cool, will re-benchmark it as soon as it lands in homebrew.
@ebiggers can confirm, great work! this benchmark is on macOS M1:

```
(borg-env) tw@mba2020 borg % borg benchmark cpu
Non-cryptographic checksums / hashes ===========================
crc32 (zlib, used)    1GB    0.055s
crc32 (libdeflate)    1GB    0.027s
xxh64                 1GB    0.122s
```
Note: I put the OPS (operations per second, as output by pytest-benchmark) here to have absolute numbers.
now it gets interesting:

- macOS M1: deflate.crc32 450 OPS vs. zlib.crc32 (python stdlib) 1760 OPS
- macOS Intel: deflate.crc32 220 OPS vs. zlib.crc32 (python stdlib) 250 OPS
- Linux x64: deflate.crc32 1420 OPS vs. zlib.crc32 (python stdlib) 95 OPS

so, on linux and x64, performance is as expected (libdeflate highly optimized, zlib not optimized).
on macOS Intel, it seems that CPU acceleration is not used and not much sw optimization either (by both libdeflate and zlib).
on macOS M1 (Apple Silicon), performance isn't as good as expected. here it seems zlib is optimized well and libdeflate only a little.
is there something wrong with how the deflate code is built?