dcwatson / deflate

Python extension wrapper for libdeflate.
MIT License

performance issues #23

Closed ThomasWaldmann closed 1 year ago

ThomasWaldmann commented 2 years ago

Performance on Linux seems good, on macOS (x64, Intel) mediocre, and on macOS (M1, Apple Silicon) the worst.

See #21 for details.

TODO: move the insights from there into issues (I guess the best place is not here, but libdeflate's issue tracker).

ghost commented 2 years ago

Can you try forcing the ARM cpu_features.h file to be enabled for your M1 environment? I wonder if that would change anything.

ghost commented 2 years ago

Since I lack access to any macOS M1 environment, I can't test anything like this myself.

ThomasWaldmann commented 2 years ago

That wouldn't help, the code there is Linux-specific. But I guess it is best to continue in libdeflate's issue tracker and point them to the zlib-ng code I found.

ghost commented 2 years ago

Ok. I can't say I'm surprised. For a long time Linux was the only serious ARM target of note.

ThomasWaldmann commented 2 years ago

I was curious about how borgbackup's currently bundled crc32 code performs on macOS 12 with an M1 CPU (again on my local machine):

Name (time in us)                Mean              StdDev                Median                   OPS          
---------------------------------------------------------------------------------------------------------------
test_zlib_crc32              560.6912 (1.0)       14.5614 (1.0)        563.9375 (1.0)      1,783.5129 (1.0)    
test_borg_crc32_slice8     7,326.7399 (13.07)    117.9650 (8.10)     7,324.9590 (12.99)      136.4864 (0.08)   

have_clmul is False, thus borg_crc32_clmul is not available (only implemented on x64 within the code currently bundled into borg).

ThomasWaldmann commented 2 years ago

Benchmarks done on GitHub CI (Linux, x64):

Name (time in us)                Mean              StdDev                Median                   OPS          
---------------------------------------------------------------------------------------------------------------
test_borg_crc32_clmul        515.9855 (1.0)       19.2178 (1.0)        520.4060 (1.0)      1,938.0391 (1.0)    
test_borg_crc32_slice8     3,958.2522 (7.67)      84.5450 (4.40)     3,973.1480 (7.63)       252.6368 (0.13)   
test_zlib_crc32            7,500.5678 (14.54)    116.3165 (6.05)     7,520.1550 (14.45)      133.3232 (0.07)

Benchmarks done on GitHub CI (macOS, x64):

Name (time in ms)            Mean            StdDev            Median                 OPS          
---------------------------------------------------------------------------------------------------
test_zlib_crc32            3.1880 (1.0)      0.39 (5.85)       3.0442 (1.0)      313.6777 (1.0)    
test_borg_crc32_slice8     4.6606 (1.46)     0.0656 (1.0)      4.6442 (1.53)     214.5655 (0.68)

ThomasWaldmann commented 2 years ago

Code: https://github.com/borgbackup/borg/pull/6387 - it would also benchmark deflate.crc32 as soon as that is in a PyPI release.
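For anyone who wants to try a comparison like this locally, here is a rough sketch. It is not the benchmark code from the PR above. It times `zlib.crc32` from the standard library and, only if a `deflate` module is importable and exposes `crc32`, that as well. The helper name `bench_crc32`, the buffer contents, and the repeat counts are my own choices for illustration.

```python
import timeit
import zlib

try:
    import deflate  # optional; assumed (not verified here) to expose crc32(data)
except ImportError:
    deflate = None


def bench_crc32(size=1_000_000, repeat=5, number=10):
    """Return {name: best seconds per call} for the crc32 functions found."""
    # Roughly `size` bytes of repeating, non-trivial test data.
    data = bytes(range(256)) * max(1, size // 256)

    candidates = {"zlib.crc32": zlib.crc32}
    if deflate is not None and hasattr(deflate, "crc32"):
        candidates["deflate.crc32"] = deflate.crc32

    results = {}
    for name, func in candidates.items():
        # Sanity check: every implementation must agree with zlib's result.
        assert func(data) == zlib.crc32(data), name
        timer = timeit.Timer(lambda f=func: f(data))
        # Take the best of several runs to reduce noise from other processes.
        results[name] = min(timer.repeat(repeat=repeat, number=number)) / number
    return results


if __name__ == "__main__":
    for name, seconds in bench_crc32().items():
        print(f"{name:15s} {seconds * 1e6:10.1f} us per call")
```

Absolute numbers from a sketch like this will not match the pytest-benchmark tables in this thread; only the ratio between implementations on the same machine is meaningful.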

ghost commented 2 years ago

It makes me wonder how libdeflate would fare against zlib-ng. That might explain why Python on macOS is so different: whichever Python build is in active use may be linked against zlib-ng instead of regular zlib. If so, should we just import the zlib-ng code, since it may be doing better than libdeflate?

ThomasWaldmann commented 2 years ago

Yeah, zlib-ng is definitely also worth testing (but maybe a little bit off-topic here).

ThomasWaldmann commented 2 years ago

Updated performance results using libdeflate 1.12 on macOS M1:

(borg-env) tw@mba2020 borg % borg benchmark cpu
Non-cryptographic checksums / hashes ===========================
crc32 (zlib, used)       1GB        0.055s
crc32 (libdeflate)       1GB        0.027s
xxh64                    1GB        0.122s

Great update: it used to be slower, but libdeflate 1.12 is now about twice as fast as zlib's crc32 on macOS M1!
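To make the comparison explicit, the arithmetic behind "twice as fast" can be sketched as below. The timings are copied from the `borg benchmark cpu` output above; the function names are only illustrative, and "GB" is taken at face value from that output.

```python
# Seconds needed to checksum 1GB, copied from the benchmark output above.
timings = {
    "crc32 (zlib)": 0.055,
    "crc32 (libdeflate)": 0.027,
    "xxh64": 0.122,
}


def throughput_gb_per_s(seconds, size_gb=1.0):
    """Data processed per second for a run over `size_gb` of input."""
    return size_gb / seconds


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


for name, seconds in timings.items():
    print(f"{name:20s} {throughput_gb_per_s(seconds):6.1f} GB/s")

ratio = speedup(timings["crc32 (zlib)"], timings["crc32 (libdeflate)"])
print(f"libdeflate vs zlib crc32: {ratio:.2f}x")
```

This gives roughly 18 GB/s for zlib and 37 GB/s for libdeflate, a speedup of about 2.04x.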

ThomasWaldmann commented 1 year ago

I guess this is solved by the new libdeflate.