[Closed] neurolabusc closed this issue 4 years ago
Thanks for the detailed proposal!
Here are results for a dual-socket Cavium ThunderX CN8890 (2 × 48 cores) and a single-socket Ampere eMAG 8180 (32 cores). The CloudFlare and zlib-ng forks already include ARM SIMD support. Earlier work with CloudFlare on ARM is described here. I updated the CMake scripts to build for ARM. At the moment, the CMake files for CloudFlare assume SIMD CRC. We could hook in the Chrome detection, but I am starting to wonder whether many people would realistically target ARM CPUs that lack these instructions.
The older Cavium has slow SIMD, while the Ampere is pretty competitive:
#Cavium ThunderX CN8890, 2 GHz
gcc -std=c11 -O3 -o tst crc32.c zutil.c test.c -march=armv8-a+crc
./tst
Conversion required 1.610013 seconds.
CRC= 1680726628
#Without crc intrinsics:
Conversion required 2.686161 seconds.
CRC= 1680726628
#Ampere eMAG 8180, 32 cores at 3 GHz
gcc -std=c11 -O3 -o tst crc32.c zutil.c test.c -march=armv8-a+crc
./tst
Conversion required 0.167049 seconds.
CRC= 1680726628
#Without crc32d intrinsics:
Conversion required 1.208400 seconds.
CRC= 1680726628
One challenge with these graphs is that they show acceleration with more cores relative to a single-core baseline. Looking just at single-core performance for CloudFlare at the default compression level (6), we see the following times:
CPU | Time (s) |
---|---|
Intel Xeon Platinum 8260 | 33.241 |
AWS Graviton | 91.725 |
Ampere eMAG 8180 | 98.152 |
Cavium ThunderX CN8890 | 232.170 |
@neurolabusc that is pretty interesting (and thanks for sharing the charts).
To clarify, you are profiling the time spent doing compression?
I must assume that decompression is not important within your use case?
@Adenilson the design of the gz format does not lend itself to parallel decompression, so when testing multiple cores I focused on compression times. At typical compression levels (e.g. 6), gzip is much slower to compress than to decompress, and a modern format like zstd is MUCH faster at decompression than gzip. Obviously, for situations like sharing images on the web, a file might be compressed just once on upload and downloaded many times; in that case, using an extremely slow but dense compressor such as Brotli makes sense. In my field (neuroimaging), we read GZ NIfTI images at the beginning of each stage of processing and write a GZ NIfTI at the end of each stage (e.g. FSL, AFNI). Likewise, my dcm2niix project typically reads raw DICOM data and writes GZ data. Since we have a one-to-one read/write ratio, GZ is much slower to write than to read, and GZ writing can be parallelized, compression was the focus of this exercise.
Feel free to extend pigz-bench to look at decompression. The 4verify.sh script could easily be extended to measure speed; the current design only ensures accuracy.
@Adenilson I have added a decompression benchmark, 5decompress.sh. To use it on ARM, one needs to comment out the lines in 1compile.sh that build the Intel zlib (which does not support ARM). This benchmark attempts to address criticism from @sebpop regarding testing decompression. Specifically, each method tested contributes compressed files (at different compression levels) to the compressed corpus, so all methods are benchmarked decompressing the same files. On x86-64, zlib-ng does really well, reflecting its focus on accelerating decompression. In contrast, CloudFlare developed its fork of zlib for its use case of compressing data on the server side.
I am puzzled by the eMAG performance, in particular because the gzip provided by the system outperforms pigz compiled against the system zlib. I would have thought the latter is always faster, as it can compute the CRC on a separate thread.
Method | i5-8259U | eMAG 8180 | CN8890 |
---|---|---|---|
pigz-CloudFlare | 1675 | 4424 | 2944 |
pigz-ng | 1335 | 5149 | 3378 |
pigz-System | 1889 | 5576 | 3506 |
gzip | 1925 | 3496 | 6672 |
For x86-64, I think @gdevenyi may appreciate this, as he has been an advocate of zlib-ng while I have preferred CloudFlare. I guess it depends on the metric you use. The corpus is 343281792 bytes uncompressed, underscoring that the popular but old gzip format has poor decompression performance relative to the modern zstd.
As an aside, I also tested my neuroimaging tools on these systems by computing a 3D Gaussian blur on a huge 4D dataset (niimath rest -s 2.548 out). Like many computations in my field, this test shows only modest benefits from OpenMP due to memory and disk I/O demands. It is I/O- and FP-heavy, outside the traditional strengths of ARM chips (though future generations promise big changes):
CPU | 1 Core | All Cores |
---|---|---|
Ryzen 3900X | 29286 | 12424 |
eMAG 8180 | 87640 | 30700 |
CN8890 | 675600 | 54120 |
> For x86-64, I think @gdevenyi may appreciate this as he has been an advocate of zlib-ng while I have preferred CloudFlare. I guess it depends on the metric you use.
Interesting results! From my perspective, my advocacy for zlib-ng has been driven by their open, active GitHub community, their attention to cleaning up both the code and the build system, and their testing on a huge number of hardware platforms. I expect they will eventually integrate the performance enhancements from the various forks and reach parity, while also having a modern codebase and build system.
@neurolabusc -
I see that you have two systems that are reserved but turned off. In the interests of maximizing use, I'd like to either reclaim them (and return them to other people as needed) or reserve them for you (to make sure you have them in the future).
Either way is fine - or just one system would be fine - but the "powered down" state is the least useful, and I want to make sure you have the resources you need in case e.g. https://github.com/madler/pigz/pull/77 takes a long time to resolve.
Let me know your thoughts.
@vielmetti thanks for setting this up. I would suggest you reclaim them. I detected a problem with zlib-ng, and a pull request to fix it was submitted 3 weeks ago. I had originally hoped to validate that all tests pass, but I am not sure how quickly pull requests are incorporated into that library.
Regardless, I now have a simple script for validating the outcome, so you or someone else with access to an ARM system could rapidly test this once the patch is submitted.
@neurolabusc Thank you! Instances have been reclaimed, and we'll look forward to any published results and ultimately to a merged PR.
Daniel Lemire and Chris Rorden (email in avatar)
Improved zlib/pigz on ARM
This is not a novel solution, but it could have a lot of impact. While there are more modern compression formats like zstd, the gz files created by zlib (among other libraries) remain popular. In my own field, gz is the basis for compression of the NIfTI format; more widely, the same deflate compression underlies the PNG image format. Compression can be run in parallel by tools like pigz. However, gz files require a CRC that must be calculated serially, and by Amdahl's law this becomes a rate-limiting factor. The CloudFlare zlib uses SIMD instructions on x86-64 computers to accelerate the CRC (along with other tweaks) to dramatically boost performance.
This would benefit any developer using zlib for compression and any user of gzip.
The code we are working on is 100% open source
What infrastructure (computing resources and network access) do you need?
Different ARM clusters to test CPU detection.
Describe the continuous integration (CI) system in use or desired for this project.
See dcm2niix for Travis/AppVeyor compilation and validation.
Contributions to the open source community and any other relevant initiatives