WorksOnArm / equinix-metal-arm64-cluster

Arm and Equinix Metal have partnered to make powerful Neoverse-based Armv8 bare metal infrastructure, including latest-generation Ampere systems, available for open source software developers to build, test, and optimize for the Arm64 architecture.
http://www.worksonarm.com

Improved zlib/pigz on ARM #195

Closed neurolabusc closed 4 years ago

neurolabusc commented 4 years ago

Daniel Lemire and Chris Rorden (email in avatar)

Improved zlib/pigz on ARM

This is not a novel solution, but it could have a lot of impact. While there are more modern compression formats like zstd, the gz files created by zlib (among other libraries) remain popular. In my own field, gz is the basis for compression of the NIfTI format; more widely, the same DEFLATE compression underlies the PNG image format. Compression can be run in parallel by tools like pigz. However, gz files require a CRC that must be calculated serially, and due to Amdahl's law this becomes a rate-limiting factor. The Cloudflare zlib fork uses SIMD instructions on x86-64 computers to accelerate the CRC (along with other tweaks) to dramatically boost performance.

This would benefit any developers using zlib for compression and users using gzip.

The code we are working on is 100% open source.

  1. Minimal test project showing benefits and validating CPU feature detection.
  2. Enhance CloudFlare zlib, with the successful solution applied as a pull request to the main project.
  3. Update pigz CMake scripts to use the enhanced libraries, with the successful solution applied as a main-project pull request.
  4. Validate performance benefits.

What infrastructure (computing resources and network access) do you need?

Access to different ARM clusters to test CPU detection.

Describe the continuous integration (CI) system in use or desired for this project.

See dcm2niix for Travis/AppVeyor compilation and validation.

Contributions to the open source community and any other relevant initiatives

vielmetti commented 4 years ago

Thanks for the detailed proposal!

neurolabusc commented 4 years ago

Here are results for a dual-socket Cavium ThunderX CN8890 (2 × 48 cores) and a single-socket Ampere eMAG 8180 (32 cores). The CloudFlare fork and zlib-ng already include SIMD ARM support. Earlier work with CloudFlare on ARM is described here. I updated the CMake scripts to build for ARM. At the moment, the CMake files for CloudFlare assume SIMD CRC is available. We could hook in the Chrome detection, but I am starting to wonder whether many people would realistically target ARM CPUs that lack these instructions.

The older Cavium has slow SIMD, while the Ampere is pretty competitive:

```
# Cavium ThunderX CN8890, 2 GHz
gcc -std=c11 -O3 -o tst crc32.c zutil.c test.c -march=armv8-a+crc
./tst
Conversion required 1.610013 seconds.
CRC= 1680726628
# Without CRC intrinsics:
Conversion required 2.686161 seconds.
CRC= 1680726628

# Ampere eMAG 8180, 32 cores at 3 GHz
gcc -std=c11 -O3 -o tst crc32.c zutil.c test.c -march=armv8-a+crc
./tst
Conversion required 0.167049 seconds.
CRC= 1680726628
# Without crc32d intrinsics:
Conversion required 1.208400 seconds.
CRC= 1680726628
```

One challenge with these graphs is that they show acceleration with more cores relative to a single-core baseline. Looking just at single-core performance for CloudFlare at the default compression level (6), we see the following relative performance:

| CPU | Time |
| --- | --- |
| Intel Xeon Platinum 8260 | 33.241 s |
| AWS Graviton | 91.725 s |
| Ampere eMAG 8180 | 98.152 s |
| Cavium ThunderX CN8890 | 232.170 s |

[Charts: ThunderX_CN8890 and eMAG_8180 scaling plots]

showplot_ThunderX_CN8890.py.txt showplot_eMAG_8180.py.txt

Adenilson commented 4 years ago

@neurolabusc that is pretty interesting (and thanks for sharing the charts).

To clarify, you are profiling the time spent doing compression?

I assume that decompression is not important within your use case?

neurolabusc commented 4 years ago

@Adenilson the design of the gz format does not lend itself to parallel decompression, so for testing multiple cores I focused on compression times. At typical compression levels (e.g. 6), gzip is much slower to compress than to decompress, and a modern format like zstd is MUCH faster at decompression than gzip. Obviously, for situations like sharing images on the web, a file might be compressed just once on upload and downloaded many times; in that case, using an extremely slow, high-ratio compressor like Brotli makes sense. In my field (neuroimaging) we read GZ NIfTI images at the beginning of each stage of processing and write a GZ NIfTI at the end of each stage (e.g. FSL, AFNI). Likewise, my dcm2niix project typically reads raw DICOM data and writes GZ data. Therefore, the one-to-one read/write ratio, the fact that GZ is much slower to write, and the fact that GZ writing can be done in parallel made compression the focus of this exercise.

Feel free to extend the pigz-bench to look at decompression. The 4verify.sh script could easily be extended to check speed, the current design was just to ensure accuracy.

neurolabusc commented 4 years ago

@Adenilson I have added a decompression benchmark, 5decompress.sh. To use it on ARM, one needs to comment out the lines in 1compile.sh that build the Intel zlib (which does not support ARM). This benchmark attempts to address criticisms by @sebpop regarding testing decompression: each method tested contributes compressed files (at different compression levels) to the compressed corpus, so all methods are benchmarked decompressing the same files. On x86-64, zlib-ng does really well, reflecting its focus on accelerating decompression. In contrast, CloudFlare developed their fork of zlib for their own use case of compressing data on the server side.

I am puzzled by the eMAG performance, in particular that the gzip provided by the system outperforms pigz compiled against the system zlib. I would have thought the latter is always faster, as it can compute the CRC on a separate thread.

| Method | i5-8259U | eMAG 8180 | CN8890 |
| --- | --- | --- | --- |
| pigz-CloudFlare | 1675 | 4424 | 2944 |
| pigz-ng | 1335 | 5149 | 3378 |
| pigz-System | 1889 | 5576 | 3506 |
| gzip | 1925 | 3496 | 6672 |

For x86-64, I think @gdevenyi may appreciate this, as he has been an advocate of zlib-ng while I have preferred CloudFlare; I guess it depends on the metric you use. The corpus is 343,281,792 bytes uncompressed, emphasizing that the popular but old gzip format has poor decompression performance relative to the modern zstd.

As an aside, I also tested my neuroimaging tools on these systems, computing a 3D Gaussian blur on a huge 4D dataset (niimath rest -s 2.548 out). Like many computations in my field, this test shows only modest benefits from OpenMP due to memory and disk I/O demands. It is I/O- and floating-point-heavy, outside the traditional strengths of ARM chips (though future generations promise big changes):

| CPU | 1 Core | All Cores |
| --- | --- | --- |
| Ryzen 3900X | 29286 | 12424 |
| eMAG 8180 | 87640 | 30700 |
| CN8890 | 675600 | 54120 |
gdevenyi commented 4 years ago

> For x86-64, I think @gdevenyi may appreciate this as he has been an advocate of zlib-ng while I have preferred CloudFlare. I guess it depends on the metric you use.

Interesting results! From my perspective, my advocacy for zlib-ng has been due to their open, active GitHub community, their attention to cleaning up both the code and the build system, and their testing on a huge number of hardware platforms. I expect they will eventually integrate the performance enhancements from the various forks and reach parity, while keeping a modern codebase and build system as well.

vielmetti commented 4 years ago

@neurolabusc -

I see that you have two systems that are reserved but powered off. In the interest of maximizing use, I'd like to either reclaim them (and make them available to other people as needed) or reserve them for you (to make sure you have them in the future).

Either way is fine - or just one system would be fine - but the "powered down" state is the least useful, and I want to make sure you have the resources you need in case e.g. https://github.com/madler/pigz/pull/77 takes a long time to resolve.

Let me know your thoughts.

neurolabusc commented 4 years ago

@vielmetti thanks for setting this up. I would suggest you reclaim them. I had detected a problem with zlib-ng, and a pull request to fix it was submitted 3 weeks ago. I had originally hoped to validate that all tests pass, but I am not sure how quickly pull requests are incorporated for that library.

Regardless, I now have a simple script for validating the outcome, so you or someone else with access to an ARM system could rapidly test this once the patch is submitted.

vielmetti commented 4 years ago

@neurolabusc Thank you! Instances have been reclaimed, and we'll look forward to any published results and ultimately to a merged PR.