ebiggers / libdeflate

Heavily optimized library for DEFLATE/zlib/gzip compression and decompression
MIT License

Unaligned access on PowerPC: forgotten macros in `common_defs.h` #362

Open barracuda156 opened 3 months ago

barracuda156 commented 3 months ago

The code here defines UNALIGNED_ACCESS_IS_FAST for ppc64 on Linux/BSD: https://github.com/ebiggers/libdeflate/blob/275aa5141db6eda3587214e0f1d3a134768f557d/common_defs.h#L402-L419 But it does not define it for ppc (on any OS) or for ppc64 on Darwin.

Is this intentional or an omission? If this should apply to big-endian PowerPC, then at least __POWERPC__ (Darwin, both 32- and 64-bit) needs to be added there, and probably __powerpc__ (32-bit Linux?) as well.

ebiggers commented 3 months ago

I mainly focus on x86 and arm; the PowerPC support isn't something I've really worked on yet. From searching around it sounds like all PowerPC processors support unaligned memory accesses. Do you agree? I'm not familiar with the differences between the various PowerPC macros. I'm surprised that __powerpc64__ is OS-specific, as normally for these architecture macros it's the compiler that matters, not the operating system. In any case, if you think the code should be checking for something else, please go ahead and open a pull request with your proposed change, with an explanation.

barracuda156 commented 3 months ago

@ebiggers Thank you for responding!

I mainly focus on x86 and arm; the PowerPC support isn't something I've really worked on yet. From searching around it sounds like all PowerPC processors support unaligned memory accesses. Do you agree?

AFAIK, yes, they do, including G4/G5. (I am not a hardware expert, but I can look for documentation on this.)

I'm not familiar with the differences between the various PowerPC macros. I'm surprised that __powerpc64__ is OS-specific, as normally for these architecture macros it's the compiler that matters, not the operating system.

PowerPC has OS-specific macros. I am not sure exactly which ones are used on Linux, BSD, and AIX (though I believe AIX at least has its own unique macro), but I know for sure what Darwin uses (and, well, BeOS). They may also be compiler-specific, oddly enough.

Adding __POWERPC__ (in caps) will cover all of Darwin (both 32- and 64-bit ABIs), as well as BeOS. I have seen that some versions of Clang define it on BSD too, though I cannot confirm that; but as far as general detection goes, no extra condition is needed there. However, for 32-bit Linux/BSD I think another macro (__powerpc__) is needed?

AIX, apparently, uses __PPC: https://reviews.llvm.org/D108917

Apple code used a whole bunch of macros for PowerPC detection: https://opensource.apple.com/source/WTF/WTF-7601.1.46.42/wtf/Platform.h.auto.html

In any case, if you think the code should be checking for something else, please go ahead and open a pull request with your proposed change, with an explanation.

Thank you, I can do that.

P.S. If you add optimized code for PowerPC, we will need to detect support for specific ISA features there. Specifically, more recent POWER CPUs have VSX support, while the G4 and G5 have AltiVec only, and some early POWER CPUs and the G3 have neither. Basic info can be checked here: https://en.wikipedia.org/wiki/Power_ISA (the specifics are contained in IBM's specification documents, which are publicly available).

barracuda156 commented 3 months ago

By the way, how can I test whether adding unaligned-access support for PowerPC actually works here, or at least does not break anything or introduce unwanted effects (assuming the build itself succeeds)?

ebiggers commented 3 months ago

By the way, how can I test whether adding unaligned-access support for PowerPC actually works here, or at least does not break anything or introduce unwanted effects (assuming the build itself succeeds)?

You could run the benchmark program to compare compression and decompression performance before and after.

If you want you can run the tests too, using ctest. But the code does not rely on UNALIGNED_ACCESS_IS_FAST being correct for correctness anyway; it just affects performance.

barracuda156 commented 3 months ago

Sure, will do it now.

UPD. How long is the benchmark supposed to take, by the way? It seems to do nothing: the CPU is not loaded, and the process simply hangs like this:

svacchanda@Sergeys-MacBook-Air ~ % /opt/local/var/macports/build/_opt_svacchanda_SonomaPorts_archivers_libdeflate/libdeflate/work/build/programs/benchmark -0
Benchmarking DEFLATE compression:
    Compression level: 0
    Chunk size: 1048576
    Compression engine: libdeflate
    Decompression engine: libdeflate
Processing standard input...

Here it is with level 0; initially it was run with the default 6, but the behavior is the same. (This is on an M1, where I expected it to be blazing fast, but it is the same on PowerPC too.)

barracuda156 commented 3 months ago

FWIW, here are tests on PowerPC. There is no meaningful difference, and the time between runs with the same code can differ more than between some runs with different code (random variation obscures any deterministic difference).

  1. Without patch for unaligned access:
    
    Running tests...
    /opt/local/bin/ctest --force-new-ctest-process 
    Test project /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_archivers_libdeflate/libdeflate/work/build
    Start 1: test_checksums
    1/8 Test #1: test_checksums ...................   Passed    0.12 sec
    Start 2: test_custom_malloc
    2/8 Test #2: test_custom_malloc ...............   Passed    0.01 sec
    Start 3: test_incomplete_codes
    3/8 Test #3: test_incomplete_codes ............   Passed    0.01 sec
    Start 4: test_invalid_streams
    4/8 Test #4: test_invalid_streams .............   Passed    0.01 sec
    Start 5: test_litrunlen_overflow
    5/8 Test #5: test_litrunlen_overflow ..........   Passed    0.04 sec
    Start 6: test_overread
    6/8 Test #6: test_overread ....................   Passed    0.01 sec
    Start 7: test_slow_decompression
    7/8 Test #7: test_slow_decompression ..........   Passed    0.01 sec
    Start 8: test_trailing_bytes
    8/8 Test #8: test_trailing_bytes ..............   Passed    0.01 sec

100% tests passed, 0 tests failed out of 8

Total Test time (real) = 0.24 sec


2. With a patch:

Test project /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_archivers_libdeflate/libdeflate/work/build
    Start 1: test_checksums
1/8 Test #1: test_checksums ...................   Passed    0.13 sec
    Start 2: test_custom_malloc
2/8 Test #2: test_custom_malloc ...............   Passed    0.01 sec
    Start 3: test_incomplete_codes
3/8 Test #3: test_incomplete_codes ............   Passed    0.01 sec
    Start 4: test_invalid_streams
4/8 Test #4: test_invalid_streams .............   Passed    0.01 sec
    Start 5: test_litrunlen_overflow
5/8 Test #5: test_litrunlen_overflow ..........   Passed    0.05 sec
    Start 6: test_overread
6/8 Test #6: test_overread ....................   Passed    0.01 sec
    Start 7: test_slow_decompression
7/8 Test #7: test_slow_decompression ..........   Passed    0.01 sec
    Start 8: test_trailing_bytes
8/8 Test #8: test_trailing_bytes ..............   Passed    0.01 sec

100% tests passed, 0 tests failed out of 8

Total Test time (real) = 0.25 sec

But another run with the same patch:

Running tests...
/opt/local/bin/ctest --force-new-ctest-process
Test project /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_archivers_libdeflate/libdeflate/work/build
    Start 1: test_checksums
1/8 Test #1: test_checksums ...................   Passed    0.11 sec
    Start 2: test_custom_malloc
2/8 Test #2: test_custom_malloc ...............   Passed    0.01 sec
    Start 3: test_incomplete_codes
3/8 Test #3: test_incomplete_codes ............   Passed    0.01 sec
    Start 4: test_invalid_streams
4/8 Test #4: test_invalid_streams .............   Passed    0.01 sec
    Start 5: test_litrunlen_overflow
5/8 Test #5: test_litrunlen_overflow ..........   Passed    0.05 sec
    Start 6: test_overread
6/8 Test #6: test_overread ....................   Passed    0.01 sec
    Start 7: test_slow_decompression
7/8 Test #7: test_slow_decompression ..........   Passed    0.01 sec
    Start 8: test_trailing_bytes
8/8 Test #8: test_trailing_bytes ..............   Passed    0.01 sec

100% tests passed, 0 tests failed out of 8

Total Test time (real) = 0.23 sec


`benchmark` does not seem to run for me (not just on PowerPC, but also on aarch64), or I am doing something wrong with it.

barracuda156 commented 3 months ago

Somewhat ironically, the tests run much slower on the M1:

--->  Testing libdeflate
Executing:  cd "/opt/local/var/macports/build/_opt_svacchanda_SonomaPorts_archivers_libdeflate/libdeflate/work/build" && /usr/bin/make test 
Running tests...
/opt/local/bin/ctest --force-new-ctest-process 
Test project /opt/local/var/macports/build/_opt_svacchanda_SonomaPorts_archivers_libdeflate/libdeflate/work/build
    Start 1: test_checksums
1/8 Test #1: test_checksums ...................   Passed    0.46 sec
    Start 2: test_custom_malloc
2/8 Test #2: test_custom_malloc ...............   Passed    0.22 sec
    Start 3: test_incomplete_codes
3/8 Test #3: test_incomplete_codes ............   Passed    0.22 sec
    Start 4: test_invalid_streams
4/8 Test #4: test_invalid_streams .............   Passed    0.22 sec
    Start 5: test_litrunlen_overflow
5/8 Test #5: test_litrunlen_overflow ..........   Passed    0.23 sec
    Start 6: test_overread
6/8 Test #6: test_overread ....................   Passed    0.22 sec
    Start 7: test_slow_decompression
7/8 Test #7: test_slow_decompression ..........   Passed    0.23 sec
    Start 8: test_trailing_bytes
8/8 Test #8: test_trailing_bytes ..............   Passed    0.23 sec

100% tests passed, 0 tests failed out of 8

Total Test time (real) =   2.02 sec

Or do they just skip test cases on PowerPC, making the result misleading in that sense?

barracuda156 commented 3 months ago

@ebiggers Going by documentation,

  1. There is unaligned access support for 32-bit data. According to IBM:

    The PowerPC takes a hybrid approach. Every PowerPC processor to date has hardware support for unaligned 32-bit integer access. While you still pay a performance penalty for unaligned access, it tends to be small. On the other hand, modern PowerPC processors lack hardware support for unaligned 64-bit floating-point access. When asked to load an unaligned floating-point number from memory, modern PowerPC processors will throw an exception and have the operating system perform the alignment chores in software. Performing alignment in software is much slower than performing it in hardware.

    https://developer.ibm.com/articles/pa-dalign/

  2. There is no unaligned-access support for vector instructions: https://math-atlas.sourceforge.net/devel/assembly/vector_simd_pem.ppc.2005AUG23.pdf

ebiggers commented 3 months ago

FWIW, here are tests on PowerPC. There is no meaningful difference, and the time between runs with the same code can differ more than between some runs with different code (random variation obscures any deterministic difference).

The ctest suite (which is separate from the benchmark program) only tests for correctness, not performance.

benchmark does not seem to run for me (not just on PowerPC, but also on aarch64), or I am doing something wrong with it.

Perhaps you invoked it with no arguments, causing it to read its input file from standard input?

barracuda156 commented 3 months ago

@ebiggers Is there documentation on how to test it with the benchmark?

ebiggers commented 3 months ago

No. It's pretty straightforward, though; just build the benchmark program (-DLIBDEFLATE_BUILD_TESTS needs to be passed to cmake) and pass it some large-ish files. For what you're trying to test, the default options should be fine, but it's also possible to run ./build/programs/benchmark --help to see the different options that are available, e.g. setting the compression level.
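Putting the steps above together, a typical session might look like the following (paths and the input file name are examples only; adjust to your checkout):

```shell
# Configure with the test/benchmark programs enabled, then build.
cmake -B build -DLIBDEFLATE_BUILD_TESTS=1
cmake --build build

# See the available options (e.g. compression level).
./build/programs/benchmark --help

# Pass a large-ish file as an argument; with no file argument the
# program reads from standard input, which can look like a hang.
./build/programs/benchmark some-large-file
```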