Open barracuda156 opened 3 months ago
I mainly focus on x86 and arm; the PowerPC support isn't something I've really worked on yet. From searching around it sounds like all PowerPC processors support unaligned memory accesses. Do you agree? I'm not familiar with the differences between the various PowerPC macros. I'm surprised that `__powerpc64__` is OS-specific, as normally for these architecture macros it's the compiler that matters, not the operating system. In any case, if you think the code should be checking for something else, please go ahead and open a pull request with your proposed change, with an explanation.
@ebiggers Thank you for responding!
> I mainly focus on x86 and arm; the PowerPC support isn't something I've really worked on yet. From searching around it sounds like all PowerPC processors support unaligned memory accesses. Do you agree?
AFAIK, yes, they do, including G4/G5. (I am not a hardware expert, but I can look for documentation on this.)
> I'm not familiar with the differences between the various PowerPC macros. I'm surprised that `__powerpc64__` is OS-specific, as normally for these architecture macros it's the compiler that matters, not the operating system.
PowerPC has OS-specific macros. I am not too sure which exactly are used for Linux, BSD, and AIX (though I think AIX at least has its own unique ones), but I know for sure what Darwin uses (and, well, BeOS). They may also be compiler-specific, oddly.
Adding `__POWERPC__` (in caps) will cover all of Darwin (both the 32- and 64-bit ABI), as well as BeOS. I have seen that some versions of Clang define this on BSD too, though I cannot confirm that. But as far as general detection goes, no extra condition is needed.
However, for 32-bit Linux/BSD, I think another macro is needed (`__powerpc__`)?
AIX, apparently, uses `__PPC`: https://reviews.llvm.org/D108917
Apple code used a whole bunch of macros for PowerPC detection: https://opensource.apple.com/source/WTF/WTF-7601.1.46.42/wtf/Platform.h.auto.html
> In any case, if you think the code should be checking for something else, please go ahead and open a pull request with your proposed change, with an explanation.
Thank you, I can do that.
P.S. If you add optimized code for PowerPC, we will need to detect support for the specific ISA there. More recent POWER CPUs have VSX support, while G4 and G5 have AltiVec only, and some early POWER CPUs and the G3 have neither. Basic info can be checked here: https://en.wikipedia.org/wiki/Power_ISA (the specifics are in IBM's publicly available specification documents).
By the way, how could I test if adding unaligned access support for PowerPC actually works here or at least does not break anything or introduce any unwanted effects (assuming the build itself succeeds)?
> By the way, how could I test if adding unaligned access support for PowerPC actually works here or at least does not break anything or introduce any unwanted effects (assuming the build itself succeeds)?
You could run the `benchmark` program to compare compression and decompression performance before and after.
If you want, you can run the tests too, using `ctest`. But the code does not rely on `UNALIGNED_ACCESS_IS_FAST` being correct for correctness anyway; it just affects performance.
Sure, will do it now.
UPD. How long is running `benchmark` supposed to take, by the way? It seems to do nothing: the CPU is not loaded, and the process simply hangs like this:
```
svacchanda@Sergeys-MacBook-Air ~ % /opt/local/var/macports/build/_opt_svacchanda_SonomaPorts_archivers_libdeflate/libdeflate/work/build/programs/benchmark -0
Benchmarking DEFLATE compression:
	Compression level: 0
	Chunk size: 1048576
	Compression engine: libdeflate
	Decompression engine: libdeflate
Processing standard input...
```
Here it is with level 0; initially it was with the default 6, but the behavior is the same. (This is on an M1, where I expected it to be blazing fast. But it is the same on PowerPC too.)
FWIW, here are tests on PowerPC. There is no meaningful difference, and the time between runs with the same code can differ more than between some runs with different code (random variation obscures any deterministic difference).
1. Without the patch:

```
Running tests...
/opt/local/bin/ctest --force-new-ctest-process
Test project /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_archivers_libdeflate/libdeflate/work/build
    Start 1: test_checksums
1/8 Test #1: test_checksums ................... Passed 0.12 sec
    Start 2: test_custom_malloc
2/8 Test #2: test_custom_malloc ............... Passed 0.01 sec
    Start 3: test_incomplete_codes
3/8 Test #3: test_incomplete_codes ............ Passed 0.01 sec
    Start 4: test_invalid_streams
4/8 Test #4: test_invalid_streams ............. Passed 0.01 sec
    Start 5: test_litrunlen_overflow
5/8 Test #5: test_litrunlen_overflow .......... Passed 0.04 sec
    Start 6: test_overread
6/8 Test #6: test_overread .................... Passed 0.01 sec
    Start 7: test_slow_decompression
7/8 Test #7: test_slow_decompression .......... Passed 0.01 sec
    Start 8: test_trailing_bytes
8/8 Test #8: test_trailing_bytes .............. Passed 0.01 sec

100% tests passed, 0 tests failed out of 8

Total Test time (real) = 0.24 sec
```
2. With a patch:
```
Test project /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_archivers_libdeflate/libdeflate/work/build
    Start 1: test_checksums
1/8 Test #1: test_checksums ................... Passed 0.13 sec
    Start 2: test_custom_malloc
2/8 Test #2: test_custom_malloc ............... Passed 0.01 sec
    Start 3: test_incomplete_codes
3/8 Test #3: test_incomplete_codes ............ Passed 0.01 sec
    Start 4: test_invalid_streams
4/8 Test #4: test_invalid_streams ............. Passed 0.01 sec
    Start 5: test_litrunlen_overflow
5/8 Test #5: test_litrunlen_overflow .......... Passed 0.05 sec
    Start 6: test_overread
6/8 Test #6: test_overread .................... Passed 0.01 sec
    Start 7: test_slow_decompression
7/8 Test #7: test_slow_decompression .......... Passed 0.01 sec
    Start 8: test_trailing_bytes
8/8 Test #8: test_trailing_bytes .............. Passed 0.01 sec

100% tests passed, 0 tests failed out of 8

Total Test time (real) = 0.25 sec
```
But another run with the same patch:
```
Running tests...
/opt/local/bin/ctest --force-new-ctest-process
Test project /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_archivers_libdeflate/libdeflate/work/build
    Start 1: test_checksums
1/8 Test #1: test_checksums ................... Passed 0.11 sec
    Start 2: test_custom_malloc
2/8 Test #2: test_custom_malloc ............... Passed 0.01 sec
    Start 3: test_incomplete_codes
3/8 Test #3: test_incomplete_codes ............ Passed 0.01 sec
    Start 4: test_invalid_streams
4/8 Test #4: test_invalid_streams ............. Passed 0.01 sec
    Start 5: test_litrunlen_overflow
5/8 Test #5: test_litrunlen_overflow .......... Passed 0.05 sec
    Start 6: test_overread
6/8 Test #6: test_overread .................... Passed 0.01 sec
    Start 7: test_slow_decompression
7/8 Test #7: test_slow_decompression .......... Passed 0.01 sec
    Start 8: test_trailing_bytes
8/8 Test #8: test_trailing_bytes .............. Passed 0.01 sec

100% tests passed, 0 tests failed out of 8

Total Test time (real) = 0.23 sec
```
`benchmark` does not seem to run for me (not just on PowerPC, but also on aarch64), or I am doing something wrong with it.
Somewhat ironically, the tests run much slower on the M1:
```
--->  Testing libdeflate
Executing:  cd "/opt/local/var/macports/build/_opt_svacchanda_SonomaPorts_archivers_libdeflate/libdeflate/work/build" && /usr/bin/make test
Running tests...
/opt/local/bin/ctest --force-new-ctest-process
Test project /opt/local/var/macports/build/_opt_svacchanda_SonomaPorts_archivers_libdeflate/libdeflate/work/build
    Start 1: test_checksums
1/8 Test #1: test_checksums ................... Passed 0.46 sec
    Start 2: test_custom_malloc
2/8 Test #2: test_custom_malloc ............... Passed 0.22 sec
    Start 3: test_incomplete_codes
3/8 Test #3: test_incomplete_codes ............ Passed 0.22 sec
    Start 4: test_invalid_streams
4/8 Test #4: test_invalid_streams ............. Passed 0.22 sec
    Start 5: test_litrunlen_overflow
5/8 Test #5: test_litrunlen_overflow .......... Passed 0.23 sec
    Start 6: test_overread
6/8 Test #6: test_overread .................... Passed 0.22 sec
    Start 7: test_slow_decompression
7/8 Test #7: test_slow_decompression .......... Passed 0.23 sec
    Start 8: test_trailing_bytes
8/8 Test #8: test_trailing_bytes .............. Passed 0.23 sec

100% tests passed, 0 tests failed out of 8

Total Test time (real) = 2.02 sec
```
Or do the tests just skip some cases on PowerPC, making the result misleading in that sense?
@ebiggers Going by the documentation, there is unaligned access support for 32-bit data. According to IBM:

> The PowerPC takes a hybrid approach. Every PowerPC processor to date has hardware support for unaligned 32-bit integer access. While you still pay a performance penalty for unaligned access, it tends to be small. On the other hand, modern PowerPC processors lack hardware support for unaligned 64-bit floating-point access. When asked to load an unaligned floating-point number from memory, modern PowerPC processors will throw an exception and have the operating system perform the alignment chores in software. Performing alignment in software is much slower than performing it in hardware.

https://developer.ibm.com/articles/pa-dalign/
There is no support for unaligned access with vector instructions: https://math-atlas.sourceforge.net/devel/assembly/vector_simd_pem.ppc.2005AUG23.pdf
FWIW, here are tests on PowerPC. There is no meaningful difference, and time between runs with the same code can differ more that between some runs with different code (random variation obscures any deterministic difference).
The `ctest` suite (which is separate from the `benchmark` program) only tests for correctness, not performance.
> `benchmark` does not seem to run for me (not just on PowerPC, but on aarch64), or I do something wrong with it.
Perhaps you invoked it with no arguments, causing it to read its input file from standard input?
@ebiggers Is there documentation on how to test it with the benchmarks?
No. It's pretty straightforward, though; just build the benchmark program (`-DLIBDEFLATE_BUILD_TESTS` needs to be passed to cmake) and pass it some large-ish files. For what you're trying to test, the default options should be fine, but it's also possible to run `./build/programs/benchmark --help` to see the different options that are available, e.g. setting the compression level.
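In other words, a session along these lines should work (the build directory layout is the CMake default, and the input file name is just a placeholder):

```shell
# Build libdeflate with the test/benchmark programs enabled
cmake -B build -DLIBDEFLATE_BUILD_TESTS=1
cmake --build build

# Benchmark a large-ish file; don't omit the file argument,
# otherwise the program waits on standard input
./build/programs/benchmark /path/to/large-file

# List the available options, e.g. for setting the compression level
./build/programs/benchmark --help
```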
The code here defines `UNALIGNED_ACCESS_IS_FAST` for `ppc64` on Linux/BSD: https://github.com/ebiggers/libdeflate/blob/275aa5141db6eda3587214e0f1d3a134768f557d/common_defs.h#L402-L419

But it does not define it for `ppc` (any OS) or for `ppc64` on Darwin. Is this intended or a mere omission? If this should apply to big-endian PowerPC, then at least `__POWERPC__` (Darwin, both 32- and 64-bit) needs to be added there, and probably `__powerpc__` (Linux 32-bit?) as well.