Closed mayeut closed 7 years ago
The decoding loop now uses a composite 128-byte LUT (two 64-byte LUTs).
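To illustrate the idea, here is a minimal scalar sketch of how two 64-byte tables can act as one 128-byte table. It is not the code from this PR. It assumes the NEON64 loop pairs a TBL-style lookup (out-of-range indices yield 0) with a TBX-style lookup (out-of-range indices leave the destination unchanged), and it emulates both one byte at a time. All names (`lut1`, `lut2`, `tbl`, `tbx`, `decode_char`) are hypothetical.

```c
#include <stdint.h>

#define INVALID 255u

/* Hypothetical tables: lut1 covers input bytes 0..63; lut2 covers input
 * bytes 64..126 at indices 1..63. lut2[0] is 0 so that it contributes
 * nothing when the value comes from lut1. */
static uint8_t lut1[64];
static uint8_t lut2[64];
static int luts_ready = 0;

static void build_luts(void)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    for (int i = 0; i < 64; i++) {
        lut1[i] = INVALID;
        lut2[i] = INVALID;
    }
    lut2[0] = 0;
    for (int v = 0; v < 64; v++) {
        uint8_t c = (uint8_t)alphabet[v];
        if (c < 64)
            lut1[c] = (uint8_t)v;       /* '+', '/', '0'..'9' */
        else
            lut2[c - 63] = (uint8_t)v;  /* 'A'..'Z', 'a'..'z' */
    }
    luts_ready = 1;
}

/* Scalar stand-in for a TBL lookup: out-of-range index gives 0. */
static uint8_t tbl(const uint8_t *table, uint8_t idx)
{
    return idx < 64 ? table[idx] : 0;
}

/* Scalar stand-in for a TBX lookup: out-of-range index keeps fallback. */
static uint8_t tbx(uint8_t fallback, const uint8_t *table, uint8_t idx)
{
    return idx < 64 ? table[idx] : fallback;
}

/* Returns the 6-bit value for a valid Base64 character, or a value
 * greater than 63 for anything else. */
static uint8_t decode_char(uint8_t c)
{
    if (!luts_ready)
        build_luts();
    /* Saturating subtract: bytes 0..63 map to index 0 of lut2. */
    uint8_t idx2 = (uint8_t)(c > 63 ? c - 63 : 0);
    uint8_t v1 = tbl(lut1, c);          /* 0 when c >= 64 */
    uint8_t v2 = tbx(idx2, lut2, idx2); /* keeps idx2 (>= 64) when c >= 127 */
    return (uint8_t)(v1 | v2);
}
```

In this sketch a single "greater than 63" comparison on the combined result flags every invalid byte, including bytes 127..255, which fall outside both tables and keep their out-of-range index as the result.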
Speed-up on iPhone SE using Apple LLVM version 8.0.0 (clang-800.0.38):
NEON64 decode: +130% compared to the previous version
Full results before and after the modifications follow. Before:
Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 100 * 1
NEON64 encode 4098.43 MB/sec
NEON64 decode 1726.01 MB/sec
plain encode 1060.69 MB/sec
plain decode 1236.70 MB/sec
Testing with buffer size 1 MB, fastest of 100 * 10
NEON64 encode 6550.93 MB/sec
NEON64 decode 1744.40 MB/sec
plain encode 1062.92 MB/sec
plain decode 1235.20 MB/sec
Testing with buffer size 100 KB, fastest of 100 * 100
NEON64 encode 6537.48 MB/sec
NEON64 decode 1733.39 MB/sec
plain encode 1063.61 MB/sec
plain decode 1233.14 MB/sec
Testing with buffer size 10 KB, fastest of 1000 * 100
NEON64 encode 6499.58 MB/sec
NEON64 decode 1751.95 MB/sec
plain encode 1080.12 MB/sec
plain decode 1237.07 MB/sec
Testing with buffer size 1 KB, fastest of 1000 * 1000
NEON64 encode 5840.41 MB/sec
NEON64 decode 1700.96 MB/sec
plain encode 1059.08 MB/sec
plain decode 1206.38 MB/sec
After:
Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 100 * 1
NEON64 encode 4102.50 MB/sec
NEON64 decode 3984.66 MB/sec
plain encode 1061.91 MB/sec
plain decode 1234.36 MB/sec
Testing with buffer size 1 MB, fastest of 10 * 100
NEON64 encode 6508.42 MB/sec
NEON64 decode 3984.97 MB/sec
plain encode 1060.76 MB/sec
plain decode 1231.69 MB/sec
Testing with buffer size 100 KB, fastest of 100 * 100
NEON64 encode 6534.92 MB/sec
NEON64 decode 3989.63 MB/sec
plain encode 1062.90 MB/sec
plain decode 1233.14 MB/sec
Testing with buffer size 10 KB, fastest of 1000 * 100
NEON64 encode 6476.27 MB/sec
NEON64 decode 3968.43 MB/sec
plain encode 1082.06 MB/sec
plain decode 1238.31 MB/sec
Testing with buffer size 1 KB, fastest of 1000 * 1000
NEON64 encode 5840.41 MB/sec
NEON64 decode 3923.26 MB/sec
plain encode 1065.29 MB/sec
plain decode 1220.13 MB/sec
This is really clever, good work! I'll review it further tomorrow or over the weekend.
Thanks, merged. I rebased the commit onto master so that I could do a fast-forward merge.