Open nemequ opened 8 years ago
I'll have to look into the extensions to see if they offer the needed functionality for a fallback. Otherwise, I might at some stage port to other platforms directly if they're suitable.
Based on my (very) brief look at the code I'm pretty confident they do, but to be clear the main point is that I would very much like an implementation that will run everywhere, even if it's slow; OpenMP 4 just (hopefully) provides a way to make it reasonably fast.
I just created a new LZSSE-SIMDe (friendly) fork which uses SIMDe to let LZSSE run where SSE4.1 isn't available.
SIMDe is still under heavy development; I haven't even started working on optimizing it, and I've only tested a few compilers (recent versions of GCC, clang, and PGI). It will probably be a while before I'm ready to make a PR, but I wanted to make you aware of the work.
Thanks for making me aware, will be interested to see the results.
I just noticed that LZSSE doesn't work on 32-bit, even if the CPU supports SSE 4.1; is that intentional? Just replacing the calls to _mm_cvtsi64_si128 would be enough to get it compiling (I haven't actually tested that, but it works fine in SIMDe, where we emulate that call, as well as some other 64-bit-specific functions, on 32-bit CPUs).
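For illustration, here is a minimal sketch of what such a replacement could look like. The helper name lzsse_cvtsi64_si128 is made up for this example and is not part of LZSSE or SIMDe; it assumes only SSE2, and it is not necessarily how SIMDe implements the emulation.

```c
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */

/* Hypothetical helper (not part of LZSSE): moves a 64-bit integer into the
 * low half of an XMM register and zeroes the upper half, without using
 * _mm_cvtsi64_si128, which is only available when targeting x86-64. */
static inline __m128i lzsse_cvtsi64_si128(int64_t value)
{
    /* _mm_loadl_epi64 loads 64 bits from memory and zero-extends them to
     * 128 bits; unlike _mm_cvtsi64_si128 it also exists in 32-bit builds. */
    return _mm_loadl_epi64((const __m128i *)&value);
}
```

On a 64-bit target the compiler will most likely turn this into the same single move instruction, so the portable form should not cost anything there, though I haven't measured it.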
The default block size would also need to be reduced, otherwise malloc will fail. It would probably also be a good idea to verify that bufferSize * sizeof(Arrival) doesn't overflow size_t…
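Something along these lines would do as a check. This is only a sketch: the Arrival struct below is a placeholder with made-up fields (LZSSE's real struct differs), and allocate_arrivals is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Placeholder for illustration only; LZSSE's real Arrival struct differs. */
typedef struct {
    size_t  cost;
    int32_t from;
    int32_t offset;
} Arrival;

/* Hypothetical helper: returns NULL rather than under-allocating when
 * bufferSize * sizeof(Arrival) would wrap around size_t. */
static Arrival *allocate_arrivals(size_t bufferSize)
{
    /* Divide instead of multiplying so the check itself cannot overflow. */
    if (bufferSize == 0 || bufferSize > SIZE_MAX / sizeof(Arrival)) {
        return NULL;
    }
    return (Arrival *)malloc(bufferSize * sizeof(Arrival));
}
```

The caller still has to handle a NULL return, since malloc can fail on a 32-bit system even when the multiplication itself does not overflow.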
FWIW, with a reduced block size LZSSE-SIMDe works on ARM (a Raspberry Pi 2).
Yes, it's intentional that it doesn't work on 32-bit; it was a conscious decision to exploit the wider/larger number of registers. It sounds like the SIMDe version is a good path to supporting 32-bit as well.
It's a good point about the default block size in the example and about verifying that we don't overflow size_t. At some stage I was considering using a more limited-size arrivals array and incremental output to reduce memory overhead (although I think the best use case of LZSSE's optimal parse is offline compression for fast decompression that will happen many times, and that would run slightly counter to it).
It's fantastic progress to get something working on ARM!
I kind of forgot about this for a while, but SIMDe has been plugging along, particularly lately. It should work better now, and be much faster on non-SSE4.1 CPUs. There are lots more NEON implementations now, plus quite a few AltiVec and WebAssembly implementations. The README also has some updated benchmarking figures which are mildly interesting.
The "native aliases" support has also improved to the point where I'm comfortable using it, which has reduced the diff to practically nothing, and with the new simde-no-tests the submodule is a much more reasonable size.
If you're interested in merging this into LZSSE I can submit a PR (or, of course you can simply pull from my repo). I kept the README patch separate since I'm guessing you wouldn't want that. If you don't like submodules we do also have an amalgamated header.
If you don't want to use SIMDe, the LZSSE-SIMDe repo is still around for people who need it. Either way, LZSSE has been a great test for SIMDe, so thanks :)
That sounds good, I'm excited to have a look. How hard would it be to keep SIMDe as an optional dependency so that it was only required for those platforms?
I must admit I haven't had a chance to look much at LZSSE for a while either, although I do have an idea for a new version.
That sounds good, I'm excited to have a look. How hard would it be to keep SIMDe as an optional dependency so that it was only required for those platforms?
Not hard. You could just use an ifdef with something like:
#if defined(LZSSE_USE_SIMDE)
#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/x86/sse4.1.h>
#else
#include <smmintrin.h>
#endif
It would likely require people to use -I to specify the include directory, unless they have simde installed system-wide (there is a simde-dev package on Debian, and a simde-devel on Fedora 33).
To be clear, the only real advantage here is you don't have to include a copy of SIMDe. There is no penalty for using SIMDe if you don't need it; it will just call the native functions so it doesn't make the code slower, just more portable.
I must admit I haven't had a chance to look much at LZSSE for a while either, although I do have an idea for a new version.
Nice. Hopefully you have some time to implement it soon :)
I would love to see a portable version instead of relying directly on SSE. Perhaps using OpenMP 4's SIMD extensions (http://primeurmagazine.com/repository/PrimeurMagazine-AE-PR-12-14-32.pdf is a decent introduction)?
Even if it's just a slow portable fallback for platforms without SSE, it could still be useful.
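To give a rough idea of the style I mean, here is a small sketch of an OpenMP 4 SIMD loop. It is not taken from LZSSE and does not correspond to any specific routine in it; count_equal_bytes is a made-up function that just shows the pragma on a byte-wise loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: counts the positions where a[i] == b[i].
 * With OpenMP SIMD support enabled the loop may be vectorized; without it
 * the pragma is ignored and this is an ordinary scalar loop. */
static size_t count_equal_bytes(const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t count = 0;
    #pragma omp simd reduction(+:count)
    for (size_t i = 0; i < n; ++i) {
        count += (size_t)(a[i] == b[i]);
    }
    return count;
}
```

With GCC or clang this can be built with -fopenmp-simd, which enables the SIMD pragmas without pulling in the OpenMP runtime. A compiler that doesn't understand the pragma still produces correct code, which is the "runs everywhere, even if it's slow" property.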