Closed by edgarcosta 4 weeks ago
In most cases, compiler hints should be enough to generate close-to-optimal code utilizing SIMD, say for nmod_vec_add
or something similar. The important part here is that one compiles FLINT with the correct compiler flags to generate this.
For these "trivial" cases, it is my belief that we do not want to generate code that resembles some other architecture. For instance, say you want to utilize some AVX512 instruction written for systems that have AVX512, but on the system at hand the register width is only 128 bits (such as ARM NEON); then you effectively unroll every instruction four times, which I think is something we would like to avoid.
For cases where an algorithm's speed is highly dependent on the use of specific instruction sets, throughput, latency, the number of ports in a CPU, or whatever else, I don't think such an abstraction layer is very useful.
Moreover, I believe it would just reduce the readability of the code, and it may also introduce unwanted loop unrolling on some architectures.
Thanks for the explanation!
I will close the issue, given that you are the one writing most of the lower-level code.
I wouldn't rule out any abstraction layers, but in that case I would consider writing our own, so that we know exactly what is going on under the hood. However, thanks for bringing this to my attention; I didn't know this existed.
machine_vectors.h
is a decent start, no?
Indeed.
Yesterday I was made aware of xsimd (C++ wrappers for SIMD intrinsics and parallelized, optimized mathematical functions: SSE, AVX, AVX512, NEON, SVE), which seems to be quite popular among C++ libraries.
While that one doesn't work for FLINT, I wonder if something like SIMD Everywhere would help with the maintainability of the lower-level instructions.
PS: I have not looked carefully at how our routines are written, but I'm aware that we already use some abstractions.