Reenable C backend for non-SIMD platforms

markos commented 1 year ago

This is rather important. Many platforms don't have SIMD at all. The original pre-Intel Hyperscan did include a C backend, and we plan to reenable it. This would also give a good indication of how much faster a SIMD port is on each platform.

mr-c commented 1 year ago

@markos Will you be using SIMDe for the C backend? Please let us know if we are missing any needed intrinsics and I'll try to fast track them.

markos commented 1 year ago

This is not decided yet. But it's a suggestion worth investigating. I will be looking at the whole SIMD approach soon, so SIMDe is an obvious choice. The biggest problem I've had with similar approaches is that they are too x86-centric and that is bad for every other platform (eg emulating movemasks on non-x86).
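
To make the movemask point concrete, here is a minimal plain-C sketch of what emulating an x86-style byte movemask involves on a platform without that instruction. The `mask_v128` type and both function names are hypothetical and exist only for this illustration; they are not SIMDe or vectorscan APIs.

```c
#include <stdint.h>

/* Hypothetical 128-bit vector stored as 16 plain bytes (illustration only). */
typedef struct { uint8_t b[16]; } mask_v128;

/* Byte-wise equality: 0xFF where the bytes match, 0x00 otherwise
 * (the same result shape as x86 _mm_cmpeq_epi8). */
static mask_v128 mask_cmpeq_u8(mask_v128 x, mask_v128 y) {
    mask_v128 r;
    for (int i = 0; i < 16; i++) {
        r.b[i] = (x.b[i] == y.b[i]) ? 0xFF : 0x00;
    }
    return r;
}

/* Scalar stand-in for x86 _mm_movemask_epi8: gather the top bit of each
 * byte into a 16-bit mask, with byte 0 mapped to bit 0. */
static uint32_t mask_movemask_u8(mask_v128 v) {
    uint32_t mask = 0;
    for (int i = 0; i < 16; i++) {
        mask |= (uint32_t)(v.b[i] >> 7) << i;
    }
    return mask;
}
```

On x86 the second function is a single instruction, while elsewhere it becomes a loop or a multi-instruction sequence, which is why an x86-centric abstraction can be costly on other platforms.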

markos commented 1 year ago

Furthermore, I should clarify that the C backend will be twofold: first a C backend for SIMD, but also reinstating pure scalar algorithms. Both have to be done.

markos commented 1 year ago

@mr-c I'd like to evaluate SIMDe as it's next on my to-do list. You mentioned it also has a C backend. Could you please point me to how to enable it? Also, one aspect that we would like to have is emulation of wider vectors, eg. emulate 256-bit/512-bit SIMD on NEON or VSX. Is that available?

mr-c commented 1 year ago

@markos Yes, SIMDe works with C/C++ codebases. See https://github.com/simd-everywhere/simde#usage. When I'm adapting a typical codebase that uses x86-64 SIMD for Debian, I use these notes: https://wiki.debian.org/SIMDEverywhere#Approach

SIMDe implementations of the 256-bit and 512-bit AVX/AVX2/AVX512 x86-64 intrinsics often (but not always) have optimized NEON and VSX versions that are selected automatically when compiling on those platforms.
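
As a rough illustration of what emulating a wider vector means, a 256-bit operation can be expressed as two 128-bit operations. This is only a sketch of the general idea, not SIMDe's actual implementation; the `wide_v128` and `wide_v256` types and the function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical 128-bit lane; on NEON or VSX this would be a native vector. */
typedef struct { uint8_t b[16]; } wide_v128;

/* Hypothetical 256-bit vector built from two 128-bit halves. */
typedef struct { wide_v128 lo, hi; } wide_v256;

/* 128-bit bitwise AND, written as a scalar loop for portability. */
static wide_v128 wide_and128(wide_v128 x, wide_v128 y) {
    wide_v128 r;
    for (int i = 0; i < 16; i++) {
        r.b[i] = x.b[i] & y.b[i];
    }
    return r;
}

/* 256-bit bitwise AND: apply the 128-bit operation to each half. */
static wide_v256 wide_and256(wide_v256 x, wide_v256 y) {
    wide_v256 r;
    r.lo = wide_and128(x.lo, y.lo);
    r.hi = wide_and128(x.hi, y.hi);
    return r;
}
```

Lane-wise operations like this split cleanly. Operations that move data across the 128-bit boundary need extra work, which is one reason an optimized version may not exist for every intrinsic.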

markos commented 1 year ago

I wasn't clear enough. I'm looking for a C backend in order to run SIMDe code on a non-SIMD platform, or on one that does not currently have SIMD support enabled. Essentially a way to emulate SIMD using plain C. This would enable vectorscan to run on platforms without current SIMD support.

Reason: to enable running on very new architectures without SIMD support.

mr-c commented 1 year ago

Maybe a definition will help:

"SIMDe" == the SIMD Everywhere drop in header-only library that implements SIMD intrinsics (SSE, AVX, NEON, etc..) on various architectures.

SIMD Everywhere is the "way to emulate SIMD using plain C" that you are asking for :-)

markos commented 1 year ago

That is great to hear! So how do I enable just the C backend? Assume I have a system with unsupported SIMD, or one with supported SIMD where I disable that support. If I integrate vectorscan with SIMDe, is there a define that forces the C backend? Essentially that's what I'm asking.

mr-c commented 1 year ago

The easy way to enable portability with SIMDe is to replace #include <x86intrin.h> with

#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/x86/avx512.h>

and add SIMDe to your include path.

To force the use of the non-optimized implementations, you can define SIMDE_NO_NATIVE prior to the include (or on the compiler command line).
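
The general pattern behind such a switch can be sketched without SIMDe itself: a compile-time macro selects either the native intrinsic path or a plain-C fallback behind the same function. In this sketch `DEMO_NO_NATIVE`, `DEMO_BACKEND`, and the function names are hypothetical; `DEMO_NO_NATIVE` only plays a role analogous to SIMDE_NO_NATIVE and is not how SIMDe is implemented internally.

```c
#include <stdint.h>

/* DEMO_NO_NATIVE is a hypothetical macro for this sketch. Defining it
 * (e.g. -DDEMO_NO_NATIVE) forces the portable path even on SSE2 hardware. */
#if defined(__SSE2__) && !defined(DEMO_NO_NATIVE)
#  include <emmintrin.h>
#  define DEMO_BACKEND "sse2"
/* Native path: XOR 16 bytes with one SSE2 instruction. */
static void demo_xor16(uint8_t *dst, const uint8_t *a, const uint8_t *b) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_xor_si128(va, vb));
}
#else
#  define DEMO_BACKEND "portable-c"
/* Portable path: the same operation as a plain scalar loop. */
static void demo_xor16(uint8_t *dst, const uint8_t *a, const uint8_t *b) {
    for (int i = 0; i < 16; i++) {
        dst[i] = a[i] ^ b[i];
    }
}
#endif

/* Reports which path was compiled in. */
static const char *demo_backend_name(void) {
    return DEMO_BACKEND;
}
```

Both paths produce identical results, so the same unit tests can be run against either build to compare correctness and performance.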

markos commented 1 year ago

@mr-c so, this proved to be easier than I expected, I now have a working SIMDe-based backend that I'm testing on a platform without supported SIMD (Loongson64). Actually running the unit tests now, if everything works, I'll test the SIMDe backend on other arches as well and compare performance.

markos commented 1 year ago

Ok, it's slower of course, but out of 20k unit tests, only 4 failures, not bad at all! And easy to fix from what it seems. Thanks for the suggestions.

markos commented 11 months ago

Initial implementation added: https://github.com/VectorCamp/vectorscan/tree/feature/enable-simde-backend Need to test if it works on x86/arm/ppc64le architectures and also add an extra flag (SIMDE_NATIVE?) to enable alternative code for native paths for those architectures and compare performance between SIMDe and vectorscan's native implementation.

All tests pass.

mr-c commented 11 months ago

@markos Super cool!

As for flags, if you set the architecturally appropriate equivalent of -march=native (or -mcpu={what CPU you actually have}), then SIMDe should pick up on the features available.

On GCC/clang, please add -fopenmp-simd -DSIMDE_ENABLE_OPENMP to your CFLAGS/CXXFLAGS.

And we recommend -O3 as well, but I think you have that already.

markos commented 11 months ago

Arch detection is done separately. I am testing it now on an Arm64 system and, no surprise, there are a lot of build failures as it competes with existing definitions, but I should get these fixed quickly. Similarly, I will do the tests for the other architectures. As for OpenMP, I will leave this out, at least for now; threading is not supposed to be done internally within Vectorscan.

mr-c commented 11 months ago

-fopenmp-simd doesn't bring in the OpenMP runtime nor threading; it helps the compiler make use of the OpenMP loop vectorization hints we have in the SIMDe codebase (in case we didn't come up with an optimized implementation for a particular intrinsic for the given architecture)

See https://github.com/simd-everywhere/simde#openmp-4-simd for a fuller explanation. Related references:
https://www.openmp.org/spec-html/5.0/openmpsu42.html
https://github.com/simd-everywhere/simde/blob/471a34285aa6909d5b9b9ff3dcebfa6acf3bce47/simde/simde-common.h#L355-L371

https://gcc.gnu.org/onlinedocs/libgomp/Enabling-OpenMP.html

The -fopenmp-simd flag can be used to enable a subset of OpenMP directives that do not require the linking of either the OpenMP runtime library or the POSIX threads library.

https://clang.llvm.org/docs/UsersManual.html#openmp-features

Use -fopenmp-simd to enable OpenMP simd features only, without linking the runtime library; for combined constructs (e.g. #pragma omp parallel for simd) the non-simd directives and clauses will be ignored.
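
A small sketch of how such a vectorization hint can be attached to a fallback loop follows. `DEMO_ENABLE_OPENMP_SIMD`, `DEMO_VECTORIZE`, and `demo_add_u8` are hypothetical names for this illustration, loosely modelled on the opt-in macro approach described above rather than copied from SIMDe.

```c
#include <stddef.h>
#include <stdint.h>

/* DEMO_ENABLE_OPENMP_SIMD is a hypothetical opt-in macro. When it is defined
 * (and the compiler is given -fopenmp-simd), the loop below carries an
 * "omp simd" hint; otherwise the macro expands to nothing. */
#if defined(DEMO_ENABLE_OPENMP_SIMD)
#  define DEMO_VECTORIZE _Pragma("omp simd")
#else
#  define DEMO_VECTORIZE
#endif

/* Element-wise byte addition with wraparound. The hint tells the compiler
 * the iterations are independent; no threads or runtime are involved.
 * The caller must pass a dst that does not partially overlap a or b. */
static void demo_add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    DEMO_VECTORIZE
    for (size_t i = 0; i < n; i++) {
        dst[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

The function behaves the same with or without the hint. The pragma only affects whether the compiler is willing to vectorize the loop.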

markos commented 11 months ago

I see, I will check this out then, thanks for the clarification!

markos commented 11 months ago

ok, compilation is fixed but I'm getting many failing tests on Arm/SIMDe, I will need to investigate these, it's probably something simple.

markos commented 11 months ago

Fixed in #203

markos commented 11 months ago

@mr-c Benchmarks will follow soon in the wiki, but I noticed something very interesting, enabling the SIMDe SSE4.2 native backend for Power was consistently ~20% faster than my native VSX port :) It was the other way around for Neon though :)

In any case, the best thing is that it allows vectorscan to run on SIMD-less architectures, thanks for a great library!

victorjulien commented 11 months ago

Does this mean that vectorscan should work on essentially all architectures too? E.g. something like Risc V or Mips? Trying to see if in a project like Suricata we can go "all in" on the vectorscan API w/o the need for fallback code for platforms/architectures where vectorscan may not be available.

markos commented 11 months ago

@victorjulien It means exactly that. I was able to run Vectorscan on a Loongson system, which does have a SIMD unit that is not yet supported in vectorscan (there is a PR pending). In fact, that is exactly the platform the port was developed on, to make sure it would not accidentally execute any native SIMD instructions. Of course it will be slower, but it means you can have a consistent API. And when native support is added in SIMDe, we can enable it with a single compile flag. I would still go for the native ports eventually, but having SIMDe means it will work out of the box initially.

markos commented 11 months ago

The only thing left to implement to ensure that it can run on all platforms is big-endian (BE) support. This is also being considered but not decided yet.
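
As a generic illustration of the kind of issue BE support raises (an assumption about the class of problem, not a description of what vectorscan specifically needs), code that reads multi-byte values from a buffer has to avoid depending on the host byte order. The function names below are hypothetical.

```c
#include <stdint.h>

/* Read a 32-bit little-endian value from memory, independent of the
 * host byte order, by assembling it byte by byte. */
static uint32_t demo_load_le32(const uint8_t *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Runtime probe of the host byte order: returns 1 on big-endian hosts. */
static int demo_host_is_big_endian(void) {
    const uint16_t probe = 0x0102;
    return *(const uint8_t *)&probe == 0x01;
}
```

A plain memcpy into a uint32_t would give a different value on a big-endian host, whereas `demo_load_le32` returns the same result everywhere.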

victorjulien commented 11 months ago

Amazing work, thanks!

Jc2k commented 11 months ago

Does this work with fat runtime?

markos commented 11 months ago

It could, but there is little point in enabling fat runtime for it on the current architectures. The only cases I can think of are platforms where a SIMD unit is optional, e.g. Arm 32-bit, where Neon is optional, or PowerPC 32-bit, where Altivec is optional. But then again, these 32-bit architectures are not supported anyway, and we are not sure we will continue supporting 32-bit in general unless there is a valid use case for it. Possibly RISC-V as well, but I don't have actual RISC-V hardware with RVV to test on. Is there a particular use case you have in mind?

Jc2k commented 11 months ago

I think there are amd64 chips that are supported by distros (e.g. RH technically compiles for the very first amd64 chip) that have SSE2 and not SSE3, and iirc hyperscan targets the "core2" baseline (SSE3) as a minimum?

Unless vectorscan has an SSE2 backend (sorry could have easily missed it!) then I guess enabling this in fat runtime would technically mean that amd64 vectorscan would work on the same CPU's that distros like RH work on, even if most people do have SSE3 as a minimum? Certainly not groundbreaking, but potentially valuable to packagers.

Not a use case for me personally, the oldest thing I have is SSE3.

mr-c commented 11 months ago

Yes, for Debian amd64 we have to support SSE2-only systems (runtime detection, CPU dispatch, etc. of higher levels is okay, of course).

https://wiki.debian.org/ArchitectureSpecificsMemo#Architecture_baselines https://wiki.debian.org/InstructionSelection
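
For readers unfamiliar with the term, runtime CPU dispatch roughly means picking an implementation when the program starts rather than when it is compiled. The sketch below shows the idea using the GCC/Clang x86 builtins `__builtin_cpu_init` and `__builtin_cpu_supports`. The function names and the choice of SSE4.2 as the upper level are illustrative assumptions, not vectorscan's fat runtime code.

```c
#include <stddef.h>
#include <stdint.h>

typedef size_t (*demo_count_fn)(const uint8_t *buf, size_t len, uint8_t c);

/* Baseline implementation: safe on any CPU the binary targets. */
static size_t demo_count_portable(const uint8_t *buf, size_t len, uint8_t c) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        n += (buf[i] == c);
    }
    return n;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Same logic, but the compiler may use SSE4.2 instructions in this function. */
__attribute__((target("sse4.2")))
static size_t demo_count_sse42(const uint8_t *buf, size_t len, uint8_t c) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        n += (buf[i] == c);
    }
    return n;
}

/* Pick an implementation based on what the running CPU reports. */
static demo_count_fn demo_select(void) {
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.2") ? demo_count_sse42
                                            : demo_count_portable;
}
#else
/* Non-x86 or other compilers: only the baseline is available here. */
static demo_count_fn demo_select(void) {
    return demo_count_portable;
}
#endif
```

A caller would run `demo_select()` once and reuse the returned pointer, so the binary stays runnable on a baseline CPU while still using the faster path where it is available.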

markos commented 11 months ago

@Jc2k I see your point, yes, if 32-bit i386 is to be supported, then indeed we could drop the baseline to SSE2 so as to support the older chips. We will consider 32-bit in general and it should be decided for next release.

Jc2k commented 11 months ago

So if you decide to not support 32-bit i386, then you will also at the same time decide not to support 64-bit x86_64 chips that don't have SSE3?

markos commented 11 months ago

It's not that simple; there are more things than SIMD that involve special casing. But in short, given that AVX2 is more than 10 years old, the thought of raising the baseline dependency has crossed my mind, yes. :)

Jc2k commented 11 months ago

Fedora discussed exactly that this last cycle. In short most of their developers didn't meet that baseline. SSE3 was considered more reasonable, but for now they are sticking with SSE2 for 64-bit.

Fedora is upstream of RH, so it's going to be a looong time before RH people even benefit from SSE3 in distro-provided packages.

😅

markos commented 11 months ago

But vectorscan is not a distro, nor do we have the resources to support all the possible configurations. Even the currently supported list in our CI is more than many other projects cover: https://buildbot-ci.vectorcamp.gr/#/grid We just added SIMDe + SIMDe native configurations for every architecture in that list, and we expect to have Loongson in there soon, plus others in the near future: RISC-V, MIPS. However, the situation with Intel is already too complicated. For this reason we are considering limiting the options for x86 to just AVX2/AVX512, and leaving SSE2-SSE4.2 for 32-bit CPUs only, if 32-bit support stays at all.

markos commented 11 months ago

in any case, these are just thoughts at the moment, and this ticket is not really the best place for this discussion :)

Jc2k commented 11 months ago

Thank you. That last paragraph was the bit I was missing. I couldn't understand what 32-bit had to do with my question.

So if I understand your plans correctly you may drop support for SSE3 entirely and you do not plan to add SIMDe to fat runtime on x86_64. That's very useful to know.

Thanks for answering.

markos commented 11 months ago

"may" is the keyword here. But regarding SIMDe on x86_64 fat runtime, I don't know if there is a reason for that; I am not aware of any widely available 64-bit CPU that lacks SSE4.2 at the moment. Note the "widely available": we are talking about tech that is almost 20 years old here. SSE4.2 dates from 2008, and even AVX2 has been around since 2013. But please use another ticket for this if you think it should be supported/discussed.

VectorCamp / vectorscan

Reenable C backend for non-SIMD platforms #158