Are ugrep SSE and AVX binaries portable to non-SSE2 CPUs?

Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more

https://ugrep.com

BSD 3-Clause "New" or "Revised" License

2.56k stars 109 forks source link

Are ugrep SSE and AVX binaries portable to non-SSE2 CPUs? #368

Closed drok closed 5 months ago

drok commented 5 months ago

I don't have a way to test this, but while working on fixing various autotools issues, I noted that there is no Matcher::simd_match_sse2() function, counterpart to the other Matcher::simd_match_*().

If a binary built with SSE2 extensions somehow runs on a CPU without SSE2, the matcher runs into SSE2 code, without any way to bypass it. The non-extension slow path is masked by the #ifdef HAVE_SSE2 conditional define and is not reachable.

I don't currently have an easy way to confirm this; it would take running a SIMD binary on:

an emulator with SSE2 disabled, or
a VM with SSE2 disabled, as described in this vmware support question seeking to test some software in an env similar to actual end-user hardware, or
a physical CPU older than Pentium 4 (eg, early version of linux with i386 support on some old hardware), or
Xeon Phi CPU (eg, the world fastest supercomputer in 2013)

Also interesting: CPU without sse2 crashes Firefox update (2017) - I hope I don't need to make a stronger case why CPU's without SSE2 should not crash ugrep (the "break-neck" grep of the 2020's)

Am I misreading the code?
Does the test suite need to cover the non-SSE2 CPU case?

genivia-inc commented 5 months ago

Binary portability is supported by ugrep by default, but there are (reasonable) limitations to what is possible and how far we want to go.

To make ugrep binary portable to non-SSE2 machines:

$ ./build.sh --disable-sse2

When compiled on x86 and x64 CPUs, ugrep is binary portable to SSE2, AVX2 and/or AVX512 capable CPUs. We should expect a SSE2/AVX2/AVX512 compiled binary to run at least on a SSE2 capable CPU. This should be a reasonable assumption, since virtually all x86 and x64 CPUs support at least SSE2 these days. See SIMD_FLAGS="-msse2 -DHAVE_SSE2" in configure.ac when compiled on a SSE2, AVX2 and/or AVX512 capable CPU. Likewise, only some parts of the code are compiled with SIMD_AVX2_FLAGS="-mavx2" or SIMD_AVX512BW_FLAGS="-mavx512bw" and these are guarded by runtime AVX2/AVX512 checks.

The parts that are SSE2/AVX2/AVX512 compiled with intrinsics are compiled separately with the proper compiler flags. These are located in the lib directory. The SIMD intrinsics in these code parts are guarded by runtime SIMD checks to call the SSE2, AVX2 or AVX512 compiled code. These checks are in effect if the code was compiled on an SSE2/AVX2/AVX512 machine. Otherwise NEON/AArch64 is checked and compiled on those CPUs.

I am not in favor to disable SSE2 throughout ugrep when compiled for x86 and x64 machines. Perhaps adding a runtime SSE2 check to terminate ugrep immediately with a warning when SSE2 is not available.

drok commented 5 months ago

So... ugrep should work (not crash) on CPUs that lack AVX, but is not expected to work on CPUs that lack SSE2 ? It can and does detect AVX at run-time (and do the right thing), but does not detect SSE2 at run-time because all x64 CPUs are presumed to have it?

I would expect that if ugrep makes into a distro that works on x86-64 it will not crash on any computers where the distro can be installed and run.

It turns out testing this is straight forward:

qemu-x86_64 -cpu qemu64,sse2=no /usr/bin/ugrep needle haystack.txt

This doesn't work on Ubuntu because libc itself requires SSE2 there.

The unreachable code suggests that failing to detect SSE2 at runtime is not intentional.

genivia-inc commented 5 months ago

SSE2 support on x86 and x64 CPUs is assumed, which is reasonable these days, but can be explicitly disabled when that is a problem on some legacy machine if binary portability is needed. That means it is fine to build and execute on a non-SSE2 machine, because there is no use of SSE2 (nor AVX2 nor AVX512). There is a bare fallback in ugrep that does not use any SIMD (i.e. ./build.sh --disable-sse2 explicitly and implicitly when SSE2 is not auto-detected on the machine).

There is no crash or an issue when built on a non-SSE2 machine on which the code is also run.

I don't know if you were referring to binary portability. This is possible across SSE2 <-> AVX2 <-> AVX512 machines when ugrep is compiled on any one of them. I wanted to have that ability and it is important. The same for the ugrep.exe built with MSVC++ on my Windows machine and included in the repo.

genivia-inc commented 5 months ago

I hope this clarifies the initial question that is related to #339. We should merge further Q&A and discussions with #339. I'll label this "duplicate" just so that anyone following this knows it's continued on with #339.

drok commented 5 months ago

I might have misunderstood, but it sounded like a binary built on an SSE2 CPU will not run on a non-SSE2 CPU (ie, not portable).

There is a way to create portable binaries, using the --disable-sse2 flag, but it is neither the default, nor the recommended configuration, as this disables the core benefit of this program (extra speed on CPUs that have the SIMD instructions).

While you don't ack that X86_64 CPUs without SSE2 are in use today, you expect their users will compile their binaries specially, with the ~--dont-crash~--disable-sse2 flag.

A Xeon Phi user cannot copy ugrep from Ubuntu or any of the mainstream distros. That binary is not portable to the Xeon Phi CPU.

I'm not going to press this because I'm not directly affected by this issue. I just noticed it while reviewing the source code.

A patch is available in https://github.com/Mergesium/ugrep/commit/9ba524f049abdcd0c66399ee50f2d194b8eab8d1 as mentioned above in the activity feed.

genivia-inc commented 5 months ago

I might have misunderstood, but it sounded like a binary built on an SSE2 CPU will not run on a non-SSE2 CPU (ie, not portable).

That is right, but it is intentional and not a bug. I don't think there is any reason to worry here. Everything including libc is built with SSE2 these days. I am not willing to sacrifice SSE2 just to archive binary portability to legacy CPUs. Perhaps an upfront runtime check in ugrep for SSE2 could be added to quit gracefully.

genivia-inc commented 5 months ago

I should clarify that enabling SSE2 for the compiler also typically optimizes parts that aren't using SSE2 intrinsics explicitly. So removing SSE2 as a compile flag is not preferable IMO. Also, the current code base does not include a separate SSE2 matcher build step like the AVX2 and AVX512 have for multi-version run-time SIMD execution. Again, what this means is that once the code is built for SSE2 it expects an SSE2 CPU.

genivia-inc commented 5 months ago

Let me clarify some misunderstandings in these posts here (and in the PR that supposedly fixes these problems) on how ugrep is supposed to be properly compiled and run.

building on an SSE-capable machine does not require the binary to be portable down to a non-SSE-capable machine, these virtually no longer exist. Removing -msse2 from compiling all sources will make the ugrep binary suboptimal for performance
when AVX512-BW is detected at compile time, we only need to set HAVE_AVX512BW because AVX2 is also supported, i.e. the runtime code should check for AVX512BW and AVX2 in this case to support binary portability down (requiring at least SSE2)
when AVX2 is detected at compile time, we only need to set HAVE_AVX2 because SSE2 is also supported
for ARM, all sources should be compiled with NEON when detected at compile time, not just the matcher, just like we compile SSE2 because non-NEON devices are rare, so I don't worry about binary portability down in this case. It is also bad for performance to not compile with NEON enabled, because I use vectorizable loops and other constructed that the compiler will optimize for NEON i.e. without using intrinsics