lexming opened this issue 2 years ago
I'm not in favor of going with more generic names, since that's going to cause more confusion ("v3" is essentially meaningless, and if that name is all you have, how the hell do you figure out what it corresponds to)...

Perhaps different yet descriptive names like `avx2_fma` are worth considering, but `v3` is bad imho. Especially because `x86_64/v2` means SSE4 and `x86_64/v3` means AVX+AVX2; there's nothing in between v2 and v3, and that doesn't make sense from a performance perspective imho.
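For reference, here's roughly what each level adds on top of the previous one, per the x86-64 psABI (paraphrased from memory, so double-check the spec; sketched as a Python dict):

```python
# Approximate per-level feature additions from the x86-64 psABI
# (each level also requires everything from the levels below it).
X86_64_LEVELS = {
    "x86-64":    {"sse", "sse2"},  # baseline, plus cmov, fxsr, mmx, ...
    "x86-64-v2": {"cx16", "lahf_lm", "popcnt",
                  "sse3", "ssse3", "sse4_1", "sse4_2"},
    "x86-64-v3": {"avx", "avx2", "bmi1", "bmi2", "f16c",
                  "fma", "lzcnt", "movbe", "xsave"},
    "x86-64-v4": {"avx512f", "avx512bw", "avx512cd",
                  "avx512dq", "avx512vl"},
}
```

Note that there is indeed no AVX-only step between v2 and v3.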
Maybe this boils down to misinterpreting things though: I like the `haswell` name because most people in the HPC community likely know what it stands for, and it basically means "software installations in `/haswell` are compatible with CPUs that support an instruction set like that of Intel Haswell CPUs". It does not mean "only works on Haswell CPUs". It's just the commonly accepted name for the Intel Haswell microarchitecture, which is known to imply "supports AVX2 instructions".
If we're careful enough about picking the CPU microarchitecture of the build hosts we use, then we shouldn't run into trouble. Do we have examples where problems like this actually popped up with the current EESSI pilot repositories? Or, can we come up with a scenario where the CPU detection currently results in picking modules that fail to run on that system?
That said, this is probably going to turn into a bikeshedding discussion. There will always be pros and cons to whatever approach we take, I think; there's probably no clear technically best solution...
Note that the levels aren't made up -- they're supposed to be a compatibility standard. They come from the glibc team, and they are supported by both clang and gcc. We have backported them in archspec to other compilers using flag combinations. Given that most industry compilers are downstream of clang, I expect to see the levels more widely supported in the future (if not already). So I wouldn't deviate from the standard names.
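As a quick illustration of that backporting, something like this should work with archspec's Python API (a sketch; the exact flags returned depend on the archspec and compiler versions):

```python
import archspec.cpu

# The generic psABI levels live alongside the uarch names
# (note the underscores in archspec's naming).
v3 = archspec.cpu.TARGETS["x86_64_v3"]

# Recent GCC/Clang accept -march=x86-64-v3 directly; for compilers
# that don't, archspec emits an equivalent combination of flags.
print(v3.optimization_flags("gcc", "12.2.0"))
```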
These levels being standard doesn't mean they're fine-grained enough... You know as well as I do that the jump from only using SSE4 instructions to AVX+AVX2, without a step in the middle for AVX-only, is a pretty significant one performance-wise. That said, these levels are also not fine-grained enough in various other ways in the context of EESSI.

It's perfectly fine that archspec supports them, since they do indeed seem to be some form of standard, but to me it seems like they're not what we need.
Also, these are only for `x86_64`. Are there equivalents for other CPU architectures, in particular Arm and RISC-V?

Ideally, we have a mechanism that works across different CPU architectures, like using code names of microarchitectures (`haswell`, `zen2`, `neoverse-n1`, `rv64gc`, etc.).
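For what it's worth, archspec already offers a single cross-architecture entry point along those lines (a sketch; note that archspec spells the names with underscores, e.g. `neoverse_n1`):

```python
import archspec.cpu

# Detect the microarchitecture of the machine we're running on.
host = archspec.cpu.host()
print(host.name)  # e.g. "haswell", "zen2" or "neoverse_n1"

# Feature checks work the same way regardless of the architecture.
print("avx2" in host)
```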
For ARM, the generics I would use would be the `armv8.1` etc. names, along with the uarch names, as we do in archspec. I know the generics are not fine-grained; that's why we also have the uarch names. The relationships (currently) are shown in arch-all.pdf. You can see that the ARM side needs fleshing out with way less specific names, while `x86_64` has both.
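Those relationships can also be queried programmatically; a minimal sketch, assuming a recent archspec version where the generic `x86_64_v*` targets sit in the uarch ancestor chains:

```python
import archspec.cpu

haswell = archspec.cpu.TARGETS["haswell"]

# Everything haswell is compatible with: older uarchs and, in recent
# archspec versions, the generic x86_64_v* levels.
print([t.name for t in haswell.ancestors])

# Targets are partially ordered by compatibility.
print(haswell > archspec.cpu.TARGETS["x86_64_v3"])
```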
One of my ideas behind archdetect was to simplify from ideal/perfect back to good-enough for the job required for EESSI; this implementation will most likely outlive us in the next decade (that is only ~2 generations of clusters, or 2-3 generations of microarchitectures). Detecting the x86 architectures is "simple": all the required info is mature and clearly defined in the CPU flags and readable through e.g. /proc. Using the 'common' microarchitecture names allows us to clearly talk about them. Adding new features such as GPU detection will most likely add more value than trying to optimize here...
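To make the "good enough" point concrete, the x86 side really is about this small (a hypothetical sketch, not the actual archdetect code):

```python
def x86_cpu_flags(path="/proc/cpuinfo"):
    """Return the feature flags of the first CPU listed in /proc/cpuinfo."""
    with open(path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

# Good enough for EESSI: the host can use the haswell-targeted stack
# if it advertises the handful of features those builds rely on.
required = {"avx2", "fma", "bmi2"}
print("haswell-compatible:", required <= x86_cpu_flags())
```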
Currently, the software layer is built on a host machine for that host machine (i.e. `-march=native`). However, on the client side, we are potentially using those same binaries on a variety of architectures. For instance, we have an installation for `haswell`, but in `archdetect` (#187) any AVX2-only system will be identified as `haswell` and use those binaries. This can work sometimes (e.g. Broadwell and Haswell share 99% of their instruction set), but in other cases it will be problematic (see the sketch below).

Since we will (probably) never have the resources to build binaries for all x86_64 CPU microarchitectures, I propose that we move to more generic builds and label those installations accordingly to avoid any confusion (i.e. not with an existing arch name, unless it will only be used for that CPU arch).
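Here is a hedged sketch of the failure mode using archspec: it lists the instructions that `haswell`-targeted binaries are allowed to use but that the detected host lacks (any non-empty difference can mean an illegal-instruction crash at runtime):

```python
import archspec.cpu

haswell = archspec.cpu.TARGETS["haswell"]
host = archspec.cpu.host()  # what an archdetect-style detection sees

# Instructions the haswell-targeted binaries may use but this host
# does not support; running such binaries can fail with SIGILL.
missing = set(haswell.features) - set(host.features)
print(missing or "compatible")
```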
As discussed in Slack, a good starting point could be the standard CPU feature levels supported in gcc and clang. We could replace the existing installations with the following:

- build with `-march=x86-64-v3` and distribute those binaries in `x86_64/v3`
- build with `-march=x86-64-v4` and distribute those binaries in `x86_64/v4`
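On the client side, picking the installation prefix could then boil down to something like this (a sketch: `generic` is the archspec property that collapses a detected uarch to its best matching psABI level, and the `x86_64/generic` fallback name is only an assumption for hosts below v3):

```python
import archspec.cpu

# Collapse the detected uarch (e.g. "cascadelake") to its best
# matching generic level (e.g. "x86_64_v4").
level = archspec.cpu.host().generic.name

# Map archspec's underscore names onto the proposed directory layout;
# the fallback directory here is hypothetical.
subdir = {
    "x86_64_v3": "x86_64/v3",
    "x86_64_v4": "x86_64/v4",
}.get(level, "x86_64/generic")
print(subdir)
```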
Technically, we could even use `x86_64/v3` for AMD `zen`, `zen2` and `zen3`. But if these generic definitions do not work well for certain CPU archs, we can always add new custom-made ones, or have installations that specifically target a single CPU arch. For instance, based on the current situation, let's say that we wanted to keep `zen3` separate to be able to further optimize for that system.

What are your thoughts?