EESSI / software-layer

Software layer of the EESSI project
https://eessi.github.io/docs/software_layer
GNU General Public License v2.0

Building `LAMMPS-2Aug2023_update2-foss-2023a-kokkos.eb` may fail for `--optarch=GENERIC` #545

Open trz42 opened 2 months ago

trz42 commented 2 months ago

In https://github.com/NorESSI/software-layer/pull/323, building failed for aarch64/generic. The build job ran on a compute node with a ThunderX2 CPU, and kokkos_arch was not explicitly set.

If kokkos_arch is not explicitly set, the LAMMPS easyblock tries to determine the CPU architecture of the host. See https://github.com/easybuilders/easybuild-easyblocks/blob/develop/easybuild/easyblocks/l/lammps.py#L577-L595

    if kokkos_arch:
        if kokkos_arch not in KOKKOS_CPU_ARCH_LIST:
            warning_msg = "Specified CPU ARCH (%s) " % kokkos_arch
            warning_msg += "was not found in listed options [%s]." % KOKKOS_CPU_ARCH_LIST
            warning_msg += "Still might work though."
            print_warning(warning_msg)
        processor_arch = kokkos_arch

    else:
        warning_msg = "kokkos_arch not set. Trying to auto-detect CPU arch."
        print_warning(warning_msg)

        processor_arch = kokkos_cpu_mapping.get(get_cpu_arch())

        if not processor_arch:
            error_msg = "Couldn't determine CPU architecture, you need to set 'kokkos_arch' manually."
            raise EasyBuildError(error_msg)

        print_msg("Determined cpu arch: %s" % processor_arch)

It runs get_cpu_arch(), which queries archspec with

    python -c 'from archspec.cpu import host; print(host())'

archspec returned thunderx2 in the PR for NESSI. For EESSI it would currently return neoverse_n1, because the AWS compute node used to build for aarch64/generic has a neoverse_n1 CPU. The CPU architecture is then mapped via

    processor_arch = kokkos_cpu_mapping.get(get_cpu_arch())

to an architecture identifier used in Kokkos. This works for EESSI, because https://github.com/easybuilders/easybuild-easyblocks/pull/3036/files#diff-bdb538abf869738e5431974debc2503a1b160370b86938bfc02729de69d5689b dynamically adds a mapping for neoverse_n1. For thunderx2 such a mapping is missing, so the lookup returns nothing and the easyblock raises the "Couldn't determine CPU architecture" error shown above.
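The lookup behaviour described above can be reproduced without EasyBuild. The following is only a minimal sketch: `KOKKOS_CPU_MAPPING`, `resolve_kokkos_arch`, and the placeholder value for neoverse_n1 are hypothetical stand-ins, not the easyblock's actual table or function, and `ValueError` takes the place of `EasyBuildError`.

```python
# Hypothetical stand-in for the easyblock's kokkos_cpu_mapping table.
# The value below is a placeholder, not the identifier the easyblock uses.
KOKKOS_CPU_MAPPING = {
    "neoverse_n1": "PLACEHOLDER_KOKKOS_ID",
}


def resolve_kokkos_arch(host_arch, kokkos_arch=None, mapping=KOKKOS_CPU_MAPPING):
    """Mimic the choice between an explicit and an auto-detected arch."""
    if kokkos_arch:
        # An explicit setting always wins; the host CPU is never consulted.
        return kokkos_arch

    # dict.get() returns None when the host architecture has no entry.
    processor_arch = mapping.get(host_arch)
    if not processor_arch:
        raise ValueError(
            "Couldn't determine CPU architecture, "
            "you need to set 'kokkos_arch' manually."
        )
    return processor_arch
```

With this sketch, `resolve_kokkos_arch("neoverse_n1")` succeeds, `resolve_kokkos_arch("thunderx2")` raises, and `resolve_kokkos_arch("thunderx2", kokkos_arch="ARMV80")` bypasses the lookup entirely, which is why setting kokkos_arch explicitly avoids the failure.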

However, even if we mapped thunderx2 to the correct value (probably ARMV81), the resulting build might not function correctly on every aarch64/generic CPU, for example a Raspberry Pi 3/4.

In https://github.com/NorESSI/software-layer/pull/323, we therefore opted to extend an existing parse_hook to set kokkos_arch to ARMV80 when we build for aarch64 and the build option optarch is set to GENERIC.
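The decision made in that hook can be sketched as a small pure function. This is not the code from the PR: `set_generic_kokkos_arch` is a hypothetical helper, the easyconfig is modelled as a plain dict, and the CPU family and optarch value are passed in as arguments. In a real EasyBuild hook they would presumably come from `get_cpu_architecture()` and `build_option('optarch')`.

```python
GENERIC = "GENERIC"
AARCH64 = "aarch64"


def set_generic_kokkos_arch(ec, cpu_family, optarch):
    """Pin kokkos_arch to ARMV80 for generic aarch64 LAMMPS builds (sketch).

    `ec` is a plain dict standing in for a parsed easyconfig; this is an
    assumption made to keep the example runnable without EasyBuild.
    """
    is_lammps = ec.get("name") == "LAMMPS"
    is_generic_arm = cpu_family == AARCH64 and optarch == GENERIC
    if is_lammps and is_generic_arm:
        # Explicit value, so the easyblock never auto-detects the host CPU.
        ec["kokkos_arch"] = "ARMV80"
    return ec
```

For example, `set_generic_kokkos_arch({"name": "LAMMPS"}, "aarch64", "GENERIC")` returns a dict with `kokkos_arch` set to `ARMV80`, while a native aarch64 build or any other package is left untouched.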

kokkos_arch may also need to be set explicitly when building for x86_64/generic.

ocaisa commented 1 month ago

Given that we have shifted to archdetect within EESSI, I agree the implication here is that we should probably always be setting kokkos_arch for the generic cases (or indeed any case where we are not doing a native build).