EESSI / software-layer

Software layer of the EESSI project
https://eessi.github.io/docs/software_layer
GNU General Public License v2.0

Notes on kickstarting the RISC-V software layer #552

Open bedroge opened 2 months ago

bedroge commented 2 months ago

With a compatibility layer (https://github.com/EESSI/compatibility-layer/pull/204) and software build container (https://github.com/EESSI/filesystem-layer/pull/132 and https://github.com/orgs/EESSI/packages/container/package/build-node) in place, we are ready to start working on a RISC-V software layer. In this issue we can keep notes on the work being done and track the issues that we encounter.

bedroge commented 2 months ago

The repository that we use is /cvmfs/riscv.eessi.io, added in https://github.com/EESSI/filesystem-layer/pull/181. The structure is the same as in /cvmfs/software.eessi.io.

For now, we focus first on generic builds (support for these was added to EasyBuild in https://github.com/easybuilders/easybuild-framework/pull/4489). Flags for optimized builds are still lacking; see https://github.com/easybuilders/easybuild-framework/blob/develop/easybuild/toolchains/compiler/gcc.py#L82.

bedroge commented 2 months ago

In order to get EasyBuild installed, I've used the following:

singularity build --sandbox /nvme/build-container docker://ghcr.io/eessi/build-node:debian-sid
EESSI_CVMFS_REPO_OVERRIDE=/cvmfs/riscv.eessi.io ./eessi_container.sh -c /nvme/build-container --access rw
/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/startprefix
git clone https://github.com/EESSI/software-layer
cd software-layer
wget https://github.com/EESSI/software-layer/pull/537.diff
export EESSI_CVMFS_REPO_OVERRIDE=/cvmfs/riscv.eessi.io EESSI_VERSION_OVERRIDE=20240402 EESSI_SOFTWARE_SUBDIR_OVERRIDE=riscv64/generic
./EESSI-install-software.sh

We explicitly override some variables to reflect the new repo/version/CPU target, and then it sort of mimics what the bot would do by taking the diff file from https://github.com/EESSI/software-layer/pull/537 and running the install script. This worked perfectly fine. :tada:

bedroge commented 2 months ago

Now that EasyBuild is available in the repo, one can easily start building additional software interactively:

# Launch the container
EESSI_CVMFS_REPO_OVERRIDE=/cvmfs/riscv.eessi.io ./eessi_container.sh -c docker://ghcr.io/eessi/build-node:debian-sid --access rw

# Start a prefix shell in the container:
/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/startprefix

# EESSI init
export EESSI_CVMFS_REPO_OVERRIDE=/cvmfs/riscv.eessi.io EESSI_VERSION_OVERRIDE=20240402 EESSI_SOFTWARE_SUBDIR_OVERRIDE=riscv64/generic
source /cvmfs/riscv.eessi.io/versions/20240402/init/bash

# Set up EB and start a build
git clone https://github.com/EESSI/software-layer
cd software-layer
export WORKDIR=/tmp/eb
source configure_easybuild
module load EasyBuild
eb --optarch=GENERIC -r foss-2023b.eb

bedroge commented 2 months ago

As a first attempt, I tried building GCC 13.2.0, but that failed due to the hook that sets up a wrapper for ld. It uses config.guess to determine the system type, and this returns riscv64-unknown-linux-gnu. It will then look for riscv64-unknown-linux-gnu-ld* in $EPREFIX/usr/bin, but Gentoo was built with CHOST = riscv64-pc-linux-gnu, so the binaries also use that in their filenames.

I've opened a PR at the Gentoo repo to change the CHOST: https://github.com/gentoo/gentoo/pull/36353.

Meanwhile I worked around the issue by hardcoding it in the hook to:

cmd_prefix = 'riscv64-pc-linux-gnu-'

Furthermore, ld.gold has to be removed from the tuple in the next line of the hook, for cmd in ('ld', 'ld.gold', 'ld.bfd'):, since we don't have ld.gold in our RISC-V compat layer.

With these small changes I could successfully build GCC 13.2.0 (not ingested yet).
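
The prefix mismatch described above can be illustrated with a small, self-contained sketch. This is not the actual EESSI hook: the helper `find_linker_commands`, its parameters, and the "skip what is missing" behaviour are hypothetical. Only the two prefixes and the `ld` / `ld.gold` / `ld.bfd` names are taken from the notes above.

```python
import os
import tempfile


def find_linker_commands(bin_dir, cmd_prefix, candidates=('ld', 'ld.gold', 'ld.bfd')):
    """Return the prefixed linker binaries that actually exist in bin_dir.

    Hypothetical helper: linkers that are missing (e.g. ld.gold) are
    skipped instead of being treated as an error.
    """
    found = []
    for cmd in candidates:
        path = os.path.join(bin_dir, cmd_prefix + cmd)
        if os.path.isfile(path):
            found.append(path)
    return found


if __name__ == '__main__':
    # Fake bin directory that only provides names using the Gentoo CHOST.
    with tempfile.TemporaryDirectory() as bin_dir:
        for name in ('riscv64-pc-linux-gnu-ld', 'riscv64-pc-linux-gnu-ld.bfd'):
            open(os.path.join(bin_dir, name), 'w').close()

        # Prefix as reported by config.guess: no binary matches.
        print(find_linker_commands(bin_dir, 'riscv64-unknown-linux-gnu-'))
        # Prefix matching the CHOST: ld and ld.bfd are found, ld.gold is skipped.
        print(find_linker_commands(bin_dir, 'riscv64-pc-linux-gnu-'))
```

Checking which binaries exist, rather than hardcoding the list, would avoid having to edit the tuple by hand when a linker such as ld.gold is absent.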

bedroge commented 2 months ago

FFTW fails due to:

checking for sinq in -lquadmath... no
configure: error: quad precision requires libquadmath for quad-precision trigonometric routines

It looks like our GCC doesn't include libquadmath; I suppose it doesn't work on RISC-V (?). This Fedora page has a message "enable support for riscv64", so maybe we need GCC 14. For now we could try building FFTW without it.

edit: I was checking the FFTW easyblock, and I found that this is already disabled for Arm and PowerPC, so we should make a PR to do the same for RISC-V: https://github.com/easybuilders/easybuild-easyblocks/blob/develop/easybuild/easyblocks/f/fftw.py#L143

edit2: PR created: https://github.com/easybuilders/easybuild-easyblocks/pull/3314
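
As a rough illustration of the kind of check involved, here is a minimal sketch that decides whether to request quad precision based on the machine name. It is not the code from the FFTW easyblock or from the PR: the helper names and the use of `platform.machine()` prefixes are assumptions made for this example, and the real easyblock relies on EasyBuild's own CPU family detection.

```python
import platform

# Machine-name prefixes for which quad precision is assumed to be unavailable.
# Illustrative only: Arm and PowerPC because the easyblock already skips them,
# RISC-V because of the libquadmath failure shown above.
NO_QUAD_PREFIXES = ('aarch64', 'arm', 'ppc', 'riscv')


def quad_precision_supported(machine=None):
    """Return False for machine names matching one of the prefixes above."""
    machine = (machine or platform.machine()).lower()
    return not machine.startswith(NO_QUAD_PREFIXES)


def fftw_precisions(machine=None):
    """List the precisions to build; 'quad' is only added where it is supported."""
    precisions = ['single', 'double', 'long-double']
    if quad_precision_supported(machine):
        precisions.append('quad')
    return precisions
```

With this sketch, `fftw_precisions('riscv64')` leaves out `'quad'`, while `fftw_precisions('x86_64')` includes it.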

bedroge commented 2 months ago

When trying to build foss 2023b, I ran into the next issue with UCX, which has an outdated config.guess:

checking build system type... ./config.guess: unable to guess system type

This script, last modified 2013-06-10, has failed to recognize
the operating system you are using. It is advised that you
download the most up to date version of the config scripts from

  http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD
and
  http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD

If the version you run (./config.guess) is already up to date, please
send the following data and any information you think might be
pertinent to <config-patches@gnu.org> in order to provide the needed
information to handle your system.

config.guess timestamp = 2013-06-10

uname -m = riscv64
uname -r = 5.15.0-starfive
uname -s = Linux
uname -v = #1 SMP Fri Nov 24 07:22:28 UTC 2023

/usr/bin/uname -p = unknown
/bin/uname -X     = 

hostinfo               = 
/bin/universe          = 
/usr/bin/arch -k       = 
/bin/arch              = riscv64
/usr/bin/oslevel       = 
/usr/convex/getsysinfo = 

UNAME_MACHINE = riscv64
UNAME_RELEASE = 5.15.0-starfive
UNAME_SYSTEM  = Linux
UNAME_VERSION = #1 SMP Fri Nov 24 07:22:28 UTC 2023
configure: error: cannot guess build type; you must specify one

So we need to patch this by providing a newer version of config.guess before the configure step.
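
To spot such stale scripts before the configure step, one could read the `timestamp='YYYY-MM-DD'` line that config.guess carries near its top. The sketch below does that; the function names and the cut-off date passed by the caller are hypothetical, and this is not how EasyBuild itself decides.

```python
import re
from datetime import date


def config_guess_timestamp(text):
    """Extract the timestamp='YYYY-MM-DD' value from the contents of config.guess."""
    match = re.search(r"^timestamp='(\d{4})-(\d{2})-(\d{2})'", text, re.MULTILINE)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def is_outdated(text, minimum):
    """Treat a script as outdated if it has no timestamp or one older than minimum."""
    stamp = config_guess_timestamp(text)
    return stamp is None or stamp < minimum


if __name__ == '__main__':
    sample = "#! /bin/sh\n# Attempt to guess a canonical system name.\ntimestamp='2013-06-10'\n"
    # The cut-off below is an arbitrary example value, not a known RISC-V support date.
    print(config_guess_timestamp(sample), is_outdated(sample, date(2018, 1, 1)))
```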

edit: I worked around the issue by using a hook that copies EB's config.guess to the UCX build dir:

        config_guess_path = self.obtain_config_guess()
        copy_file(config_guess_path, self.start_dir)

This allows the configure step to complete, but the build fails almost immediately due to:

/tmp/eb/easybuild/build/UCX/1.15.0/GCCcore-13.2.0/ucx-1.15.0/src/ucm/bistro/bistro.h:24:4: error: #error "Unsupported architecture"
   24 | #  error "Unsupported architecture"
      |    ^~~~~

edit2: looks like RISC-V support was added in UCX 1.16.0 (which was released 10 days ago).

bedroge commented 2 months ago

The config.guess issue would normally be solved by EB itself, but it's not happening for UCX, because that easyconfig is using a wrapper script around ./configure. This PR changes it, which should solve the issue: https://github.com/easybuilders/easybuild-easyconfigs/pull/20428.

I also have a patch that backports RISC-V support into UCX 1.15.0: https://github.com/easybuilders/easybuild-easyconfigs/pull/20429.

bedroge commented 2 months ago

Next issue: the foss 2023b toolchain has UCC 1.2.0, but RISC-V support was only added in 1.3.0: https://github.com/openucx/ucc/pull/829. The diff is quite small, so it should be easy to backport this to 1.2.0.

Edit: solved in PR https://github.com/easybuilders/easybuild-easyconfigs/pull/20432.

bedroge commented 2 months ago

BLIS 0.9.0 fails in the configure step:

configure: automatic configuration requested.
/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/bin/ld: /tmp/eb-oaju2ohj/cc7gwuui.o: in function `main':
config_detect.c:(.text+0x2aa): undefined reference to `bli_cpuid_query_id'
collect2: error: ld returned 1 exit status
./configure: line 1212: ./auto-detect.x: No such file or directory
configure: hardware detection driver returned ''.
configure: checking configuration against contents of 'config_registry'.
configure: 'auto-detected configuration '' is NOT registered!
configure: 
configure: *** Cannot continue with unregistered configuration ''. ***
configure: 

There are some BLIS PRs related to adding RISC-V functionality, so I'll have a look at those.

bedroge commented 2 months ago

Backported RISC-V support to BLIS 0.9.0: https://github.com/easybuilders/easybuild-easyconfigs/pull/20468.

OpenBLAS also built without any issues, so we're getting really close to having a full foss/2023b toolchain.

bedroge commented 2 months ago

FlexiBLAS and ScaLAPACK also installed without issues, so we now have foss/2023b!

bedroge commented 1 month ago

R 4.3.3 is now available as well. It required some (small) changes in the easyblocks/easyconfigs of Mesa, LLVM, and Java. I'll open PRs for those and list them here.

RISC-V support for Java: https://github.com/easybuilders/easybuild-easyblocks/pull/3323 https://github.com/easybuilders/easybuild-easyconfigs/pull/20495

RISC-V support for Mesa: https://github.com/easybuilders/easybuild-easyblocks/pull/3324

RISC-V support for LLVM: https://github.com/easybuilders/easybuild-easyblocks/pull/3325

In order to replace the dependency on Java 11 by Java 21, I used the following hook:

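from easybuild.framework.easyconfig.constants import EASYCONFIG_CONSTANTS
from easybuild.tools.systemtools import RISCV, get_cpu_family

SYSTEM = EASYCONFIG_CONSTANTS['SYSTEM'][0]

def parse_hook_use_newer_java(ec, *args, **kwargs):
    """Replace the Java 11 dependency of R 4.3.3 by Java 21 on RISC-V."""
    if ec.name == 'R' and ec.version in ['4.3.3'] and get_cpu_family() == RISCV:
        deps = ec['dependencies']
        java_dep = None
        java_name, java_version = ('Java', '11')
        # find the first dependency entry for Java 11
        for idx, dep in enumerate(deps):
            if dep[0] == java_name and dep[1] == java_version:
                java_dep = dep
                break
        if java_dep:
            # swap it for Java 21 with the system toolchain
            deps[idx] = ('Java', '21', '', SYSTEM)

julianmorillo commented 1 month ago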

dlb (https://pm.bsc.es/dlb) built without issues. Attached is the corresponding tar file. eessi-20240402-software-linux-riscv64-generic-1715088854.tar.gz

bedroge commented 1 month ago

While trying to install GROMACS, I ran into issues with its dependency SciPy-bundle: some numpy tests fail.

FAILED core/tests/test_numeric.py::TestBoolCmp::test_float - AssertionError: 
FAILED core/tests/test_umath.py::TestFPClass::test_fpclass[-4] - AssertionError: 
FAILED core/tests/test_umath.py::TestFPClass::test_fpclass[-2] - AssertionError: 
FAILED core/tests/test_umath.py::TestFPClass::test_fpclass[-1] - AssertionError: 
FAILED core/tests/test_umath.py::TestFPClass::test_fpclass[1] - AssertionError: 
FAILED core/tests/test_umath.py::TestFPClass::test_fp_noncontiguous[f] - AssertionError: 
===== 6 failed, 33239 passed, 943 skipped, 1303 deselected, 31 xfailed, 3 xpassed, 58 warnings in 1640.83s (0:27:20) =====

I found https://github.com/numpy/numpy/pull/25246 which disables most of these on RISC-V, so for now I've ignored the test failures. Now GROMACS itself is failing in the test step as well:

99% tests passed, 1 tests failed out of 91

Label Time Summary:
GTest              = 759.58 sec*proc (87 tests)
IntegrationTest    = 285.44 sec*proc (30 tests)
MpiTest            = 420.83 sec*proc (23 tests)
QuickGpuTest       =  83.55 sec*proc (20 tests)
SlowGpuTest        = 493.55 sec*proc (14 tests)
SlowTest           = 392.31 sec*proc (13 tests)
UnitTest           =  81.82 sec*proc (44 tests)

Total Test time (real) = 760.26 sec

The following tests FAILED:
          2 - GmxapiMpiTests (Failed)

Full output of the failing test:

starting mdrun 'Water and methane'
4 steps,      0.0 ps (continuing from step 2,      0.0 ps).
[starfive:369549:0:369549] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[starfive:369548:0:369558] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 369558) ====
 0  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/UCX/1.15.0-GCCcore-13.2.0/lib64/libucs.so.0(ucs_handle_error+0x1fc) [0x3f9edc8044]
 1  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/UCX/1.15.0-GCCcore-13.2.0/lib64/libucs.so.0(+0x2111e) [0x3f9edc811e]
 2  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/UCX/1.15.0-GCCcore-13.2.0/lib64/libucs.so.0(+0x21280) [0x3f9edc8280]
 3  linux-vdso.so.1(__vdso_rt_sigreturn+0) [0x3fac463800]
 4  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(_Z35nbnxn_kernel_ElecRF_VdwLJ_VgrpF_refPK16NbnxnPairlistCpuPK16nbnxn_atomdata_tPK19interaction_const_tPA3_KdP23nbnxn_atomdata_output_t+0x1ebc) [0x3fab7c1b74]
 5  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(+0x2a98cc) [0x3fab7ba8cc]
 6  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GCCcore/13.2.0/lib64/libgomp.so.1(+0x19d38) [0x3fab105d38]
 7  /cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/lib64/lp64d/libc.so.6(+0x6b0f4) [0x3faafe20f4]
 8  /cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/lib64/lp64d/libc.so.6(+0xb6da0) [0x3fab02dda0]
=================================
[starfive:369548] *** Process received signal ***
[starfive:369548] Signal: Segmentation fault (11)
[starfive:369548] Signal code:  (-6)
[starfive:369548] Failing at address: 0x3e80005a38c
[starfive:369548] [ 0] linux-vdso.so.1(__vdso_rt_sigreturn+0x0)[0x3fac463800]
[starfive:369548] [ 1] /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(_Z35nbnxn_kernel_ElecRF_VdwLJ_VgrpF_refPK16NbnxnPairlistCpuPK16nbnxn_atomdata_tPK19interaction_const_tPA3_KdP23nbnxn_atomdata_output_t+0x1ebc)[0x3fab7c1b74]
[starfive:369548] [ 2] /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(+0x2a98cc)[0x3fab7ba8cc]
[starfive:369548] [ 3] /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GCCcore/13.2.0/lib64/libgomp.so.1(+0x19d38)[0x3fab105d38]
[starfive:369548] [ 4] /cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/lib64/lp64d/libc.so.6(+0x6b0f4)[0x3faafe20f4]
[starfive:369548] [ 5] /cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/lib64/lp64d/libc.so.6(+0xb6da0)[0x3fab02dda0]
[starfive:369548] *** End of error message ***
==== backtrace (tid: 369549) ====
 0  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/UCX/1.15.0-GCCcore-13.2.0/lib64/libucs.so.0(ucs_handle_error+0x1fc) [0x3f7cb9c044]
 1  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/UCX/1.15.0-GCCcore-13.2.0/lib64/libucs.so.0(+0x2111e) [0x3f7cb9c11e]
 2  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/UCX/1.15.0-GCCcore-13.2.0/lib64/libucs.so.0(+0x21280) [0x3f7cb9c280]
 3  linux-vdso.so.1(__vdso_rt_sigreturn+0) [0x3f86236800]
 4  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(_Z35nbnxn_kernel_ElecRF_VdwLJ_VgrpF_refPK16NbnxnPairlistCpuPK16nbnxn_atomdata_tPK19interaction_const_tPA3_KdP23nbnxn_atomdata_output_t+0x1ebc) [0x3f85594b74]
 5  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(+0x2a98cc) [0x3f8558d8cc]
 6  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GCCcore/13.2.0/lib64/libgomp.so.1(GOMP_parallel+0x38) [0x3f84ed19c4]
 7  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(_ZNK18nonbonded_verlet_t23dispatchNonbondedKernelEN3gmx19InteractionLocalityERK19interaction_const_tRKNS0_12StepWorkloadEiNS0_8ArrayRefIKNS0_11BasicVectorIdEEEENS8_IdEESD_P6t_nrnb+0xd4) [0x3f8558e146]
 8  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(+0x7e3556) [0x3f85ac7556]
 9  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(_Z8do_forceP8_IO_FILEPK9t_commrecPK14gmx_multisim_tRK10t_inputrecRKN3gmx18MDModulesNotifiersEPNSA_3AwhEP10gmx_enfrotPNSA_10ImdSessionEP6pull_tlP6t_nrnbP13gmx_wallcyclePK14gmx_localtop_tPA3_KdNSA_19ArrayRefWithPaddingINSA_11BasicVectorIdEEEENSA_8ArrayRefISY_EEPK9history_tPNSA_16ForceBuffersViewEPA3_dPK9t_mdatomsP14gmx_enerdata_tNS10_IST_EEP10t_forcerecRKNSA_21MdrunScheduleWorkloadEPNSA_19VirtualSitesHandlerEPddP9gmx_edsamP24CpuPpLongRangeNonbondedsRK22DDBalanceRegionHandler+0xdf0) [0x3f85ac9970]
10  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(_ZN3gmx15LegacySimulator5do_mdEv+0x39da) [0x3f85bc1870]
11  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(_ZN3gmx8Mdrunner8mdrunnerEv+0x6e60) [0x3f85bebcb6]
12  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgmxapi_mpi_d.so.0(_ZN6gmxapi11SessionImpl3runEv+0x18) [0x3f8621870e]
13  /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgmxapi_mpi_d.so.0(_ZN6gmxapi7Session3runEv+0xe) [0x3f86218854]
14  /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/gmxapi-mpi-test() [0x2ebd2]
15  /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/../lib/libgtest.so.1.13.0(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x30) [0x3f852b17da]
16  /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/../lib/libgtest.so.1.13.0(_ZN7testing4Test3RunEv+0xc2) [0x3f852a26fa]
17  /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/../lib/libgtest.so.1.13.0(_ZN7testing8TestInfo3RunEv+0x11c) [0x3f852a2824]
18  /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/../lib/libgtest.so.1.13.0(_ZN7testing9TestSuite3RunEv+0xbc) [0x3f852a28ea]
19  /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/../lib/libgtest.so.1.13.0(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x1fa) [0x3f852ab23e]
20  /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/../lib/libgtest.so.1.13.0(_ZN7testing8UnitTest3RunEv+0x52) [0x3f852a2a46]
21  /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/gmxapi-mpi-test() [0x26dbe]
22  /cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/lib64/lp64d/libc.so.6(+0x27688) [0x3f84d71688]
23  /cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/lib64/lp64d/libc.so.6(__libc_start_main+0x74) [0x3f84d71730]
24  /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/gmxapi-mpi-test() [0x26fd8]
=================================
[starfive:369549] *** Process received signal ***
[starfive:369549] Signal: Segmentation fault (11)
[starfive:369549] Signal code:  (-6)
[starfive:369549] Failing at address: 0x3e80005a38d
[starfive:369549] [ 0] linux-vdso.so.1(__vdso_rt_sigreturn+0x0)[0x3f86236800]
[starfive:369549] [ 1] /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(_Z35nbnxn_kernel_ElecRF_VdwLJ_VgrpF_refPK16NbnxnPairlistCpuPK16nbnxn_atomdata_tPK19interaction_const_tPA3_KdP23nbnxn_atomdata_output_t+0x1ebc)[0x3f85594b74]
[starfive:369549] [ 2] /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(+0x2a98cc)[0x3f8558d8cc]
[starfive:369549] [ 3] /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GCCcore/13.2.0/lib64/libgomp.so.1(GOMP_parallel+0x38)[0x3f84ed19c4]
[starfive:369549] [ 4] /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(_ZNK18nonbonded_verlet_t23dispatchNonbondedKernelEN3gmx19InteractionLocalityERK19interaction_const_tRKNS0_12StepWorkloadEiNS0_8ArrayRefIKNS0_11BasicVectorIdEEEENS8_IdEESD_P6t_nrnb+0xd4)[0x3f8558e146]
[starfive:369549] [ 5] /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(+0x7e3556)[0x3f85ac7556]
[starfive:369549] [ 6] /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(_Z8do_forceP8_IO_FILEPK9t_commrecPK14gmx_multisim_tRK10t_inputrecRKN3gmx18MDModulesNotifiersEPNSA_3AwhEP10gmx_enfrotPNSA_10ImdSessionEP6pull_tlP6t_nrnbP13gmx_wallcyclePK14gmx_localtop_tPA3_KdNSA_19ArrayRefWithPaddingINSA_11BasicVectorIdEEEENSA_8ArrayRefISY_EEPK9history_tPNSA_16ForceBuffersViewEPA3_dPK9t_mdatomsP14gmx_enerdata_tNS10_IST_EEP10t_forcerecRKNSA_21MdrunScheduleWorkloadEPNSA_19VirtualSitesHandlerEPddP9gmx_edsamP24CpuPpLongRangeNonbondedsRK22DDBalanceRegionHandler+0xdf0)[0x3f85ac9970]
[starfive:369549] [ 7] /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(_ZN3gmx15LegacySimulator5do_mdEv+0x39da)[0x3f85bc1870]
[starfive:369549] [ 8] /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgromacs_mpi_d.so.9(_ZN3gmx8Mdrunner8mdrunnerEv+0x6e60)[0x3f85bebcb6]
[starfive:369549] [ 9] /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgmxapi_mpi_d.so.0(_ZN6gmxapi11SessionImpl3runEv+0x18)[0x3f8621870e]
[starfive:369549] [10] /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/GROMACS/2024.1-foss-2023b/lib/libgmxapi_mpi_d.so.0(_ZN6gmxapi7Session3runEv+0xe)[0x3f86218854]
[starfive:369549] [11] /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/gmxapi-mpi-test[0x2ebd2]
[starfive:369549] [12] /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/../lib/libgtest.so.1.13.0(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x30)[0x3f852b17da]
[starfive:369549] [13] /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/../lib/libgtest.so.1.13.0(_ZN7testing4Test3RunEv+0xc2)[0x3f852a26fa]
[starfive:369549] [14] /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/../lib/libgtest.so.1.13.0(_ZN7testing8TestInfo3RunEv+0x11c)[0x3f852a2824]
[starfive:369549] [15] /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/../lib/libgtest.so.1.13.0(_ZN7testing9TestSuite3RunEv+0xbc)[0x3f852a28ea]
[starfive:369549] [16] /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/../lib/libgtest.so.1.13.0(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x1fa)[0x3f852ab23e]
[starfive:369549] [17] /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/../lib/libgtest.so.1.13.0(_ZN7testing8UnitTest3RunEv+0x52)[0x3f852a2a46]
[starfive:369549] [18] /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/gmxapi-mpi-test[0x26dbe]
[starfive:369549] [19] /cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/lib64/lp64d/libc.so.6(+0x27688)[0x3f84d71688]
[starfive:369549] [20] /cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/lib64/lp64d/libc.so.6(__libc_start_main+0x74)[0x3f84d71730]
[starfive:369549] [21] /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/bin/gmxapi-mpi-test[0x26fd8]
[starfive:369549] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node starfive exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

boegel commented 1 month ago

@bedroge Could that be simply due to insufficient memory on your SiFive Unmatched board?

bedroge commented 1 month ago

@boegel wrote: "Could that be simply due to insufficient memory on your SiFive Unmatched board?" (Correction: it is a StarFive VisionFive 2 board.)

I don't know, didn't see any Killed / OOM messages.

I tried again, this time using the slightly modified easyconfig from https://github.com/easybuilders/easybuild-easyconfigs/pull/20522, and then it failed in the second iteration:

Reading file /tmp/eb/easybuild/build/GROMACS/2024.1/foss-2023b/easybuild_obj/api/gmxapi/cpp/tests/Testing/Temporary/GmxApiTest_RunnerChainedMD.tpr, VERSION 2024.1-EasyBuild_4.9.1 (single precision)

-------------------------------------------------------
Program:     gmxapi-mpi-test, version 2024.1-EasyBuild_4.9.1
Source file: src/gromacs/utility/keyvaluetreeserializer.cpp (line 302)
Function:    gmx::{anonymous}::ValueSerializer::deserialize(gmx::ISerializer*)::<lambda()>
MPI rank:    0 (out of 2)

Assertion failed:
Condition: iter != s_deserializers.end()
Unknown type tag for deserializization

I don't have a clue what that's about, so I just did another attempt, and then the installation completed successfully (all tests of all four iterations passed) 🎉 🤷‍♂️

bedroge commented 1 month ago

GMP easyconfigs have precise: True in toolchainopts, but that doesn't work on RISC-V: the EB framework sets -mno-recip in this case (see https://github.com/easybuilders/easybuild-framework/blob/develop/easybuild/toolchains/compiler/gcc.py#L66C22-L66C31), and that flag is not supported on RISC-V. It isn't supported on Arm either, so there it's overridden with some other flags: https://github.com/easybuilders/easybuild-framework/blob/develop/easybuild/toolchains/compiler/gcc.py#L77. But those are not available on RISC-V either. There doesn't seem to be a good alternative, but @julianmorillo is going to check with a compiler expert. Meanwhile, I tried building without precise: True, and that worked fine; the test step also completed without issues.

Feedback from Julian:

Already talked with the compiler guy; it looks like the flag we need is -fno-reciprocal-math: https://gcc.gnu.org/onlinedocs/gcc-14.1.0/gcc/Optimize-Options.html#index-freciprocal-math. I only have two concerns with it. First, it is generic, so I'm not sure why they are not using it for Intel or Arm instead of the specific ones (or even why the specific ones exist). Second, -fno-reciprocal-math is the default behaviour, so there is no need to set it explicitly (unless they are also using -Ofast or -ffast-math?).
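
The reasoning in this feedback can be written down as a small sketch: on RISC-V, only add -fno-reciprocal-math when some other flag would have switched reciprocal math on. The function name and the flag set below are assumptions made for illustration; this is not what the EasyBuild framework does.

```python
# Flags that enable reciprocal math in GCC, directly or indirectly (illustrative set).
FAST_MATH_FLAGS = {'-Ofast', '-ffast-math', '-funsafe-math-optimizations', '-freciprocal-math'}


def riscv_precise_flags(other_flags):
    """Hypothetical helper: extra flags for 'precise: True' on RISC-V.

    Without any fast-math flag, -fno-reciprocal-math is already the default,
    so nothing is added. Otherwise the flag is returned and would have to be
    placed after the fast-math flag on the command line to take effect.
    """
    if FAST_MATH_FLAGS & set(other_flags):
        return ['-fno-reciprocal-math']
    return []
```

For a plain `-O2` build this returns an empty list, which matches the observation that building without precise: True worked fine.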

bedroge commented 1 month ago

With x264 I'm running into an outdated config.guess issue once again. Here the problem is that its configure script is apparently handcrafted, and hence it doesn't contain the string that EasyBuild uses to determine whether it was generated with Autoconf (see https://github.com/easybuilders/easybuild-easyblocks/blob/develop/easybuild/easyblocks/generic/configuremake.py#L57). If that string is not there, EB will not replace the config.guess with a newer one (see https://github.com/easybuilders/easybuild-easyblocks/blob/develop/easybuild/easyblocks/generic/configuremake.py#L303). So we probably have to do that manually in the easyconfig or with a hook.
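
A minimal sketch of such a detection step is given below. The marker strings are an assumption about what Autoconf writes into generated scripts; the exact strings EasyBuild checks for are in the configuremake.py file linked above, and the function name is hypothetical.

```python
import os
import tempfile

# Assumed marker strings; see configuremake.py for the ones EasyBuild actually uses.
AUTOCONF_MARKERS = (b'Generated by GNU Autoconf', b'generated by Autoconf')


def looks_autoconf_generated(configure_path):
    """Return True if the file exists and contains one of the marker strings."""
    if not os.path.isfile(configure_path):
        return False
    with open(configure_path, 'rb') as handle:
        contents = handle.read()
    return any(marker in contents for marker in AUTOCONF_MARKERS)


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmpdir:
        handcrafted = os.path.join(tmpdir, 'configure')
        with open(handcrafted, 'wb') as handle:
            handle.write(b'#!/bin/bash\n# handcrafted configure script\n')
        # A handcrafted script has no marker, so config.guess would not be refreshed.
        print(looks_autoconf_generated(handcrafted))
```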

edit: the same hook that I used before works fine and allows the installation to complete:

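from easybuild.tools.filetools import copy_file
from easybuild.tools.systemtools import RISCV64, get_cpu_architecture

def pre_configure_hook_x264(self, *args, **kwargs):
    """Copy EasyBuild's up-to-date config.guess into the x264 build directory on RISC-V."""
    if self.name == 'x264' and self.version in ['20231019'] and get_cpu_architecture() == RISCV64:
        config_guess_path = self.obtain_config_guess()
        copy_file(config_guess_path, self.start_dir)

bedroge commented 1 month ago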

And almost the same happens with LAME: it looks like the configure_cmd_prefix (here: https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/l/LAME/LAME-3.100-GCCcore-13.2.0.eb#L29) breaks the os.path.exists(configure_command) in the easyblock, which makes it fail to recognize that this actually is an Autoconf-generated configure script. Or is it because it's running autoreconf in preconfigopts? Either way, the config.guess still doesn't get updated, but the same hook works for this one as well.

julianmorillo commented 1 month ago

libdwarf-0.9.2 installed. This is the corresponding tar file to be ingested: eessi-20240402-software-linux-riscv64-generic-1716472182.tar.gz

bedroge commented 1 month ago

With x265 I ran into the following issue:

/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/bin/ld: encoder/CMakeFiles/encoder.dir/analysis.cpp.o: relocation R_RISCV_HI20 against `_ZN4x26510g_log2SizeE' can not be used when making a shared object; recompile with -fPIC
/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/bin/ld: encoder/CMakeFiles/encoder.dir/search.cpp.o: relocation R_RISCV_HI20 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/bin/ld: encoder/CMakeFiles/encoder.dir/bitcost.cpp.o: relocation R_RISCV_HI20 against `a local symbol' can not be used when making a shared object; recompile with -fPIC

<SNIP, more of those....>

/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/bin/ld: common/CMakeFiles/common.dir/deblock.cpp.o: relocation R_RISCV_HI20 against `_ZN4x26515g_zscanToRasterE' can not be used when making a shared object; recompile with -fPIC
/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/bin/ld: common/CMakeFiles/common.dir/scaler.cpp.o: relocation R_RISCV_HI20 against `_ZTVN4x26512ScalerFilterE' can not be used when making a shared object; recompile with -fPIC
/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/bin/ld: unresolvable R_RISCV_CALL_PLT relocation against symbol `log@@GLIBC_2.27'
/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/bin/ld: unresolvable R_RISCV_CALL_PLT relocation against symbol `__cxa_atexit@@GLIBC_2.27'
/tmp/eb-xqbykceq/tmp9gre4wcp/rpath_wrappers/ld_wrapper/ld: line 69: 234648 Segmentation fault      /cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/bin/ld "${CMD_ARGS[@]}"
collect2: error: ld returned 139 exit status

This can be solved by adding -DENABLE_PIC=ON to the configopts (found that in the Gentoo ebuild file: https://github.com/gentoo/gentoo/blob/master/media-libs/x265/x265-3.5-r2.ebuild).
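
If this were done from a hook rather than by editing the easyconfig, the option would have to be appended without duplicating it. The helper below is a hypothetical, EasyBuild-independent sketch of that step; it assumes configopts is a single whitespace-separated string with no quoted values that contain spaces.

```python
def with_configopt(configopts, option):
    """Return configopts with option appended, unless it is already present."""
    parts = configopts.split()
    if option not in parts:
        parts.append(option)
    return ' '.join(parts)


if __name__ == '__main__':
    # '-DEXAMPLE=1' is a made-up existing option, used only for the demonstration.
    print(with_configopt('-DEXAMPLE=1', '-DENABLE_PIC=ON'))
```

Applying the helper twice gives the same result as applying it once, so a hook using it could run repeatedly without piling up duplicate options.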

julianmorillo commented 1 month ago

Boost-1.83.0-GCC-13.2.0 has already been installed (this is the last Extrae dependency). The corresponding TAR file can be downloaded here: https://b2drop.bsc.es/index.php/s/Q3rMCXGX4r4SePQ

bedroge commented 1 month ago

FFmpeg failed because of:

AR      libavcodec/libavcodec.a
HOSTLD  doc/print_options
LD      libavutil/libavutil.so.58
GENTEXI doc/avoptions_format.texi
GENTEXI doc/avoptions_codec.texi
HTML    doc/ffmpeg.html
makeinfo: error parsing ./doc/t2h.pm: Undefined subroutine &Texinfo::Config::set_from_init_file called at ./doc/t2h.pm line 24.
make: *** [doc/Makefile:70: doc/ffmpeg.html] Error 1
make: *** Waiting for unfinished jobs....

It looks like it needs texinfo for building the HTML pages, but this is not listed as a dependency in the easyconfig. We do have texinfo in the compat layer, but it is version 7.1, and apparently that version has issues: https://groups.google.com/g/linux.debian.bugs.dist/c/1f_eeuQd_2U. The compat layers of x86_64 and aarch64 have texinfo 7.0.3, which explains why we haven't seen the same issue there.

This should be fixed upstream by adding texinfo as a dependency or, preferably, by adding --disable-htmlpages to the configopts (I've tested this and it allowed the installation to complete).

edit: done in https://github.com/easybuilders/easybuild-easyconfigs/pull/20686.

julianmorillo commented 1 month ago

Installation of Extrae is giving me this error:

checking for binutils... notfound
configure: libbfd library directory: /usr/lib/riscv64-linux-gnu
configure: Warning! Cannot find the libiberty library in the given binutils home. Please, make sure that the binutils packages is correctly installed. If you have installed the binutils package by hand from their source code, make sure that libiberty is installed. Some releases of the binutils package do not install the libibery even invoking make install. The library should be within the libiberty directory within the binutils source tree.
checking for bfd.h... no
configure: error: You have asked to gather call-site information through --with-unwind which must be translated using binutils, but either libbfd or libiberty are not found. Please make sure that the binutils-dev package is installed and specify where to find these libraries through --with-binutils. The latest source can be downloaded from http://www.gnu.org/software/binutils
 (at easybuild/tools/run.py:682 in parse_cmd_output)
== 2024-05-29 16:41:10,208 build_log.py:267 INFO ... (took 4 mins 22 secs)
== 2024-05-29 16:41:10,231 config.py:699 DEBUG software install path as specified by 'installpath' and 'subdir_software': /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software
== 2024-05-29 16:41:10,232 filetools.py:2013 INFO Removing lock /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/.locks/_cvmfs_riscv.eessi.io_versions_20240402_software_linux_riscv64_generic_software_Extrae_4.1.5-gompi-2023b.lock...
== 2024-05-29 16:41:10,235 filetools.py:383 INFO Path /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/.locks/_cvmfs_riscv.eessi.io_versions_20240402_software_linux_riscv64_generic_software_Extrae_4.1.5-gompi-2023b.lock successfully removed.
== 2024-05-29 16:41:10,236 filetools.py:2017 INFO Lock removed: /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/.locks/_cvmfs_riscv.eessi.io_versions_20240402_software_linux_riscv64_generic_software_Extrae_4.1.5-gompi-2023b.lock
== 2024-05-29 16:41:10,237 easyblock.py:4291 WARNING build failed (first 300 chars): cmd " ./configure --prefix=/cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/Extrae/4.1.5-gompi-2023b  --build=riscv64-unknown-linux-gnu  --host=riscv64-unknown-linux-gnu  --with-mpi=/cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/OpenMPI
== 2024-05-29 16:41:10,239 easyblock.py:328 INFO Closing log for application name Extrae version 4.1.5

I'm now trying to install binutils (although I thought it was already provided by the compat layer).

ocaisa commented 1 month ago

@julianmorillo See https://github.com/EESSI/software-layer/pull/554#issuecomment-2099376096

You probably need the full hook in https://github.com/EESSI/software-layer/pull/554/commits/41149ac060b7580f2b15d3e04908ffabe207e046

julianmorillo commented 1 month ago

Thanks, @ocaisa!!
Yes, both the hook and a patch are needed. I have submitted a PR with such a patch: https://github.com/easybuilders/easybuild-easyconfigs/pull/20690
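
For reference, a rough sketch of how the binutils directory in the compat layer could be located from Python. It assumes the Gentoo Prefix layout `usr/lib*/binutils/<triplet>/<version>/` under the compat layer prefix; the function name and the `eprefix` argument are made up here, and what exactly gets passed on to Extrae's configure is handled by the hook linked above, not by this snippet.

```python
import glob
import os


def find_binutils_libdir(eprefix):
    """Return the first compat-layer directory that contains a libbfd file, or None."""
    # assumed layout: <eprefix>/usr/lib*/binutils/<triplet>-linux-gnu/<version>/
    pattern = os.path.join(eprefix, 'usr', 'lib*', 'binutils', '*-linux-gnu', '*')
    for candidate in sorted(glob.glob(pattern)):
        if glob.glob(os.path.join(candidate, 'libbfd*')):
            return candidate
    return None
```

Calling it with the compat layer prefix (for example the value of `$EPREFIX` inside the prefix environment) should return the directory holding libbfd, or `None` if nothing matches.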

julianmorillo commented 1 month ago

@bedroge , could we add this hook https://github.com/EESSI/software-layer/commit/41149ac060b7580f2b15d3e04908ffabe207e046 to the riscv.eessi.io software-layer?

bedroge commented 1 month ago

> @bedroge , could we add this hook 41149ac to the riscv.eessi.io software-layer?

The hooks file is stored on GitHub (it will be picked up by EasyBuild when doing the actual builds), so we just need the PR from @boegel to be merged in order to have it available. But feel free to already use it locally for your Extrae builds.

julianmorillo commented 1 month ago

I have just opened a PR regarding Extrae: https://github.com/easybuilders/easybuild-easyblocks/pull/3339

julianmorillo commented 1 month ago

The first failing tests of Extrae on RISC-V are:

make[4]: Leaving directory '/tmp/eb/easybuild/build/Extrae/4.1.6/gompi-2023b/extrae-4.1.6/tests/functional/launcher'
make[3]: Leaving directory '/tmp/eb/easybuild/build/Extrae/4.1.6/gompi-2023b/extrae-4.1.6/tests/functional/launcher'
Making check in tracer
make[3]: Entering directory '/tmp/eb/easybuild/build/Extrae/4.1.6/gompi-2023b/extrae-4.1.6/tests/functional/tracer'
Making check in OTHER
make[4]: Entering directory '/tmp/eb/easybuild/build/Extrae/4.1.6/gompi-2023b/extrae-4.1.6/tests/functional/tracer/OTHER'
make  auto-init-fini define_event_type_gen_pcf define_event_type_gen_pcf_f
make[5]: Entering directory '/tmp/eb/easybuild/build/Extrae/4.1.6/gompi-2023b/extrae-4.1.6/tests/functional/tracer/OTHER'
  CC       auto_init_fini-auto-init-fini.o
  CCLD     auto-init-fini
  CC       define_event_type_gen_pcf-define_event_type_gen_pcf.o
  CCLD     define_event_type_gen_pcf
  FC       ../../../../include/define_event_type_gen_pcf_f-extrae_module.o
  FC       define_event_type_gen_pcf_f-define_event_type_gen_pcf.o
  FCLD     define_event_type_gen_pcf_f
make[5]: Leaving directory '/tmp/eb/easybuild/build/Extrae/4.1.6/gompi-2023b/extrae-4.1.6/tests/functional/tracer/OTHER'
make  check-TESTS
make[5]: Entering directory '/tmp/eb/easybuild/build/Extrae/4.1.6/gompi-2023b/extrae-4.1.6/tests/functional/tracer/OTHER'
make[6]: Entering directory '/tmp/eb/easybuild/build/Extrae/4.1.6/gompi-2023b/extrae-4.1.6/tests/functional/tracer/OTHER'
FAIL: auto-init-fini.sh
FAIL: define_event_type_gen_pcf.sh
FAIL: define_event_type_gen_pcf_f.sh
============================================================================
Testsuite summary for Extrae 4.1.6
============================================================================
# TOTAL: 3
# PASS:  0
# SKIP:  0
# XFAIL: 0
# FAIL:  3
# XPASS: 0
# ERROR: 0
============================================================================
See tests/functional/tracer/OTHER/test-suite.log
Please report to tools@bsc.es
============================================================================
make[6]: *** [Makefile:1232: test-suite.log] Error 1
make[6]: Leaving directory '/tmp/eb/easybuild/build/Extrae/4.1.6/gompi-2023b/extrae-4.1.6/tests/functional/tracer/OTHER'
make[5]: *** [Makefile:1340: check-TESTS] Error 2
make[5]: Leaving directory '/tmp/eb/easybuild/build/Extrae/4.1.6/gompi-2023b/extrae-4.1.6/tests/functional/tracer/OTHER'
make[4]: *** [Makefile:1427: check-am] Error 2
make[4]: Target 'check' not remade because of errors.

julianmorillo commented 1 month ago

Looking into the log file (tests/functional/tracer/OTHER/test-suite.log), all three failures are caused by:

error while loading shared libraries: libbfd-2.42.0.gentoo-sys-devel-binutils-st.so: cannot open shared object file: No such file or directory

ocaisa commented 1 month ago

Looking at the (typical) path of this library:

.../usr/lib64/binutils/x86_64-pc-linux-gnu/2.40/libbfd-2.40.0.gentoo-sys-devel-binutils-st.so

and then where our linker looks:

{EESSI 2023.06} ocaisa@LAPTOP-O6HF2IKC:~$ ld --verbose | grep SEARCH_DIR | tr -s ' ;' \\012
SEARCH_DIR("/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/x86_64-pc-linux-gnu/lib64")
SEARCH_DIR("/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/lib64/binutils/x86_64-pc-linux-gnu/2.4064")
SEARCH_DIR("/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/local/lib64")
SEARCH_DIR("/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib64")
SEARCH_DIR("/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/lib64")
SEARCH_DIR("/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/x86_64-pc-linux-gnu/lib")
SEARCH_DIR("/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/lib64/binutils/x86_64-pc-linux-gnu/2.40")
SEARCH_DIR("/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/local/lib")
SEARCH_DIR("/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib")
SEARCH_DIR("/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/lib")

This should be found, so I'm guessing this is an issue specific to RISC-V? Possibly related to the string that corresponds to x86_64-pc-linux-gnu?

ocaisa commented 1 month ago

{EESSI 2023.06} ocaisa@LAPTOP-O6HF2IKC:~$ ls /cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/lib64/binutils/riscv64-pc-linux-gnu/2.42/libbfd-2.42.0.gentoo-sys-devel-binutils-st.so
/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/lib64/binutils/riscv64-pc-linux-gnu/2.42/libbfd-2.42.0.gentoo-sys-devel-binutils-st.so

The library does exist, but perhaps the linker is not configured correctly to search there? I don't have access to riscv, but can you init EESSI and run

ld --verbose | grep SEARCH_DIR | tr -s ' ;' \\012

(either that or for some reason the wrong runtime linker is being used)

bedroge commented 1 month ago

$ ld --verbose | grep SEARCH_DIR | tr -s ' ;' \\012
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/riscv64-pc-linux-gnu/lib64/lp64d")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/riscv64-pc-linux-gnu/lib64")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/riscv64-pc-linux-gnu/lib6464/lp64d")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/lib64/binutils/riscv64-pc-linux-gnu/2.4264/lp64d")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/lib64/binutils/riscv64-pc-linux-gnu/2.4264")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/local/lib64/lp64d")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/local/lib64")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/lib64/lp64d")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/lib64")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/lib64/lp64d")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/lib64")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/riscv64-pc-linux-gnu/lib")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/lib64/binutils/riscv64-pc-linux-gnu/2.42")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/local/lib")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/lib")
SEARCH_DIR("/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/usr/lib")

It's actually in there (4th from the bottom of the list).
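
The same check can be scripted, which makes it easier to compare architectures. This is a sketch only: it assumes the `SEARCH_DIR("...")` format shown above (optionally with a leading `=` inside the quotes), and both function names are made up. Note that this only covers the link-time search path; where a library is found at run time is decided by the runtime loader.

```python
import os
import re

# one SEARCH_DIR("...") entry; several may appear on a single line
SEARCH_DIR_RE = re.compile(r'SEARCH_DIR\("=?([^"]+)"\)')


def linker_search_dirs(ld_verbose_output):
    """Return the SEARCH_DIR entries from `ld --verbose` output, in order."""
    return SEARCH_DIR_RE.findall(ld_verbose_output)


def find_library(ld_verbose_output, libname):
    """Return the full paths where libname exists in one of the search dirs."""
    hits = []
    for directory in linker_search_dirs(ld_verbose_output):
        candidate = os.path.join(directory, libname)
        if os.path.exists(candidate):
            hits.append(candidate)
    return hits
```

The input would be the output of `ld --verbose`, e.g. captured with `subprocess.run(['ld', '--verbose'], capture_output=True, text=True).stdout`.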

julianmorillo commented 1 month ago

I did a clean installation and the problem disappeared. Most probably I messed things up while running tests.

bedroge commented 3 weeks ago

PyTorch initially failed due to:

/tmp/eb/easybuild/build/PyTorch/2.1.2/foss-2023b/pytorch-v2.1.2/third_party/sleef/src/arch/helperpurec_scalar.h:69:2: error: #error FP_FAST_FMA or FP_FAST_FMAF not defined
   69 | #error FP_FAST_FMA or FP_FAST_FMAF not defined

This is due to missing RISC-V support in this sleef version. Support has been added in newer versions, but PyTorch is still sticking to a quite old commit. After backporting the changes from https://github.com/shibatch/sleef/pull/477 (do we also need https://github.com/shibatch/sleef/pull/503?), the build completed with only 65 test failures:

  >> command completed: exit 1, ran in 03h59m00s
== ... (took 28 hours 3 mins 56 secs)
== FAILED: Installation ended unsuccessfully (build directory: /tmp/eb/easybuild/build/PyTorch/2.1.2/foss-2023b): build failed (first 300 chars): 65 test failures, 0 test errors (out of 206539):
Failed tests (suites/files):
backends/xeon/test_launch 1/1 (1 failed, 1 passed, 2 rerun)
dynamo/test_after_aot 1/1 (1 failed, 1 passed, 2 rerun)
dynamo/test_backends 1/1 (1 failed, 9 passed, 4 skipped, 2 rerun)
dynamo/test_logging 1/1 (8 failed, 27 passed, 2 skipped, 16 rerun)
dynamo/test_misc 1/1 (7 failed, 289 passed, 9 skipped, 2 xfailed, 14 rerun)
dynamo/test_modules 1/1 (1 failed, 88 passed, 1 skipped, 2 rerun)
dynamo/test_repros 1/1 (3 failed, 132 passed, 4 skipped, 3 xfailed, 6 rerun)
dynamo/test_unspec 1/1 (2 failed, 15 passed, 1 skipped, 1 xfailed, 4 rerun)
dynamo/test_dynamic_shapes 1/1 (20 failed, 2010 passed, 52 skipped, 32 xfailed, 40 rerun)
functorch/test_eager_transforms 1/1 (1 failed, 343 passed, 3 skipped, 1 xfailed, 2 rerun)
inductor/test_config 1/1 (1 failed, 10 passed, 1 skipped, 2 rerun)
inductor/test_minifier 1/1 (5 failed, 3 skipped, 10 rerun)
inductor/test_mmdecomp 1/1 (8 failed, 17 passed, 16 rerun)
test_content_store 1/1 (3 failed, 6 rerun)
test_binary_ufuncs 1/1 (2 failed, 11793 passed, 966 skipped, 24 xfailed, 4 rerun)
test_cpp_extensions_open_device_registration 1/1 (1 failed, 2 rerun)
+ test_ops_jit 1/1
+ test_torch 1/1 (at easybuild/easyblocks/p/pytorch.py:504 in test_step)

For now, we can ignore them by using --ignore-test-failure.
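
To keep track of which suites fail between attempts, the summary above can be parsed with a few lines of Python. This is just a sketch that assumes the line format shown in the log (`<suite> N/M (X failed, ...)`); the function name is made up, and lines without a failure count (like the last two) are skipped.

```python
import re

# matches e.g. "dynamo/test_logging 1/1 (8 failed, 27 passed, 2 skipped, 16 rerun)"
SUITE_RE = re.compile(r'^\+?\s*(?P<suite>\S+) \d+/\d+ \((?P<counts>[^)]*)\)')


def failed_per_suite(summary_lines):
    """Map suite name to its number of failed tests, for lines that report one."""
    result = {}
    for line in summary_lines:
        match = SUITE_RE.match(line.strip())
        if not match:
            continue
        failed = re.search(r'(\d+) failed', match.group('counts'))
        if failed:
            result[match.group('suite')] = int(failed.group(1))
    return result
```

Summing the values of the returned dictionary gives the total for the listed suites; for the list above the per-suite counts add up to the reported 65.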

bedroge commented 2 weeks ago

I tried building numpy 1.26.4 (which is part of https://github.com/easybuilders/easybuild-easyconfigs/pull/20830), but it still has the same six tests failing. I've opened an issue on the numpy github: https://github.com/numpy/numpy/issues/26734

This PR fixes the test failures: https://github.com/easybuilders/easybuild-easyconfigs/pull/20847. However, it's now running out of memory on my StarFive when doing the scipy tests :see_no_evil:

bedroge commented 1 week ago

ESPResSo 4.2.2 has been installed; it only required a toolchain bump to 2023b: https://github.com/easybuilders/easybuild-easyconfigs/pull/20878

bedroge commented 1 week ago

The installation of ReFrame 4.3.3 fails in the sanity check:

== 2024-06-21 08:00:50,559 easyblock.py:3638 WARNING Sanity check: sanity check command reframe -V exited with code 1 (output: Traceback (most recent call last):
  File "/cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/ReFrame/4.3.3/bin/reframe", line 19, in <module>
    import reframe.frontend.cli as cli  # noqa: F401, F403
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/ReFrame/4.3.3/lib/python3.11/site-packages/reframe/frontend/cli.py", line 21, in <module>
    import reframe.frontend.argparse as argparse
  File "/cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/ReFrame/4.3.3/lib/python3.11/site-packages/reframe/frontend/argparse.py", line 6, in <module>
    import argcomplete
ModuleNotFoundError: No module named 'argcomplete'

This should have been installed (to the external dir) by the bootstrap script, but apparently that failed. Scrolling back through the log file, I found:

==> [+pygelf] python3 -m pip install --no-cache-dir -q -r /tmp/eb-myh20jsw/tmp.V9e2PYU6tR --target=external/ --upgrade
DEPRECATION: Loading egg at /cvmfs/riscv.eessi.io/versions/20240402/software/linux/riscv64/generic/software/ReFrame/4.3.3/lib/python3.11/site-packages/pip-21.3.1-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
  error: subprocess-exited-with-error

   Getting requirements to build wheel did not run successfully.
   exit code: 1
  > [54 lines of output]
      running egg_info
      writing lib/PyYAML.egg-info/PKG-INFO
      writing dependency_links to lib/PyYAML.egg-info/dependency_links.txt
      writing top-level names to lib/PyYAML.egg-info/top_level.txt
      Traceback (most recent call last):
        File "/tmp/eb/easybuild/build/ReFrame/4.3.3/system-system/reframe/reframe-4.3.3/external/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/tmp/eb/easybuild/build/ReFrame/4.3.3/system-system/reframe/reframe-4.3.3/external/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/eb/easybuild/build/ReFrame/4.3.3/system-system/reframe/reframe-4.3.3/external/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 327, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=[])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 297, in _get_build_requires
          self.run_setup()
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 313, in run_setup
          exec(code, locals())
        File "<string>", line 288, in <module>
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/__init__.py", line 103, in setup
          return distutils.core.setup(**attrs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 184, in setup
          return run_commands(dist)
                 ^^^^^^^^^^^^^^^^^^
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 200, in run_commands
          dist.run_commands()
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
          self.run_command(cmd)
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/dist.py", line 976, in run_command
          super().run_command(command)
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/command/egg_info.py", line 321, in run
          self.find_sources()
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/command/egg_info.py", line 329, in find_sources
          mm.run()
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/command/egg_info.py", line 550, in run
          self.add_defaults()
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/command/egg_info.py", line 588, in add_defaults
          sdist.add_defaults(self)
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/command/sdist.py", line 102, in add_defaults
          super().add_defaults()
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/sdist.py", line 250, in add_defaults
          self._add_defaults_ext()
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/sdist.py", line 335, in _add_defaults_ext
          self.filelist.extend(build_ext.get_source_files())
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "<string>", line 204, in get_source_files
        File "/tmp/eb-myh20jsw/pip-build-env-afkyp1z7/overlay/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 107, in __getattr__
          raise AttributeError(attr)
      AttributeError: cython_sources
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

 Getting requirements to build wheel did not run successfully.
 exit code: 1

Based on a quick search, I suspect it's related to having Cython 3 in the RISC-V compat layer, while we have 0.29 in the compat layers of the other architectures.

edit: Apparently PyYAML is causing the issue: none of its versions before 6.0.1 work with Cython 3. They "fixed" that in 6.0.1 by pinning the Cython requirement to <3: https://github.com/yaml/pyyaml/issues/736 and https://github.com/yaml/pyyaml/pull/702

Solved in https://github.com/easybuilders/easybuild-easyconfigs/pull/20879
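
The rule of thumb from the PyYAML issue can be written down as a small check. This is a sketch: the function names are made up, and it only handles plain dotted numeric versions (no pre-release suffixes).

```python
def version_tuple(version):
    """Turn a plain dotted version such as '6.0.1' into (6, 0, 1)."""
    return tuple(int(part) for part in version.split('.'))


def pyyaml_breaks_with_cython(pyyaml_version, cython_version):
    """True if this PyYAML/Cython pair is expected to hit the cython_sources error."""
    old_pyyaml = version_tuple(pyyaml_version) < (6, 0, 1)
    cython3_or_newer = version_tuple(cython_version) >= (3,)
    return old_pyyaml and cython3_or_newer
```

PyYAML 6.0 combined with Cython 3.0.0 would be flagged by this check, while PyYAML 6.0.1 or a Cython 0.29 release would not.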

bedroge commented 20 hours ago

@julianmorillo was running into the issue described/discussed at https://gitlab.com/eessi/support/-/issues/32. The fix was merged shortly after we had built the RISC-V compat layer. I've fixed it manually by taking a slightly modified version of https://github.com/EESSI/compatibility-layer/blob/main/scripts/install-sssd-and-nss-pam-ldapd-EESSI.IO-2023.06_2024-04-15.sh (only some small version changes for some packages) and running it in our build container. The tarball with the new compat layer has been ingested. The diff in terms of packages:

diff in installed packages:
--- /tmp/tmp.5VJG9y6yvJ/installed-pkgs-pre-update.txt   2024-07-02 12:21:52.964193080 +0000
+++ /tmp/tmp.5VJG9y6yvJ/installed-pkgs-post-update.txt  2024-07-02 14:19:14.183405792 +0000
@@ -244,6 +244,7 @@
 net-misc/curl-8.7.1-r1::gentoo
 net-misc/rsync-3.2.7-r4::gentoo
 net-misc/wget-1.21.4::gentoo
+net-nds/openldap-2.6.6-r2::gentoo
 perl-core/File-Temp-0.231.100::gentoo
 perl-core/Math-BigInt-1.999.842::gentoo
 sec-keys/openpgp-keys-gentoo-release-20230329::gentoo
@@ -275,8 +276,10 @@
 sys-apps/texinfo-7.1-r1::gentoo
 sys-apps/util-linux-2.39.3-r6::gentoo
 sys-apps/which-2.21::gentoo
+sys-auth/nss-pam-ldapd-0.9.12-r2::eessi
 sys-auth/pambase-20240128::gentoo
 sys-auth/passwdqc-2.0.3-r1::gentoo
+sys-auth/sssd-2.8.2::eessi
 sys-cluster/lmod-8.7.23::gentoo
 sys-cluster/rdma-core-50.0::gentoo
 sys-devel/bc-1.07.1-r6::gentoo
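
A diff like the one above can also be produced from two package lists in Python, which is handy when comparing compat layers across architectures. This is only a sketch using the standard `difflib` module; the function names are made up, and the file labels simply mirror the ones in the diff above.

```python
import difflib


def package_diff(before, after):
    """Unified diff (as a list of lines) between two lists of installed packages."""
    return list(difflib.unified_diff(
        sorted(before), sorted(after),
        fromfile='installed-pkgs-pre-update.txt',
        tofile='installed-pkgs-post-update.txt',
        lineterm=''))


def added_packages(before, after):
    """Packages present after the update but not before."""
    return sorted(set(after) - set(before))
```

For the update described here, `added_packages` would be expected to list only the openldap, nss-pam-ldapd and sssd entries.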