FFTW / fftw3


Occasional failures in MPI part of the unit tests on ARM neoverse_v1 #334

Closed: casparvl closed this 6 days ago

casparvl commented 10 months ago

I've built FFTW on an ARM neoverse_v1 architecture. However, when running the test suite (make check) I get occasional failures. The strange thing is: they don't happen consistently, and they don't always happen in the same test. Two example (partial) outputs I've seen:

 perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10  --mpi "mpirun -np 3 `pwd`/mpi-bench"
Executing "mpirun -np 3 /tmp/bot/easybuild/build/FFTWMPI/3.3.10/gompi-2022a/fftw-3.3.10/mpi/mpi-bench --verbose=1   --verify 'ofc]12x13' --verify 'ifc]12x13' --verify 'okd]30o11x12o10' --verify 'ikd]3
0o11x12o10' --verify 'obr9x12' --verify 'ibr9x12' --verify 'ofr9x12' --verify 'ifr9x12' --verify 'obc9x12' --verify 'ibc9x12' --verify 'ofc9x12' --verify 'ifc9x12' --verify 'ok9bx9o11v15' --verify 'ik
9bx9o11v15' --verify 'obr[12x8x8x4v1' --verify 'ibr[12x8x8x4v1' --verify 'obc[12x8x8x4v1' --verify 'ibc[12x8x8x4v1' --verify 'ofc[12x8x8x4v1' --verify 'ifc[12x8x8x4v1' --verify 'ok[36e01x13e10' --veri
fy 'ik[36e01x13e10' --verify 'ofrd]4x12x11' --verify 'ifrd]4x12x11' --verify 'obcd]4x12x11' --verify 'ibcd]4x12x11' --verify 'ofcd]4x12x11' --verify 'ifcd]4x12x11' --verify 'okd]5e10x7hx8hx9o10' --ver
ify 'ikd]5e10x7hx8hx9o10' --verify 'obr8x12x7v8' --verify 'ibr8x12x7v8' --verify 'ofr8x12x7v8' --verify 'ifr8x12x7v8' --verify 'obc8x12x7v8' --verify 'ibc8x12x7v8' --verify 'ofc8x12x7v8' --verify 'ifc
8x12x7v8' --verify 'obr[8x9x12' --verify 'ibr[8x9x12' --verify 'obc[8x9x12' --verify 'ibc[8x9x12' --verify 'ofc[8x9x12' --verify 'ifc[8x9x12'"
ofc]12x13 1.80122e-07 3.0542e-07 2.03125e-07
ifc]12x13 1.89156e-07 3.0542e-07 1.90368e-07
okd]30o11x12o10 2.4877e-07 1.78875e-06 2.25913e-07
ikd]30o11x12o10 2.27187e-07 1.55025e-06 2.10374e-07
obr9x12 1.6027e-07 2.75302e-07 1.52021e-07
ibr9x12 1.8753e-07 2.75302e-07 1.70429e-07
ofr9x12 1.47446e-07 1.83535e-07 1.44304e-07
ifr9x12 1.82756e-07 1.83535e-07 2.08951e-07
obc9x12 1.50418e-07 2.75302e-07 1.49625e-07
ibc9x12 1.97103e-07 2.75302e-07 1.70033e-07
ofc9x12 1.84721e-07 3.6707e-07 1.89722e-07
ifc9x12 1.99706e-07 2.75302e-07 1.81583e-07
ok9bx9o11v15 2.2058e-07 9.25407e-07 1.84591e-07
ik9bx9o11v15 1.92094e-07 8.1372e-07 1.5817e-07
obr[12x8x8x4v1 2.12069e-07 6.1943e-07 2.02999e-07
ibr[12x8x8x4v1 2.55186e-07 4.81779e-07 1.97777e-07
obc[12x8x8x4v1 2.02179e-07 6.1943e-07 1.96302e-07
ibc[12x8x8x4v1 1.86982e-07 4.81779e-07 2.00481e-07
ofc[12x8x8x4v1 2.03993e-07 5.50604e-07 1.95287e-07
ifc[12x8x8x4v1 2.00073e-07 4.81779e-07 1.75354e-07
ok[36e01x13e10 2.15839e-07 3.39746e-06 2.28886e-07
ik[36e01x13e10 2.27399e-07 3.39746e-06 2.38741e-07
ofrd]4x12x11 1.9559e-07 3.32027e-07 1.8842e-07
ifrd]4x12x11 1.82564e-07 4.15033e-07 1.70037e-07
obcd]4x12x11 1.82565e-07 4.15033e-07 1.87027e-07
ibcd]4x12x11 1.83111e-07 4.15033e-07 1.77082e-07
ofcd]4x12x11 1.67691e-07 4.15033e-07 1.91901e-07
ifcd]4x12x11 1.82908e-07 4.15033e-07 1.86803e-07
okd]5e10x7hx8hx9o10 2.03314e-07 6.73532e-06 2.24877e-07
ikd]5e10x7hx8hx9o10 1.91962e-07 5.63264e-06 3.09357e-07
obr8x12x7v8 2.00048e-07 3.31099e-07 1.55861e-07
ibr8x12x7v8 2.16178e-07 3.31099e-07 1.60998e-07
ofr8x12x7v8 2.21604e-07 2.48324e-07 2.02046e-07
ifr8x12x7v8 1.70646e-07 2.48324e-07 2.01528e-07
obc8x12x7v8 1.64523e-07 3.31099e-07 1.99198e-07
ibc8x12x7v8 1.70938e-07 3.31099e-07 1.82496e-07
ofc8x12x7v8 1.7008e-07 3.31099e-07 2.02178e-07
Found relative error 5.726162e-01 (linear)
       0  -2.821195125580  -4.044055461884    -2.821195363998  -4.044055461884
       1  -0.177852988243   1.493856668472    -0.177853316069   1.493856906891
       2  -2.473095178604   6.017243385315    -2.473095178604   6.017243385315
       3   2.993138074875  -3.688166141510     2.993136882782  -3.688166141510
...

i.e. an error in the 3-CPU tests. And:

perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10  --mpi "mpirun -np 4 `pwd`/mpi-bench"
Executing "mpirun -np 4 /tmp/bot/easybuild/build/FFTWMPI/3.3.10/gompi-2022a/fftw-3.3.10/mpi/mpi-bench --verbose=1   --verify 'okd]9e11x10e01x3o10x8e01' --verify 'ikd]9e11x10e01x3o10x8e01' --verify 'of
r]3x14x13x13' --verify 'ifr]3x14x13x13' --verify 'obc]3x14x13x13' --verify 'ibc]3x14x13x13' --verify 'ofc]3x14x13x13' --verify 'ifc]3x14x13x13' --verify 'ok[5o01x8o11x6o10' --verify 'ik[5o01x8o11x6o10
' --verify 'ofrd]6x8x4x3' --verify 'ifrd]6x8x4x3' --verify 'obcd]6x8x4x3' --verify 'ibcd]6x8x4x3' --verify 'ofcd]6x8x4x3' --verify 'ifcd]6x8x4x3' --verify 'obr[7x3x5x13v2' --verify 'ibr[7x3x5x13v2' --
verify 'obc[7x3x5x13v2' --verify 'ibc[7x3x5x13v2' --verify 'ofc[7x3x5x13v2' --verify 'ifc[7x3x5x13v2' --verify 'okd[15bx3hv4' --verify 'ikd[15bx3hv4' --verify 'ofrd]8x10x8v2' --verify 'ifrd]8x10x8v2'
--verify 'obcd]8x10x8v2' --verify 'ibcd]8x10x8v2' --verify 'ofcd]8x10x8v2' --verify 'ifcd]8x10x8v2' --verify 'ofr]7x9x3x7' --verify 'ifr]7x9x3x7' --verify 'obc]7x9x3x7' --verify 'ibc]7x9x3x7' --verify
 'ofc]7x9x3x7' --verify 'ifc]7x9x3x7' --verify 'ok[9e10x4o10x9o01x10o11' --verify 'ik[9e10x4o10x9o01x10o11' --verify 'ofr]9x11v16' --verify 'ifr]9x11v16' --verify 'obc]9x11v16' --verify 'ibc]9x11v16'
--verify 'ofc]9x11v16' --verify 'ifc]9x11v16'"
okd]9e11x10e01x3o10x8e01 4.19727e-16 3.21244e-14 4.61644e-16
ikd]9e11x10e01x3o10x8e01 3.8748e-16 3.47713e-14 4.81475e-16
ofr]3x14x13x13 4.63166e-16 1.18073e-15 8.89357e-16
Found relative error 2.591853e-01 (linear)
       0  -6.279300675262   0.000000000000   -10.738055078327   0.000000000000
       1  12.264581889801  10.133570174384    13.770492030446   4.771201691478
       2  13.995197474331   6.621093673792    18.700847526776   7.496929956226
       3   3.282732421788  -9.655418395947     1.816720801117  -5.446778948301
       4   0.493429414824   8.545142969216    -5.379298513227  10.196493253825
       5   7.279140203767   3.077925618575     0.816774589334   1.903078880407
...

i.e. a failure in the 4-CPU part of the tests.

When run interactively, I seem to get these failures about 1 out of 10 times. I also experience the occasional hang (looks like a deadlock, but I'm not sure).
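For reference, a minimal loop to hunt for these intermittent failures (just a sketch, assuming bash and the same build tree as in the outputs above; the check.pl invocation is copied verbatim from the logs):

```
cd fftw-3.3.10/mpi
for i in $(seq 1 20); do
  echo "=== attempt $i ==="
  perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10 \
    --mpi "mpirun -np 3 $(pwd)/mpi-bench" || { echo "FAILED on attempt $i"; break; }
done
```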

We also do this build in an automated (continuous deployment) environment, where it is built within a SLURM job. For some reason, there, it always seems to fail (or at least the failure rate is high enough that 5 attempts haven't led to a successful run).

My questions here:

lrbison commented 8 months ago

@casparvl @boegel

I followed this issue here from the EESSI repo. I'm trying to reproduce, but I haven't been able to do so. I've tried GCC 13.2.0, with Open MPI 4.1.6 and Open MPI 4.1.5. I'm running on an AWS hpc7g instance (Ubuntu 22.04). After being unable to reproduce directly from the FFTW source, I tried the following EasyBuild command:

eb -dfr --from-pr 18884 --prefix=/fsx/eb --disable-cleanup-builddir

which is based on trying to reproduce https://gist.github.com/boegel/d97b974b8780c93753f1bf7462367082.

After the build, I can run make check in the builddir, but none of them reproduce the crash. Do you have any other suggestions on how to reproduce?

lrbison commented 8 months ago

One observation I have is that all the failures I've seen reported are from mpi-bench. It is true that mpirun may do slightly different things when it detects that it is running as part of a Slurm job. Can you provide any detail about how the slurm job is allocated or launched?

casparvl commented 8 months ago

I'm not sure of the exact job characteristics for the test build reported in https://gist.github.com/boegel/d97b974b8780c93753f1bf7462367082

For the builds done in EESSI I also couldn't tell you exactly what resources were requested in the job. But: this is run in a container, and then in a shell in which the only SLURM-related variable that is set is SLURM_JOB_ID. So, I'm not sure there is much for mpirun to pick up on here to figure out it is actually in a SLURM environment... Of course, SLURM can do things like set cgroups etc., which potentially affect how things run, but I couldn't tell you if that is done on this cluster. All node allocations here are exclusive, so I don't think a cgroup would do much anyway (as it would encompass the entire VM).
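To double-check that, a quick look at the environment inside the container might help (a sketch, assuming a bash shell; the cgroup paths differ between cgroup v1 and v2, so either may be absent on a given system):

```
env | grep -E '^SLURM_' || echo "no SLURM_* variables set"
nproc
# cgroup CPU limits, if any (cgroup v1 and v2 locations)
cat /sys/fs/cgroup/cpuset/cpuset.cpus 2>/dev/null
cat /sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null
```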

I did notice that I had fewer failures when I did the building interactively (though still in a job environment, it was an interactive SLURM job), as mentioned here. That seems to confirm that the environment somehow has an effect, but... I couldn't really say what. This is a hard one :(

casparvl commented 8 months ago

Hm, I suddenly realize one difference between our bot building for EESSI and your typical interactive environment: the bot not only builds in the container, it builds in a writable overlay in the container. That tends to be a bit sluggish in terms of I/O. I'm wondering if that can somehow affect how these tests run. It's a bit far-fetched, and I wouldn't be able to explain the mechanism that makes it fail, but it would explain why my own interactive attempts showed a much higher success rate.
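If it is I/O, a crude way to compare the overlay-backed working directory with e.g. /tmp would be something like this (just a sketch, not a proper benchmark; the test file name is made up):

```
# write + fsync 256 MiB in the overlay-backed working directory, then in /tmp
time dd if=/dev/zero of=./overlay_io_test bs=1M count=256 conv=fsync
time dd if=/dev/zero of=/tmp/overlay_io_test bs=1M count=256 conv=fsync
rm -f ./overlay_io_test /tmp/overlay_io_test
```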

lrbison commented 8 months ago

Hm, I wonder how many CPUs were allocated to that container? I saw it was configured to allow oversubscription, so I guess there is probably only 1 CPU core, which is different from my testing...

boegel commented 8 months ago

Our build nodes in AWS have 16 cores (*.4xlarge instances); using a single core would be way too slow.

Not sure what @casparvl used for testing interactively

lrbison commented 8 months ago

Is there a way for me to get access to that build container so I may try it myself?

casparvl commented 8 months ago

Yes, it's part of https://github.com/EESSI/software-layer . Your timing is pretty good: I very recently made a PR to our docs to explain how to use it to replicate build failures. The PR isn't merged yet, but it's markdown, so you can simply view a rendered version in my feature branch. Links won't work in there, but I guess you can find your way around if need be - though I think this one markdown doc should cover it all.

casparvl commented 8 months ago

Btw, I've tried to reproduce it once again, since we now have a new build cluster (based on Magic Castle instead of Cluster in the Cloud). I've only tried interactively (basically following the docs I just shared), and I cannot for the life of me replicate our own issue. As mentioned in the original issue, interactively I had much higher success rates (9/10 times, more or less), but I've now run

perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10  --mpi "mpirun -np 4 `pwd`/mpi-bench"

at least 20 times without failures now.

I'd love to see if the error still occurs when the bot builds it (as there it was consistently failing before), but my initial attempt failed for other reasons (basically, the bot cannot reinstall anything that already exists in the EESSI software stack - if you try, it'll fail on trying to change permissions on a read-only file). I'll check with others if there is something I can do to work around this, so that I can actually trigger a rebuild with the bot.

lrbison commented 8 months ago

Yeah, I ran it over 200 times without failure on my cluster. Thank you for the pointers in that doc PR. I'll use that to try and trigger it again.

boegel commented 8 months ago

@casparvl Should I temporarily revive a node in our old CitC Slurm cluster, to check if the problem was somehow specific to that environment?

lrbison commented 7 months ago

@casparvl I haven't had the time to reproduce within a container. Are we still seeing the test failures, or are they not happening on the newer build cluster?

boegel commented 7 months ago

I am still seeing this problem on our build cluster, when doing a test installation (in an interactive session) of FFTW.MPI/3.3.10-gompi-2023a for the new EESSI repository software.eessi.io.

A first attempt resulted in a segfault:

```
[aarch64-neoverse-v1-node2:2475846] Signal: Segmentation fault (11)
[aarch64-neoverse-v1-node2:2475846] Signal code: (-6)
[aarch64-neoverse-v1-node2:2475846] Failing at address: 0xea670025c746
[aarch64-neoverse-v1-node2:2475846] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x4000042507a0]
[aarch64-neoverse-v1-node2:2475846] [ 1] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/libopen-pal.so.40(opal_convertor_generic_simple_position+0x10)[0x400004815b10]
[aarch64-neoverse-v1-node2:2475846] [ 2] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/libopen-pal.so.40(opal_convertor_set_position_nocheck+0x120)[0x40000480dc60]
[aarch64-neoverse-v1-node2:2475846] [ 3] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_request_progress_frag+0x360)[0x4000067f3be0]
[aarch64-neoverse-v1-node2:2475846] [ 4] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/openmpi/mca_btl_smcuda.so(mca_btl_smcuda_component_progress+0x424)[0x4000060b6ec4]
[aarch64-neoverse-v1-node2:2475846] [ 5] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/libopen-pal.so.40(opal_progress+0x3c)[0x4000047fc99c]
[aarch64-neoverse-v1-node2:2475846] [ 6] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x2e4)[0x4000067ec3a4]
[aarch64-neoverse-v1-node2:2475846] [ 7] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/libmpi.so.40(MPI_Sendrecv+0x188)[0x4000044a1228]
[aarch64-neoverse-v1-node2:2475846] [ 8] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/libfftw3f_mpi.so.3(+0xa424)[0x40000426a424]
[aarch64-neoverse-v1-node2:2475846] [ 9] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/libfftw3f_mpi.so.3(+0xa480)[0x40000426a480]
[aarch64-neoverse-v1-node2:2475846] [10] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/libfftw3f_mpi.so.3(+0xc2e8)[0x40000426c2e8]
[aarch64-neoverse-v1-node2:2475846] [11] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x40586c]
[aarch64-neoverse-v1-node2:2475846] [12] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x409144]
[aarch64-neoverse-v1-node2:2475846] [13] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x40aed4]
[aarch64-neoverse-v1-node2:2475846] [14] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x409434]
[aarch64-neoverse-v1-node2:2475846] [15] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x4083c0]
[aarch64-neoverse-v1-node2:2475846] [16] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x408430]
[aarch64-neoverse-v1-node2:2475846] [17] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x40619c]
[aarch64-neoverse-v1-node2:2475846] [18] /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6(+0x26a7c)[0x400004586a7c]
[aarch64-neoverse-v1-node2:2475846] [19] /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6(__libc_start_main+0x98)[0x400004586b4c]
[aarch64-neoverse-v1-node2:2475846] [20] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x402ef0]
[aarch64-neoverse-v1-node2:2475846] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node aarch64-neoverse-v1-node2 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
FAILED mpirun -np 2 /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/mpi-bench: --verify 'ofc]12x11x4x7' --verify 'ifc]12x11x4x7' --verify 'ok7hx3o01x13e11' --verify 'ik7hx3o01x13e11' --verify 'obr6x2x8' --verify 'ibr6x2x8' --verify 'ofr6x2x8' --verify 'ifr6x2x8' --verify 'obc6x2x8' --verify 'ibc6x2x8' --verify 'ofc6x2x8' --verify 'ifc6x2x8' --verify 'ok]7o00x3o10' --verify 'ik]7o00x3o10' --verify 'ofr]5x6x12x10v1' --verify 'ifr]5x6x12x10v1' --verify 'obc]5x6x12x10v1' --verify 'ibc]5x6x12x10v1' --verify 'ofc]5x6x12x10v1' --verify 'ifc]5x6x12x10v1' --verify 'ok[3e11x13e11x9e10x9e00' --verify 'ik[3e11x13e11x9e10x9e00' --verify 'obr9x9' --verify 'ibr9x9' --verify 'ofr9x9' --verify 'ifr9x9' --verify 'obc9x9' --verify 'ibc9x9' --verify 'ofc9x9' --verify 'ifc9x9' --verify 'obrd11x24' --verify 'ibrd11x24' --verify 'ofrd11x24' --verify 'ifrd11x24' --verify 'obcd11x24' --verify 'ibcd11x24' --verify 'ofcd11x24' --verify 'ifcd11x24' --verify 'ok]8bx5o00x7o00x9e00' --verify 'ik]8bx5o00x7o00x9e00' --verify 'obc936' --verify 'ibc936' --verify 'ofc936' --verify 'ifc936'
make[3]: *** [Makefile:997: check-local] Error 1
make[3]: Leaving directory '/tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi'
```

A 2nd attempt showed relative error again:

``` -------------------------------------------------------------- MPI FFTW transforms passed 10 tests, 1 CPU -------------------------------------------------------------- perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10 --mpi "mpirun -np 2 `pwd`/mpi-bench" obr[18x8x20 3.69362e-34 6.89042e-34 5.06167e-34 ibr[18x8x20 2.97137e-34 9.18722e-34 4.77233e-34 obc[18x8x20 3.25632e-34 5.74201e-34 7.15416e-34 ibc[18x8x20 3.65554e-34 6.89042e-34 6.48955e-34 ofc[18x8x20 3.19732e-34 5.74201e-34 6.1247e-34 ifc[18x8x20 3.37891e-34 5.74201e-34 8.06618e-34 ofr]3x13x9v6 2.8078e-34 3.83781e-34 8.48571e-34 ifr]3x13x9v6 2.98203e-34 3.83781e-34 7.858e-34 obc]3x13x9v6 3.76241e-34 5.75672e-34 7.44837e-34 Found relative error 5.965609e-01 (time shift) 0 4.921836295807 -1.026863218297 4.921836295807 -1.026863218297 1 -2.116798599036 2.941852671019 -2.116798599036 2.941852671019 2 1.771648568109 -0.438286594686 1.771648568109 -0.438286594686 3 -4.602865281776 5.484918179038 -4.602865281776 5.484918179038 4 2.819387086928 0.816109936207 2.819387086928 0.816109936207 5 -6.414801972466 5.098682093116 -6.414801972466 5.098682093116 6 -9.366154153178 -8.548834590260 -9.366154153178 -8.548834590260 7 -5.574288760734 4.865193642565 -5.574288760734 4.865193642565 8 -14.824759940401 5.035876685740 -14.824759940401 5.035876685740 9 -14.634486341483 1.743297726353 -14.634486341483 1.743297726353 10 -0.602576641742 -1.289954675842 -0.602576641742 -1.289954675842 11 3.105024350062 4.204442622313 3.105024350062 4.204442622313 12 9.179168922129 2.120173302885 9.179168922129 2.120173302885 13 -6.635818177405 -0.070873458827 -6.635818177405 -0.070873458827 14 -7.664759877207 -1.432949782471 -7.664759877207 -1.432949782471 15 2.136338652683 -3.130528874653 2.136338652683 -3.130528874653 16 -8.538098299824 5.591890241715 -8.538098299824 5.591890241715 17 -0.991271869558 -3.379819003153 -0.991271869558 -3.379819003153 18 -5.661447610867 7.529683912859 -5.661447610867 7.529683912859 19 -6.086167949355 -1.670238032124 -6.086167949355 -1.670238032124 20 -6.469068064222 7.690184841734 -6.469068064222 7.690184841734 21 1.465582751707 7.789424354982 1.465582751707 7.789424354982 22 -4.932751249830 -0.964580902292 -4.932751249830 -0.964580902292 23 -4.495483109168 3.138270992002 -4.495483109168 3.138270992002 24 -4.298238335069 -2.009670396150 -4.298238335069 -2.009670396150 25 -5.616046746225 -1.630859337171 -5.616046746225 -1.630859337171 26 -0.721988139199 -3.380289724460 -0.721988139199 -3.380289724460 27 6.817499183174 3.754929401943 6.817499183174 3.754929401943 28 -0.030920191076 4.600357644276 -0.030920191076 4.600357644276 29 0.839410098370 -1.908666344239 0.839410098370 -1.908666344239 30 3.385789170379 4.595090032781 3.385789170379 4.595090032781 31 4.379259724979 0.784057635193 4.379259724979 0.784057635193 32 11.841737046195 -6.148986050574 11.841737046195 -6.148986050574 33 -4.188145406309 4.506890617698 -4.188145406309 4.506890617698 34 2.987555638465 9.441583497205 2.987555638465 9.441583497205 35 -8.098881460448 -0.524743787520 -8.098881460448 -0.524743787520 36 -1.600749567878 -6.044191420031 -1.600749567878 -6.044191420031 37 0.163953738123 2.046467146682 0.163953738123 2.046467146682 38 -5.113238538613 -3.363399510184 -5.113238538613 -3.363399510184 39 -2.872798422536 -8.040245973957 -2.872798422536 -8.040245973957 40 -10.392736117901 -2.761172631260 -10.392736117901 -2.761172631260 41 4.039519771041 8.003816207053 4.039519771041 8.003816207053 42 1.790990423870 -8.383785422669 1.790990423870 
-8.383785422669 43 -9.165783259172 -2.186625587455 -9.165783259172 -2.186625587455 44 5.007541953864 5.543722867012 5.007541953864 5.543722867012 45 2.365419650732 2.977801310135 2.365419650732 2.977801310135 46 -3.377120254702 4.906540019430 -3.377120254702 4.906540019430 47 -0.010783860068 4.273408211548 -0.010783860068 4.273408211548 48 -6.894392286266 6.830078049229 -6.894392286266 6.830078049229 49 -3.254264449347 6.744977714739 -3.254264449347 6.744977714739 50 -8.471641489793 5.603488318600 -8.471641489793 5.603488318600 51 0.084130029380 1.367262769771 0.084130029380 1.367262769771 52 1.482642504437 -4.602524328752 1.482642504437 -4.602524328752 53 0.788628072835 -9.891756192852 0.788628072835 -9.891756192852 54 -2.633046303010 11.214109607678 -2.633046303010 11.214109607678 55 -3.192499246401 3.363355364265 -3.192499246401 3.363355364265 56 -1.598444209258 3.573016880938 -1.598444209258 3.573016880938 57 5.522584641083 0.912730173997 5.522584641083 0.912730173997 58 -2.850571159892 -3.538531368267 -2.850571159892 -3.538531368267 59 0.289119554985 1.226480324376 0.289119554985 1.226480324376 60 -1.310174923968 -3.091891051678 -1.310174923968 -3.091891051678 61 -2.749495846212 -9.372017422996 -2.749495846212 -9.372017422996 62 3.279899011670 4.859168417630 3.279899011670 4.859168417630 63 2.379547285718 1.774931614389 2.379547285718 1.774931614389 64 4.662292029542 2.025644366541 4.662292029542 2.025644366541 65 -6.175223059442 -1.891888996868 -6.175223059442 -1.891888996868 66 1.731642745422 14.247081701735 1.731642745422 14.247081701735 67 -10.929576224104 -8.727780396180 -10.929576224104 -8.727780396180 68 5.844513943309 -1.235652769240 5.844513943309 -1.235652769240 69 4.853189951788 0.397500732336 4.853189951788 0.397500732336 70 1.645686104377 1.838816934461 1.645686104377 1.838816934461 71 -1.387808178933 -6.069222393915 -1.387808178933 -6.069222393915 72 -8.640352779734 7.623552803539 -8.640352779734 7.623552803539 73 -2.621092502218 6.557474990141 -2.621092502218 6.557474990141 74 -2.460425638794 0.126130793461 -2.460425638794 0.126130793461 75 -3.642105748754 -3.042790015208 -3.642105748754 -3.042790015208 76 0.903895069572 5.573680347688 0.903895069572 5.573680347688 77 -3.850746636008 -0.664540783961 -3.850746636008 -0.664540783961 78 2.670783169330 1.168453854800 2.670783169330 1.168453854800 79 0.863490161325 2.800910717379 0.863490161325 2.800910717379 80 -10.408734415051 -0.623237951468 -10.408734415051 -0.623237951468 81 -6.746215176255 -10.162136743830 -6.746215176255 -10.162136743830 82 6.010383700192 2.700168967362 6.010383700192 2.700168967362 83 7.250381313471 2.507195619411 7.250381313471 2.507195619411 84 5.728973913944 -2.066599007246 5.728973913944 -2.066599007246 85 -10.049824910825 5.688927229637 -10.049824910825 5.688927229637 86 2.592017899133 -1.850191728792 2.592017899133 -1.850191728792 87 10.779025866591 -1.076683736319 10.779025866591 -1.076683736319 88 -4.383388756630 1.650480826796 -4.383388756630 1.650480826796 89 -0.055685598972 -3.774783473873 -0.055685598972 -3.774783473873 90 6.628995072655 1.367150047102 6.628995072655 1.367150047102 91 -0.810232261568 -2.976939725877 -0.810232261568 -2.976939725877 92 -0.207344369538 -4.505328272435 -0.207344369538 -4.505328272435 93 5.262364487884 5.245089127649 5.262364487884 5.245089127649 94 -0.879545465455 -6.694733840184 -0.879545465455 -6.694733840184 95 -0.807449055017 -5.586509120899 -0.807449055017 -5.586509120899 96 4.706214482159 1.081938739490 4.706214482159 1.081938739490 97 -1.981403259786 
7.529674456958 -1.981403259786 7.529674456958 98 2.203956996302 -4.983523613820 2.203956996302 -4.983523613820 99 -2.296628834421 2.179234813172 -2.296628834421 2.179234813172 100 9.173485452525 3.228133868069 9.173485452525 3.228133868069 101 -6.386943659435 6.926987789753 -6.386943659435 6.926987789753 102 3.076153055928 1.493617153748 3.076153055928 1.493617153748 103 10.054141435677 13.326661925432 10.054141435677 13.326661925432 104 8.463391787584 -5.877325613584 8.463391787584 -5.877325613584 105 -0.696625001947 -3.802301741098 -0.696625001947 -3.802301741098 106 -8.196977873692 -2.069536940407 -8.196977873692 -2.069536940407 107 2.948666032147 2.516823938344 2.948666032147 2.516823938344 108 -7.976790406507 -8.442930303150 -7.976790406507 -8.442930303150 109 -2.921418292350 0.328394194535 -2.921418292350 0.328394194535 110 2.105361692243 -1.048071016627 2.105361692243 -1.048071016627 111 -0.122956865261 -3.178104995804 -0.122956865261 -3.178104995804 112 1.377690789409 -1.577444205340 1.377690789409 -1.577444205340 113 -4.004584148861 -4.382890836537 -4.004584148861 -4.382890836537 114 0.011427712451 6.324099444670 0.011427712451 6.324099444670 115 5.826045729088 -14.340030576439 5.826045729088 -14.340030576439 116 7.577427495586 2.873642967239 7.577427495586 2.873642967239 117 -1.210172393913 3.087617904153 -1.210172393913 3.087617904153 118 4.129688769436 -0.269191081687 4.129688769436 -0.269191081687 119 2.498805623692 8.629698093887 2.498805623692 8.629698093887 120 0.180001563022 1.905778234978 0.180001563022 1.905778234978 121 7.007577095520 -8.896514053123 7.007577095520 -8.896514053123 122 6.566401034660 -3.159194023820 6.566401034660 -3.159194023820 123 -7.616361041524 -6.592271202720 -7.616361041524 -6.592271202720 124 -7.030945328309 -2.404710690963 -7.030945328309 -2.404710690963 125 -4.795666771461 7.565990037469 -4.795666771461 7.565990037469 126 -2.375104348185 1.918133142771 -2.375104348185 1.918133142771 127 4.793627396078 -11.569053139350 4.793627396078 -11.569053139350 128 0.825614651653 -5.877317639277 0.825614651653 -5.877317639277 129 6.404638041792 7.660923814373 6.404638041792 7.660923814373 130 -5.608845937279 6.189883435798 -5.608845937279 6.189883435798 131 -2.052132858903 -3.021527799608 -2.052132858903 -3.021527799608 132 -3.547036584342 -8.799408090505 -3.547036584342 -8.799408090505 133 -0.668169395838 0.242810562341 -0.668169395838 0.242810562341 134 6.968865621898 6.811579013049 6.968865621898 6.811579013049 135 -4.777970484256 4.042227001858 -4.777970484256 4.042227001858 136 -8.001526926080 -6.737608100204 -8.001526926080 -6.737608100204 137 -0.028276084813 2.602238255100 -0.028276084813 2.602238255100 138 -0.512308956568 -6.981404730691 -0.512308956568 -6.981404730691 139 9.387188053728 -13.669568093222 9.387188053728 -13.669568093222 140 4.861551650830 4.744755456001 4.861551650830 4.744755456001 141 -1.458785307411 4.663376600332 -1.458785307411 4.663376600332 142 -3.825099657621 -4.135819803719 -3.825099657621 -4.135819803719 143 7.337295848097 -5.254042209712 7.337295848097 -5.254042209712 144 -0.313260555864 6.771687218297 -0.313260555864 6.771687218297 145 -3.163795468737 8.593709314445 -3.163795468737 8.593709314445 146 -1.637608385520 -3.916686625097 -1.637608385520 -3.916686625097 147 2.893356680624 3.492613129989 2.893356680624 3.492613129989 148 -0.241462371122 7.603996304141 -0.241462371122 7.603996304141 149 4.792674968811 6.244544979428 4.792674968811 6.244544979428 150 -4.187404522818 3.480699468993 -4.187404522818 3.480699468993 
151 2.275412058088 8.711606271295 2.275412058088 8.711606271295 152 9.309440618908 8.500678323888 9.309440618908 8.500678323888 153 -5.146801960557 0.480271780127 -5.146801960557 0.480271780127 154 -0.342934280885 4.006492082219 -0.342934280885 4.006492082219 155 -0.520225001067 -2.871435828872 -0.520225001067 -2.871435828872 156 -3.872971943304 1.447235114939 -3.872971943304 1.447235114939 157 -6.260170736857 -4.000013983045 -6.260170736857 -4.000013983045 158 -1.793247919295 4.904867267000 -1.793247919295 4.904867267000 159 -5.476491940734 2.221240632587 -5.476491940734 2.221240632587 160 -6.926551145538 4.990927999485 -6.926551145538 4.990927999485 161 8.742622092574 10.567091128674 8.742622092574 10.567091128674 162 1.485449402871 3.314914414563 1.485449402871 3.314914414563 163 -6.730468872131 -5.788026037934 -6.730468872131 -5.788026037934 164 -2.981885794878 -2.587215880092 -2.981885794878 -2.587215880092 165 1.447396351206 -12.814694105896 1.447396351206 -12.814694105896 166 -2.474143000457 -8.199676906604 -2.474143000457 -8.199676906604 167 -6.968727036826 6.621321359661 -6.968727036826 6.621321359661 168 -3.257964523801 0.484386452538 -3.257964523801 0.484386452538 169 2.319015390451 -1.703639037599 2.319015390451 -1.703639037599 170 -1.645353274574 11.946438535003 -1.645353274574 11.946438535003 171 -0.711343655735 -4.829312331723 -0.711343655735 -4.829312331723 172 -0.462013339680 3.395796127960 -0.462013339680 3.395796127960 173 -1.879403680530 -1.220043545876 -1.879403680530 -1.220043545876 174 1.907603137772 4.707561015705 1.907603137772 4.707561015705 175 4.690694650819 -3.134057632254 4.690694650819 -3.134057632254 176 -0.731397825734 10.216171123903 -0.731397825734 10.216171123903 177 1.727112370787 1.537556202680 1.727112370787 1.537556202680 178 9.804231130535 -3.050822838002 9.804231130535 -3.050822838002 179 -3.521642259704 8.644200067602 -3.521642259704 8.644200067602 180 1.847171292586 6.297594444781 1.847171292586 6.297594444781 181 -2.944580056826 -8.904668383923 -2.944580056826 -8.904668383923 182 -0.479878773202 9.252293550971 -0.479878773202 9.252293550971 183 8.105438096502 -0.100680885472 8.105438096502 -0.100680885472 184 -3.261705711112 5.625865138249 -3.261705711112 5.625865138249 185 -9.001340472449 3.481531232669 -9.001340472449 3.481531232669 186 -9.922858428321 8.928064172077 -9.922858428321 8.928064172077 187 -0.262071419781 -2.637186613527 -0.262071419781 -2.637186613527 188 13.976902439634 2.365843139075 13.976902439634 2.365843139075 189 -1.200796504034 -4.514856210494 -1.200796504034 -4.514856210494 190 10.243971312066 4.464830376249 10.243971312066 4.464830376249 191 -2.874371721067 -6.435933215796 -2.874371721067 -6.435933215796 192 1.013001932314 2.999699060836 1.013001932314 2.999699060836 193 -0.993840710862 -6.386582096375 -0.993840710862 -6.386582096375 194 3.561437964884 7.779957565555 3.561437964884 7.779957565555 195 9.312380923566 -6.079419786231 9.312380923566 -6.079419786231 196 0.417492073520 0.675369898888 0.417492073520 0.675369898888 197 -5.373267320387 -5.228378193047 -5.373267320387 -5.228378193047 198 2.811480320243 -1.530828750353 2.811480320243 -1.530828750353 199 -3.810636424898 -4.270965066499 -3.810636424898 -4.270965066499 200 -1.929116070223 -2.795097831046 -1.929116070223 -2.795097831046 201 -4.910461489544 3.949953732577 -4.910461489544 3.949953732577 202 -1.110838593410 0.859180227354 -1.110838593410 0.859180227354 203 -2.647010599309 10.090425689658 -2.647010599309 10.090425689658 204 0.055618930342 10.953225553089 
0.055618930342 10.953225553089 205 7.677359001006 1.191345729669 7.677359001006 1.191345729669 206 -1.498323690654 0.356861042000 -1.498323690654 0.356861042000 207 2.097613279533 1.809878602708 2.097613279533 1.809878602708 208 -10.433389542881 -2.767883226818 -10.433389542881 -2.767883226818 209 4.485006007605 -2.861710075652 4.485006007605 -2.861710075652 210 -11.299061334429 -4.240819220427 -11.299061334429 -4.240819220427 211 0.889330359867 -6.122606788728 0.889330359867 -6.122606788728 212 1.644972522082 -4.130609805857 1.644972522082 -4.130609805857 213 3.119752911839 5.520336783880 3.119752911839 5.520336783880 214 6.451529263230 -6.991195712115 6.451529263230 -6.991195712115 215 1.950360060868 -5.530643072460 1.950360060868 -5.530643072460 216 6.040150340031 4.344206024582 6.040150340031 4.344206024582 217 1.752640417864 -13.456400163587 1.752640417864 -13.456400163587 218 13.891823564455 -4.615650120662 13.891823564455 -4.615650120662 219 3.353087607440 -5.568825085630 3.353087607440 -5.568825085630 220 -0.238755291286 -0.122203225111 -0.238755291286 -0.122203225111 221 -4.994487942039 -2.421765879674 -4.994487942039 -2.421765879674 222 -8.659719429484 2.921445084397 -8.659719429484 2.921445084397 223 1.322825765261 6.523414416538 1.322825765261 6.523414416538 224 0.383609830312 -11.798272908431 0.383609830312 -11.798272908431 225 -4.959900682847 -4.419719391506 -4.959900682847 -4.419719391506 226 -1.407603649500 -2.756941605224 -1.407603649500 -2.756941605224 227 -7.044264525785 0.083191244366 -7.044264525785 0.083191244366 228 5.027093393519 5.195264035163 5.027093393519 5.195264035163 229 1.563992212574 -0.701216248220 1.563992212574 -0.701216248220 230 0.306554234674 4.476987321667 0.306554234674 4.476987321667 231 1.226269348284 -2.296913229853 1.226269348284 -2.296913229853 232 -4.098996468141 -6.855165091528 -4.098996468141 -6.855165091528 233 -8.845292917687 -0.923422749681 -8.845292917687 -0.923422749681 234 -4.250287799692 3.557076157786 -4.250287799692 3.557076157786 235 0.469057774787 8.279657163755 0.469057774787 8.279657163755 236 4.340048272752 -0.232303117938 4.340048272752 -0.232303117938 237 1.752288340162 -4.554038855546 1.752288340162 -4.554038855546 238 -2.786461997863 0.349152549109 -2.786461997863 0.349152549109 239 -9.048613296502 4.902932369427 -9.048613296502 4.902932369427 240 1.292868067079 10.372646253328 1.292868067079 10.372646253328 251 6.531174970235 3.643867565208 6.531174970235 3.643867565208 252 -4.541992135905 0.814485798927 -4.541992135905 0.814485798927 253 10.547201095289 13.176243470534 10.547201095289 13.176243470534 254 1.680946638121 13.004362273317 1.680946638121 13.004362273317 255 7.244027605296 -1.038411963768 7.244027605296 -1.038411963768 262 -5.925648177938 -1.268203314222 -5.925648177938 -1.268203314222 267 1.353136474995 -2.122072383697 1.353136474995 -2.122072383697 268 3.486826915908 1.572000562698 3.486826915908 1.572000562698 269 -4.426443526401 2.623133044118 -4.426443526401 2.623133044118 270 8.632661103766 -11.643280600455 8.632661103766 -11.643280600455 271 3.232947482898 8.184951094877 3.232947482898 8.184951094877 272 -0.058647847283 -6.334711265114 -0.058647847283 -6.334711265114 273 -0.941586491285 8.349265532221 -0.941586491285 8.349265532221 274 -6.447305295794 -5.049955925120 -6.447305295794 -5.049955925120 275 -11.945598200550 -4.015966059585 -11.945598200550 -4.015966059585 276 0.362306726308 14.450774594960 0.362306726308 14.450774594960 277 -13.833931179943 -7.361791104432 -13.833931179943 -7.361791104432 278 
0.768431424357 6.017412350709 0.768431424357 6.017412350709 279 0.030233743672 -2.307785598212 0.030233743672 -2.307785598212 280 10.014906601641 3.051751435802 10.014906601641 3.051751435802 281 5.112072879529 -4.132941863717 5.112072879529 -4.132941863717 282 -0.438617802708 10.276119662869 -0.438617802708 10.276119662869 283 -3.027137527217 -0.561076703303 -3.027137527217 -0.561076703303 284 3.926003026828 -4.086725429315 3.926003026828 -4.086725429315 285 0.786845785234 -1.530531963474 0.786845785234 -1.530531963474 286 -2.893235031611 -8.453773261229 -2.893235031611 -8.453773261229 287 11.596087554883 -4.013957133276 11.596087554883 -4.013957133276 288 -4.988489747276 11.688234628619 -4.988489747276 11.688234628619 289 -5.099846866775 3.149676053203 -5.099846866775 3.149676053203 290 3.993544832699 -2.176510514608 3.993544832699 -2.176510514608 291 1.791994775922 2.679198098395 1.791994775922 2.679198098395 292 6.229541027538 7.197596224506 6.229541027538 7.197596224506 293 -2.690450075242 6.678106532908 -2.690450075242 6.678106532908 294 7.028412388425 10.238169492735 7.028412388425 10.238169492735 295 -4.703505231104 -6.328634949054 -4.703505231104 -6.328634949054 296 -14.073077800312 -6.540533668748 -14.073077800312 -6.540533668748 297 -2.359761010290 4.669844938190 -2.359761010290 4.669844938190 298 -3.973951647153 -7.985259797914 -3.973951647153 -7.985259797914 299 4.741028202046 -0.901953828990 4.741028202046 -0.901953828990 FAILED mpirun -np 2 /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/mpi-bench: --verify 'obr[18x8x20' --verify 'ibr[18x8x20' --verify 'obc[18x8x20' --verify 'ibc[18x8x20' --verify 'ofc[18x8x20' --verify 'ifc[18x8x20' --verify 'ofr]3x13x9v6' --verify 'ifr]3x13x9v6' --verify 'obc]3x13x9v6' --verify 'ibc]3x13x9v6' --verify 'ofc]3x13x9v6' --verify 'ifc]3x13x9v6' --verify 'okd[10e00x7e01x4e00v11' --verify 'ikd[10e00x7e01x4e00v11' --verify 'obrd[10x11x3x10' --verify 'ibrd[10x11x3x10' --verify 'obcd[10x11x3x10' --verify 'ibcd[10x11x3x10' --verify 'ofcd[10x11x3x10' --verify 'ifcd[10x11x3x10' --verify 'okd]9o10x10e00x10e00x10b' --verify 'ikd]9o10x10e00x10e00x10b' --verify 'ofrd]3x12x6v6' --verify 'ifrd]3x12x6v6' --verify 'obcd]3x12x6v6' --verify 'ibcd]3x12x6v6' --verify 'ofcd]3x12x6v6' --verify 'ifcd]3x12x6v6' --verify 'okd6bx2e00v9' --verify 'ikd6bx2e00v9' --verify 'obr5x2x6v2' --verify 'ibr5x2x6v2' --verify 'ofr5x2x6v2' --verify 'ifr5x2x6v2' --verify 'obc5x2x6v2' --verify 'ibc5x2x6v2' --verify 'ofc5x2x6v2' --verify 'ifc5x2x6v2' --verify 'ofr]12x5x10v3' --verify 'ifr]12x5x10v3' --verify 'obc]12x5x10v3' --verify 'ibc]12x5x10v3' --verify 'ofc]12x5x10v3' --verify 'ifc]12x5x10v3' make[3]: *** [Makefile:997: check-local] Error 1 make[3]: Leaving directory '/tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi' ```
lrbison commented 7 months ago

I tried to replicate this over the weekend. @casparvl's documentation was extremely helpful, thank you! I tried to debug this PR: https://github.com/EESSI/software-layer/pull/374/files

git clone https://github.com/EESSI/software-layer.git
cd software-layer
git remote add casparvl https://github.com/casparvl/software-layer
git fetch casparvl
git checkout casparvl/fftw_test
./eessi_container.sh --access rw --save /fsx/essi-fftw1

And then within the easybuild container did this in a loop:

eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot

It ran 374 times over the weekend without failure on an hpc7g.16xlarge (64 cores).

@casparvl it sounded like you suspected a writable overlay could cause more sluggish I/O. I'm not familiar enough with the EESSI container, but I think with --access rw I have done that, correct?

Do either of you have other ideas for me to change? I suppose I can switch to a c7g.4xlarge....

lrbison commented 7 months ago

I was able to compile and successfully run on c7g.4xlarge as well, with no issues there either.

lrbison commented 6 months ago

@casparvl Do you have other ideas on how I can try to reproduce? I'm not sure if it matters, but my attempt was on Ubuntu 20.04 and the container was started using ./eessi_container.sh --access rw --save /fsx/lrbison/essi-fftw1, where the mount was hosted on an FSx for Lustre file system.

My repeated testing was repeated calls of eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot rather than repeatedly starting the container.

casparvl commented 6 months ago

Sorry for failing to come back to you on this. I'll try again myself as well. I just did one install, which indeed was successful. The second time, I ran into the same error as @boegel had the 2nd time around:

Error log:
``` MPI FFTW transforms passed 10 tests, 3 CPUs -------------------------------------------------------------- perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10 --mpi "mpirun -np 4 `pwd`/mpi-bench" Executing "mpirun -np 4 /tmp/eessi-debug.n0muoZ0cuh/easybuild/build/FFTWMPI/3.3.10/gompi-2022a/fftw-3.3.10/mpi/mpi-bench --verbose=1 --verify 'ofc10x10x3' --verify 'ifc10x10x3' --verify 'ok ]16bx11o11v6' --verify 'ik]16bx11o11v6' --verify 'ofr]12x13x8' --verify 'ifr]12x13x8' --verify 'obc]12x13x8' --verify 'ibc]12x13x8' --verify 'ofc]12x13x8' --verify 'ifc]12x13x8' --verify 'okd ]12o01x30e00' --verify 'ikd]12o01x30e00' --verify 'ofrd]3x6x3x4' --verify 'ifrd]3x6x3x4' --verify 'obcd]3x6x3x4' --verify 'ibcd]3x6x3x4' --verify 'ofcd]3x6x3x4' --verify 'ifcd]3x6x3x4' --veri fy 'okd[8o11x9e10x10o00x10e01' --verify 'ikd[8o11x9e10x10o00x10e01' --verify 'obrd12x12x5v2' --verify 'ibrd12x12x5v2' --verify 'ofrd12x12x5v2' --verify 'ifrd12x12x5v2' --verify 'obcd12x12x5v2' --verify 'ibcd12x12x5v2' --verify 'ofcd12x12x5v2' --verify 'ifcd12x12x5v2' --verify 'ok[13e11x52o00' --verify 'ik[13e11x52o00' --verify 'obrd[8x7v2' --verify 'ibrd[8x7v2' --verify 'obcd[8x7v2' --verify 'ibcd[8x7v2' --verify 'ofcd[8x7v2' --verify 'ifcd[8x7v2' --verify 'obr12x3x2x8' --verify 'ibr12x3x2x8' --verify 'ofr12x3x2x8' --verify 'ifr12x3x2x8' --verify 'obc12x3x2x8' --verify 'ibc12x3x2x8' --verify 'ofc12x3x2x8' --verify 'ifc12x3x2x8'" ofc10x10x3 1.95174e-07 3.30362e-07 1.86409e-07 ifc10x10x3 1.7346e-07 3.30362e-07 2.59827e-07 ok]16bx11o11v6 1.73834e-07 1.48147e-06 1.88905e-07 ik]16bx11o11v6 2.28489e-07 1.60348e-06 1.94972e-07 ofr]12x13x8 2.74646e-07 4.3193e-07 1.84938e-07 ifr]12x13x8 1.88937e-07 4.3193e-07 1.63803e-07 obc]12x13x8 2.10673e-07 4.3193e-07 2.28376e-07 ibc]12x13x8 1.97341e-07 4.3193e-07 2.27807e-07 ofc]12x13x8 2.19374e-07 5.39912e-07 2.17205e-07 ifc]12x13x8 2.08943e-07 4.3193e-07 2.19416e-07 okd]12o01x30e00 2.51417e-07 4.47886e-06 1.94862e-07 ikd]12o01x30e00 2.48254e-07 5.89166e-06 3.59064e-07 ofrd]3x6x3x4 1.82793e-07 2.59557e-07 1.48863e-07 ifrd]3x6x3x4 1.75387e-07 2.59557e-07 1.83453e-07 obcd]3x6x3x4 1.89722e-07 3.24447e-07 1.87965e-07 ibcd]3x6x3x4 1.94751e-07 3.24447e-07 1.69235e-07 ofcd]3x6x3x4 1.69961e-07 3.24447e-07 1.56861e-07 ifcd]3x6x3x4 1.82658e-07 3.24447e-07 1.69306e-07 Found relative error 2.900030e+35 (time shift) 0 -164.457138061523 -164.457199096680 1 -225.637115478516 -225.637100219727 2 -20.902750015259 -20.902732849121 3 172.414703369141 172.414733886719 4 -4.662590026855 -4.662593841553 5 7.010725498199 7.010738372803 6 -89.267349243164 -89.267364501953 7 326.806823730469 326.806823730469 8 -19.448410034180 -19.448524475098 9 69.001441955566 69.001434326172 10 -104.643005371094 -104.643020629883 11 -26.874126434326 -26.874076843262 12 -24.399785995483 -24.399751663208 13 -141.903198242188 -141.903198242188 14 -90.872367858887 -90.872360229492 15 -44.611225128174 -44.611217498779 16 41.871009826660 41.871009826660 17 176.062194824219 176.062194824219 18 -90.186141967773 -90.186141967773 19 -4.998665332794 -4.998687744141 ```

Running it a third time, it completed successfully again.

The only thing you don't mention explicitly is whether you also followed the steps of activating the prefix environment & the EESSI pilot stack, as described on https://www.eessi.io/docs/adding_software/debugging_failed_builds/ , and whether you sourced the configure_easybuild script. Did you do that?

If you didn't, I guess that means you've built the full software stack from the ground up. If that's the case, and if that works, then I guess the conclusion is that something is fishy with one of the FFTW.MPI dependencies we pick up from the EESSI pilot stack (and for which you would have done a fresh build). That's useful information, because it would show that using the dependencies from EESSI somehow triggers this issue. Also, it'd mean you could actually try those steps as well (i.e. start the prefix environment, start the EESSI pilot stack, source the configure_easybuild script), and see if you can replicate the issue that way. That would unambiguously prove that the issue is somewhere in the dependencies that we already have in the stack.

Just for reference, this is a snippet of my history from the point I start the container, to having run the eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot command once:

    1  EESSI_CVMFS_REPO=/cvmfs/pilot.eessi-hpc.org/
    2  EESSI_PILOT_VERSION=2023.06
    3  source ${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/init/bash
    4  export WORKDIR=$(mktemp --directory --tmpdir=/tmp  -t eessi-debug.XXXXXXXXXX)
    5  source configure_easybuild
    6  module load EasyBuild/4.8.1
    7  eb --show-config
    8  eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot

The result of eb --show-config is:

[EESSI pilot 2023.06] $ eb --show-config
#
# Current EasyBuild configuration
# (C: command line argument, D: default value, E: environment variable, F: configuration file)
#
buildpath            (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/build
containerpath        (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/containers
debug                (E) = True
experimental         (E) = True
filter-deps          (E) = Autoconf, Automake, Autotools, binutils, bzip2, DBus, flex, gettext, gperf, help2man, intltool, libreadline, libtool, Lua, M4, makeinfo, ncurses, util-linux, XZ, zlib, Yasm
filter-env-vars      (E) = LD_LIBRARY_PATH
hooks                (E) = /home/casparvl/debug_PR374/software-layer/eb_hooks.py
ignore-osdeps        (E) = True
installpath          (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_n1/testing
module-extensions    (E) = True
packagepath          (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/packages
prefix               (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild
read-only-installdir (E) = True
repositorypath       (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/ebfiles_repo
robot-paths          (D) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_n1/software/EasyBuild/4.8.1/easybuild/easyconfigs
rpath                (E) = True
sourcepath           (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/sources:
sysroot              (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64
trace                (E) = True
zip-logs             (E) = bzip2

Curious to hear if you ran using the EESSI pilot stack for dependencies. Maybe you can also share your eb --show-config output.

casparvl commented 6 months ago

I'm also still puzzled by the randomness of this issue. I'd love to better understand why the failure of these tests is random. Is the input randomly generated? Is the algorithm simply non-deterministic (e.g. because of a non-deterministic order in reduction operations or something of that nature)? I'd love to understand if that 'randomness' could somehow be affected by the environment, as initially I seemed to see many more failures in a job environment than interactively... But I'm not sure if any of you has such an intricate knowledge of what these particular tests do :)
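One thing that might help take the random test selection out of the picture: the failing transform strings are printed in the output, so a single case can be repeated on its own (a sketch; the verify string below is just copied from one of the failed runs above, and this assumes mpi-bench exits non-zero when verification fails, which is how check.pl detects failures):

```
cd fftw-3.3.10/mpi
# repeat a single transform taken from a failed run until it fails again
for i in $(seq 1 50); do
  mpirun -np 4 ./mpi-bench --verbose=1 --verify 'ofr]3x14x13x13' || { echo "failed on attempt $i"; break; }
done
```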

lrbison commented 6 months ago

Yes, I'm afraid I can't speak for the fftw developers here, perhaps @matteo-frigo could help answer the question about what ../tests/check.pl is checking, and if the failures are catastrophic or simply small precision errors?

lrbison commented 6 months ago

@casparvl

My complete steps are here:

git clone https://github.com/EESSI/software-layer.git
cd software-layer
git remote add casparvl https://github.com/casparvl/software-layer
git fetch casparvl
git checkout casparvl/fftw_test
./eessi_container.sh --access rw --save /fsx/lrbison/essi-fftw1
Apptainer> echo ${EESSI_CVMFS_REPO}; echo ${EESSI_PILOT_VERSION}
/cvmfs/pilot.eessi-hpc.org
2023.06

export EESSI_OS_TYPE=linux  # We only support Linux for now
export EESSI_CPU_FAMILY=$(uname -m)
${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/compat/${EESSI_OS_TYPE}/${EESSI_CPU_FAMILY}/startprefix
#...(wait a bit)
export EESSI_CVMFS_REPO=/cvmfs/pilot.eessi-hpc.org
export EESSI_PILOT_VERSION=2023.06
source ${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/init/bash

export WORKDIR=/tmp/try1
source configure_easybuild
module load EasyBuild/4.8.1
eb --show-config

eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot

Sadly I didn't save my easybuild output, let me re-create it again. I am curious, when you "retry" do you retry from eb --easystack... or do you retry from ./eessi_container.sh ...?

casparvl commented 6 months ago

Ok, so you also built on top of the dependencies that were already provided from the EESSI side. Then I really don't see any differences, other than (potentially) things in the environment... Strange!

> I am curious, when you "retry" do you retry from eb --easystack... or do you retry from ./eessi_container.sh ...?

Like you, I retried from eb --easystack .... So, I get different results, even without restarting the container...

Also interesting, I've tried a 4th time. Now I get a hanging process. I.e. I see two lt-mpi-bench processes using ~100% CPU, and having done so for 66 minutes straight. They normally complete much faster. MPI deadlock...?

lrbison commented 6 months ago

I would love a backtrace of both of those processes!
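For reference, one way to grab those backtraces next time it hangs (a sketch, assuming gdb is available in the environment and the hung processes show up under the name lt-mpi-bench):

```
for pid in $(pgrep -f lt-mpi-bench); do
  echo "=== backtrace for PID $pid ==="
  gdb -p "$pid" -batch -ex 'thread apply all bt'
done
```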

casparvl commented 6 months ago

Great idea... but unfortunately my allocation ended 2 minutes after I noticed the hang :( I'm pretty sure I had process hangs before as well, when I ran into this issue originally. I'll try to run it a couple more times tonight, see if I can trigger it again and get a backtrace...

casparvl commented 6 months ago

Hm, while trying to reproduce my hang (which I didn't succeed in yet), I noticed something: the automatic initialization script from EESSI thinks this node is a neoverse_n1. I seem to remember some chatter about this architecture not being detected properly, but thought we fixed that - maybe not. Anyway, it will build against dependencies optimized on neoverse_n1. I'm pretty sure our build bot overrides this automatic CPU architecture detection, but maybe @boegel can confirm... It would at least point to one difference between what our bot does, and what I get interactively.

Anyway, for now, I'll override myself with export EESSI_SOFTWARE_SUBDIR_OVERRIDE=aarch64/neoverse_v1 before sourcing the init script. See where that takes me in terms of build failures, hangs, etc.

casparvl commented 6 months ago

Interesting, now that I correctly use the right dependencies (due to export EESSI_SOFTWARE_SUBDIR_OVERRIDE=aarch64/neoverse_v1), the failures are suddenly consistent, instead of occasional. Maybe you could give that a try as well: set it after running startprefix, but before sourcing the initialization script. Also, at this point, you may unset EESSI_SILENT. That will cause the init script to print what architecture is selected (it should respect your override, but it's good to check).
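Concretely, that order looks something like this (a sketch based on the commands earlier in this thread, to be run after startprefix and before anything else):

```
export EESSI_SOFTWARE_SUBDIR_OVERRIDE=aarch64/neoverse_v1
unset EESSI_SILENT
source ${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/init/bash
```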

I've run it about 10-15 times now. Each time, it fails with a numerical error like the one above. Now, finally, I've managed to reproduce the hanging 2 processes. Here's the backtrace:

(gdb) bt full
#0  0x000040002c61c604 in opal_timer_linux_get_cycles_sys_timer ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libopen-pal.so.40
No symbol table info available.
#1  0x000040002c5ccaec in opal_progress_events.isra ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libopen-pal.so.40
No symbol table info available.
#2  0x000040002c5ccc88 in opal_progress () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libopen-pal.so.40
No symbol table info available.
#3  0x000040002c22babc in ompi_request_default_wait () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#4  0x000040002c27e284 in ompi_coll_base_sendrecv_actual ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#5  0x000040002c27f40c in ompi_coll_base_allreduce_intra_recursivedoubling ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#6  0x000040002c27fad4 in ompi_coll_base_allreduce_intra_ring ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#7  0x000040002ea861cc in ompi_coll_tuned_allreduce_intra_dec_fixed ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/openmpi/mca_coll_tuned.so
No symbol table info available.
#8  0x000040002c23b4e8 in PMPI_Allreduce () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#9  0x000040002c0161d0 in fftwf_mpi_any_true ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#10 0x000040002c067648 in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#11 0x000040002c06781c in fftwf_mkplan_d ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#12 0x000040002c01ef0c in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#13 0x000040002c0670e8 in search0 () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#14 0x000040002c0673a4 in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#15 0x000040002c06781c in fftwf_mkplan_d ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#16 0x000040002c01e49c in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#17 0x000040002c0670e8 in search0 () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#18 0x000040002c0673a4 in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#19 0x000040002c0e83ac in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#20 0x000040002c0e85a0 in fftwf_mkapiplan ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#21 0x000040002c017aac in fftwf_mpi_plan_guru_r2r ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#22 0x000040002c017bcc in fftwf_mpi_plan_many_r2r ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#23 0x0000000000404928 in mkplan ()
No symbol table info available.
#24 0x0000000000405778 in setup ()
No symbol table info available.
#25 0x00000000004085e0 in verify ()
No symbol table info available.
#26 0x0000000000406498 in bench_main ()
No symbol table info available.
#27 0x000040002c346a7c in ?? () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6
No symbol table info available.
#28 0x000040002c346b4c in __libc_start_main () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6
No symbol table info available.
#29 0x0000000000402f30 in _start ()
No symbol table info available.
boegel commented 6 months ago

Hm, while trying to reproduce my hang (which I haven't succeeded in yet), I noticed something: the automatic initialization script from EESSI thinks this node is a neoverse_n1. I seem to remember some chatter about this architecture not being detected properly, but thought we fixed that - maybe not. Anyway, it will build against dependencies optimized for neoverse_n1. I'm pretty sure our build bot overrides this automatic CPU architecture detection, but maybe @boegel can confirm... It would at least point to one difference between what our bot does and what I get interactively.

Our bot indeed overrides the CPU auto-detection during building, because archspec is sometimes a bit too pedantic (see for example https://github.com/archspec/archspec-json/issues/38).

In software.eessi.io we've switched to our own pure bash archdetect mechanism, which is less pedantic, but that's not used during builds either: the build bot just sets $EESSI_SOFTWARE_SUBDIR_OVERRIDE based on its configuration.
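
For reference, a minimal sketch of forcing that same target interactively instead of relying on auto-detection (the init script path and the subdirectory value are assumptions based on the paths above; adjust them to your own setup):

# Hypothetical sketch: pin the software subdirectory before initializing EESSI,
# so builds use the neoverse_v1 stack regardless of what CPU detection reports.
export EESSI_SOFTWARE_SUBDIR_OVERRIDE=aarch64/neoverse_v1
source /cvmfs/pilot.eessi-hpc.org/versions/2023.06/init/bash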

lrbison commented 6 months ago

Seems like we (you) are making progress! I tried to add your override. Here is my eb config:

buildpath            (E) = /tmp/try1/easybuild/build
containerpath        (E) = /tmp/try1/easybuild/containers
debug                (E) = True
experimental         (E) = True
filter-deps          (E) = Autoconf, Automake, Autotools, binutils, bzip2, DBus, flex, gettext, gperf, help2man, intltool, libreadline, libtool, Lua, M4, makeinfo, ncurses, util-linux, XZ, zlib, Yasm
filter-env-vars      (E) = LD_LIBRARY_PATH
hooks                (E) = /tmp/software-layer/eb_hooks.py
ignore-osdeps        (E) = True
installpath          (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing
module-extensions    (E) = True
packagepath          (E) = /tmp/try1/easybuild/packages
prefix               (E) = /tmp/try1/easybuild
read-only-installdir (E) = True
repositorypath       (E) = /tmp/try1/easybuild/ebfiles_repo
robot-paths          (D) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/EasyBuild/4.8.1/easybuild/easyconfigs
rpath                (E) = True
sourcepath           (E) = /tmp/try1/easybuild/sources:
sysroot              (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64
trace                (E) = True
zip-logs             (E) = bzip2

But I still don't get failures during testing.

I do think allreduce has the potential to be non-deterministic; however, I'm unsure whether the ompi_coll_base_allreduce_intra_ring implementation is deterministic.
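
If it would help to rule the collective implementation in or out, here is a hedged sketch of pinning Open MPI's tuned allreduce to one specific algorithm (the MCA parameters below exist in Open MPI's coll/tuned component, but the algorithm-number mapping varies between versions, so check it with ompi_info first; the test command itself is a placeholder):

# Check which algorithm numbers map to which allreduce implementations:
ompi_info --param coll tuned --level 9 | grep allreduce
# Then force one algorithm for all message sizes, e.g.:
mpirun -np 3 --mca coll_tuned_use_dynamic_rules 1 \
             --mca coll_tuned_allreduce_algorithm 4 \
             <the mpi-bench test command from the EasyBuild log>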

I wonder, is there a way for me to continually run the test without rebuilding each time?

casparvl commented 6 months ago

It is possible. What you could do is stop the EasyBuild installation after a certain point using the --stop argument. You can do that by editing the yaml file and making it look like this at the end:

  - FFTW.MPI-3.3.10-gompi-2022a.eb:
      options:
        rebuild: True
        stop: 'build'

This should stop it after the build step (and before the test step). Then, you'd want to run

eb FFTW.MPI-3.3.10-gompi-2022a.eb --dump-env-script

This will dump a script FFTW.MPI-3.3.10-gompi-2022a.env that you can source to get the same environment that EasyBuild has during the build. Then, check one of your prior builds (done before you added the 'stop' in the yaml file) to see which command was executed by EasyBuild as its test step and in which directory. The logs are pretty verbose, so it may be a bit of a puzzle to find, but at least they show all that information.

Then source that FFTW.MPI-3.3.10-gompi-2022a.env, go to the directory in which EasyBuild normally runs its test step (or an equivalent dir: your tempdir might differ between your stopped build and the prior build you inspected the logs for, so the prefix might look a little different), and run the command that EasyBuild ran as its test step. That last command you should be able to put in a loop, roughly as sketched below.
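
A minimal sketch of that loop (all paths and the test command are placeholders; take the real values from your own stopped build and the prior build's log):

# Hypothetical sketch: reuse EasyBuild's build environment and repeat the test step.
source ./FFTW.MPI-3.3.10-gompi-2022a.env   # env script dumped by eb --dump-env-script
cd /tmp/try1/easybuild/build/<...>/mpi     # the directory EasyBuild used for its test step
for i in $(seq 1 100); do
    make check || break                    # or the exact test command from the EasyBuild log
done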

casparvl commented 6 months ago

By the way, your installpath from the eb --show-config shows that you are indeed using the neoverse_v1 copy of the software stack (which should be the case since you use the override), so that's good.

I'm absolutely puzzled by why things are different for you than for us. Short of seeing if we could have you test things on our cluster, I don't know what else to try for you to reproduce the failure... :/ If that's something you would be up for, see if you can reach out to @boegel on the EESSI Slack in a DM (join here if you're not yet on that channel); he might be able to arrange it for you.

@boegel maybe you could also do the reverse: spin up a regular VM outside of our Magic Castle setup and see if you can reproduce the issue there? If not, it must be related to our cluster setup somehow...

Also a heads up: I'm going to be on quite a long leave, so won't be able to respond for the next month or so. Again, maybe @boegel can follow up if needed :)

lrbison commented 6 months ago

Thank you for the testing insight and the slack invite. Enjoy the break. I'll talk to @boegel on slack and see what he thinks is a reasonable next step.

boegel commented 5 months ago

@lrbison When would you like to follow up on this?

lrbison commented 5 months ago

I talked offline with Kenneth.

In the meantime, my pattern-matching neurons fired:

both https://github.com/FFTW/fftw3/issues/334#issuecomment-1820587375 and https://gitlab.com/eessi/support/-/issues/24#note_1734228961 have something in common:

Both are in mca_btl_smcuda_component_progress from the smcuda module, but I recall smcuda should really only be engaged when CUDA/ROCm/{accelerator} memory is used; otherwise we should be using the SM BTL. I'll follow up on that.

Another similarity is that although the fftw backtrace is just from a sendrecv, the hang was stopped during allreduce, and both the OpenFOAM and FFTW cases were doing ompi_coll_base_allreduce_intra_recursivedoubling. However, my gut tells me it's not the reduction at fault but rather the progress engine (partially because I know for a fact that we are testing that allreduce function daily without issue).
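
One quick way to test that suspicion, if useful (a sketch using Open MPI's standard MCA component-exclusion syntax; whether the shared-memory BTL is called sm or vader depends on the Open MPI version, and the test command is a placeholder):

# Exclude the smcuda BTL; if the intermittent hang disappears, that points at
# smcuda's progress path rather than at the allreduce algorithm itself.
mpirun -np 3 --mca btl ^smcuda <the mpi-bench test command from the EasyBuild log>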

lrbison commented 5 months ago

Moving the rest of this discussion to https://gitlab.com/eessi/support/-/issues/41

lrbison commented 4 months ago

The root cause was https://github.com/open-mpi/ompi/issues/12270, fixed in https://github.com/open-mpi/ompi/pull/12338, so this issue can be closed.

rdolbeau commented 6 days ago

For Neoverse V1 users: if you can also try the release-for-testing in #315 and report back, it would be useful for getting SVE support upstream.

rdolbeau commented 6 days ago

Closing as requested.