casparvl closed this issue 4 months ago
@casparvl @boegel
I followed this issue here from the EESSI repo. I'm trying to reproduce, but I haven't been able to do so. I've tried GCC 13.2.0 with Open MPI 4.1.6 and Open MPI 4.1.5, running on an AWS hpc7g instance (Ubuntu 22.04). After being unable to reproduce directly from the FFTW source, I tried the following EasyBuild command:
eb -dfr --from-pr 18884 --prefix=/fsx/eb --disable-cleanup-builddir
which is based on trying to reproduce https://gist.github.com/boegel/d97b974b8780c93753f1bf7462367082.
After the build, I can run `make check` in the builddir, but none of the tests reproduce the crash. Do you have any other suggestions on how to reproduce?
One observation I have is that all the failures I've seen reported are from mpi-bench. It is true that mpirun may do slightly different things when it detects that it is running as part of a Slurm job. Can you provide any detail about how the slurm job is allocated or launched?
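To make that concrete, a quick way to see which Slurm variables are visible to mpirun in a given build shell is just to list them (a minimal sketch; the variable names are the standard ones Slurm exports into job environments):

```shell
# Print any SLURM_* variables visible to this shell; Open MPI's Slurm
# integration keys off variables like these to decide it is inside a job.
env | grep '^SLURM_' || echo 'no SLURM_* variables set'
```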
I'm not sure of the exact job characteristics for the test build reported in https://gist.github.com/boegel/d97b974b8780c93753f1bf7462367082
For the builds done in EESSI I also couldn't tell you exactly what resources were requested in the job. But: this is run in a container, and then in a shell in which the only SLURM-related job variable that is set is `SLURM_JOB_ID`. So, I'm not sure if there is much for `mpirun` to pick up on here to figure out it actually is in a SLURM environment... Of course, SLURM can do things like set cgroups etc., which potentially affect how things run, but I couldn't tell you if that is done on this cluster. All node allocations here are exclusive, so I don't think a cgroup would do much anyway (as it would encompass the entire VM).
I did notice that I had fewer failures when I did the building interactively (though still in a job environment: it was an interactive SLURM job), as mentioned here. That seems to confirm that the environment somehow has an effect, but... I couldn't really say what. This is a hard one :(
Hm, I suddenly realize one difference between our bot building for EESSI, and your typical interactive environment: the bot not only builds in the container, it builds in a writeable overlay in the container. That tends to be a bit sluggish in terms of I/O. I'm wondering if that can somehow affect how these tests run. It's a bit far-fetched, and I wouldn't be able to explain the mechanism that makes it fail, but it would explain why my own interactive attempts showed a much higher success rate.
Hm, I wonder how many CPUs were allocated to that container? I saw it was configured to allow oversubscription; I guess there is probably only 1 CPU core, which is different from my testing...
Our build nodes in AWS have 16 cores (`*.4xlarge` instances); using a single core would be way too slow.
Not sure what @casparvl used for testing interactively
Is there a way for me to get access to that build container so I may try it myself?
Yes, it's part of https://github.com/EESSI/software-layer . Your timing is pretty good, I very recently made a PR to our docs to explain how to use it to replicate build failures. PR isn't merged yet, but it's markdown, so you can simply view a rendered version in my feature branch. Links won't work in there, but I guess you can find your way around if need be - though I think this one markdown doc should cover it all.
Btw, I've tried to reproduce it once again, since we now have a new build cluster (based on Magic Castle instead of Cluster in the Cloud). I've only tried interactively (basically following the docs I just shared), and I cannot for the life of me replicate our own issue. As mentioned in the original issue, interactively I had much higher success rates (9/10 times, more or less), but I've now run
perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10 --mpi "mpirun -np 4 `pwd`/mpi-bench"
at least 20 times without failures.
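Running that check many times unattended can be scripted along these lines (a sketch: `CHECK_CMD` is just the command above, and the log filenames are illustrative):

```shell
# Repeat the FFTW MPI check and stop at the first failing attempt,
# keeping that attempt's log for inspection.
CHECK_CMD='perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10 --mpi "mpirun -np 4 $(pwd)/mpi-bench"'
for i in $(seq 1 20); do
  if ! eval "$CHECK_CMD" > "check-$i.log" 2>&1; then
    echo "attempt $i failed; log kept in check-$i.log"
    break
  fi
done
```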
I'd love to see if the error still occurs when the bot builds it (as there it was consistently failing before), but my initial attempt failed for other reasons (basically, the bot cannot reinstall anything that already exists in the EESSI software stack - if you try, it'll fail on trying to change permissions on a read only file). I'll check with others if there is something I can do to work around this, so that I can actually trigger a rebuild with the bot.
Yeah, I ran it over 200 times without failure on my cluster. Thank you for the pointers in that doc PR. I'll use that to try and trigger it again.
@casparvl Should I temporarily revive a node in our old CitC Slurm cluster, to check if the problem was somehow specific to that environment?
@casparvl I haven't had the time to reproduce within a container. Are we still seeing the testing failures occur or is it not happening on the newer build cluster?
I am still seeing this problem on our build cluster, when doing a test installation (in an interactive session) of `FFTW.MPI/3.3.10-gompi-2023a` for the new EESSI repository `software.eessi.io`.
A first attempt resulted in a segfault:
A 2nd attempt showed relative error again:
I tried to replicate this over the weekend. @casparvl's documentation was extremely helpful, thank you! I tried to debug this PR: https://github.com/EESSI/software-layer/pull/374/files
git clone https://github.com/EESSI/software-layer.git
cd software-layer
git remote add casparvl https://github.com/casparvl/software-layer
git fetch casparvl
git checkout casparvl/fftw_test
./eessi_container.sh --access rw --save /fsx/essi-fftw1
And then within the easybuild container did this in a loop:
eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot
It ran 374 times over the weekend without failure on an hpc7g.16xlarge (64 cores).
@casparvl it sounded like you suspected a writable overlay could cause more sluggish I/O. I'm not familiar enough with the EESSI container, but I think with `--access rw` I have done that, correct?
Do either of you have other ideas for me to change? I suppose I can switch to a c7g.4xlarge....
I was able to compile and successfully run on c7g.4xlarge as well, with no issues there either.
@casparvl Do you have other ideas on how I can try to reproduce? I'm not sure if it matters, but my attempt was on an Ubuntu 2004 and the container was started using: ./eessi_container.sh --access rw --save /fsx/lrbison/essi-fftw1
where the mount was hosted on an FSx for Lustre file system.
My repeated testing was repeated calls of `eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot` rather than repeatedly starting the container.
Sorry for failing to come back to you on this. I'll try again myself as well. I just did one install, which indeed was successful. The second time, I ran into the same error as @boegel had the 2nd time around:
Running it a third time, it completed successfully again.
The only thing you don't explicitly mention is whether you also followed the steps of activating the prefix environment & EESSI pilot stack, as described on https://www.eessi.io/docs/adding_software/debugging_failed_builds/ , and whether you sourced the `configure_easybuild` script. Did you do that?
If you didn't, I guess that means you've built the full software stack from the ground up. If that's the case, and if that works, then I guess the conclusion is that something is fishy with one of the FFTW.MPI dependencies we pick up from the EESSI pilot stack (and which you would have built fresh). That's useful information, because it would show that using the dependencies from EESSI somehow triggers this issue. Also, it'd mean you could actually try those steps as well (i.e. start the prefix environment, start the EESSI pilot stack, source the `configure_easybuild` script), and see if you can replicate the issue that way. That would unambiguously prove that the issue is somewhere in the dependencies that we already have in the stack.
Just for reference, this is a snippet of my `history` from the point I start the container, to having run the `eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot` command once:
1 EESSI_CVMFS_REPO=/cvmfs/pilot.eessi-hpc.org/
2 EESSI_PILOT_VERSION=2023.06
3 source ${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/init/bash
4 export WORKDIR=$(mktemp --directory --tmpdir=/tmp -t eessi-debug.XXXXXXXXXX)
5 source configure_easybuild
6 module load EasyBuild/4.8.1
7 eb --show-config
8 eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot
The result of `eb --show-config` is:
[EESSI pilot 2023.06] $ eb --show-config
#
# Current EasyBuild configuration
# (C: command line argument, D: default value, E: environment variable, F: configuration file)
#
buildpath (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/build
containerpath (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/containers
debug (E) = True
experimental (E) = True
filter-deps (E) = Autoconf, Automake, Autotools, binutils, bzip2, DBus, flex, gettext, gperf, help2man, intltool, libreadline, libtool, Lua, M4, makeinfo, ncurses, util-linux, XZ, zlib, Yasm
filter-env-vars (E) = LD_LIBRARY_PATH
hooks (E) = /home/casparvl/debug_PR374/software-layer/eb_hooks.py
ignore-osdeps (E) = True
installpath (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_n1/testing
module-extensions (E) = True
packagepath (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/packages
prefix (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild
read-only-installdir (E) = True
repositorypath (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/ebfiles_repo
robot-paths (D) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_n1/software/EasyBuild/4.8.1/easybuild/easyconfigs
rpath (E) = True
sourcepath (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/sources:
sysroot (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64
trace (E) = True
zip-logs (E) = bzip2
Curious to hear if you ran using the EESSI pilot stack for dependencies. Maybe you can also share your `eb --show-config` output.
I'm also still puzzled by the randomness of this issue. I'd love to better understand why the failure of these tests is random. Is the input randomly generated? Is the algorithm simply non-deterministic (e.g. because of a non-deterministic order in reduction operations, or something of that nature)? I'd love to understand if that 'randomness' could somehow be affected by the environment, as initially I seem to have seen many more failures in a job environment than interactively... But I'm not sure if any of you has such intricate knowledge of what these particular tests do :)
Yes, I'm afraid I can't speak for the FFTW developers here; perhaps @matteo-frigo could help answer the question about what `../tests/check.pl` is checking, and whether the failures are catastrophic or simply small precision errors?
@casparvl
My complete steps are here:
git clone https://github.com/EESSI/software-layer.git
cd software-layer
git remote add casparvl https://github.com/casparvl/software-layer
git fetch casparvl
git checkout casparvl/fftw_test
./eessi_container.sh --access rw --save /fsx/lrbison/essi-fftw1
Apptainer> echo ${EESSI_CVMFS_REPO}; echo ${EESSI_PILOT_VERSION}
/cvmfs/pilot.eessi-hpc.org
2023.06
export EESSI_OS_TYPE=linux # We only support Linux for now
export EESSI_CPU_FAMILY=$(uname -m)
${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/compat/${EESSI_OS_TYPE}/${EESSI_CPU_FAMILY}/startprefix
#...(wait a bit)
export EESSI_CVMFS_REPO=/cvmfs/pilot.eessi-hpc.org
export EESSI_PILOT_VERSION=2023.06
source ${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/init/bash
export WORKDIR=/tmp/try1
source configure_easybuild
module load EasyBuild/4.8.1
eb --show-config
eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot
Sadly I didn't save my EasyBuild output, let me re-create it. I am curious: when you "retry", do you retry from `eb --easystack ...` or do you retry from `./eessi_container.sh ...`?
Ok, so you also built on top of the dependencies that were already provided from the EESSI side. Then I really don't see any differences, other than (potentially) things in the environment... Strange!
I am curious, when you "retry" do you retry from eb --easystack... or do you retry from ./eessi_container.sh ...?
Like you, I retried from `eb --easystack ...`. So, I get different results, even without restarting the container...
Also interesting: I've tried a 4th time. Now I get a hanging process. I.e. I see two `lt-mpi-bench` processes using ~100% CPU, and having done so for 66 minutes straight. They normally complete much faster. MPI deadlock...?
I would love a backtrace of both of those processes!
Great idea... but unfortunately my allocation ended 2 minutes after I noticed the hang :( I'm pretty sure I had process hangs before as well, when I ran into this issue originally. I'll try to run it a couple more times tonight, see if I can trigger it again and get a backtrace...
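For next time, grabbing the backtraces can be scripted so it works even with only a minute left on the allocation (a sketch; it assumes gdb is installed on the node and that you have ptrace permission on the hung processes):

```shell
# Attach non-interactively to each hung lt-mpi-bench process and dump
# backtraces of all threads, then detach.
for pid in $(pgrep -f lt-mpi-bench); do
  echo "=== backtrace of PID $pid ==="
  gdb -batch -p "$pid" -ex 'thread apply all bt' 2>/dev/null
done
```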
Hm, while trying to reproduce my hang (which I haven't succeeded in yet), I noticed something: the automatic initialization script from EESSI thinks this node is a `neoverse_n1`. I seem to remember some chatter about this architecture not being detected properly, but I thought we fixed that - maybe not. Anyway, it will build against dependencies optimized for `neoverse_n1`. I'm pretty sure our build bot overrides this automatic CPU architecture detection, but maybe @boegel can confirm... It would at least point to one difference between what our bot does and what I get interactively.
Anyway, for now, I'll override it myself with `export EESSI_SOFTWARE_SUBDIR_OVERRIDE=aarch64/neoverse_v1` before sourcing the init script. See where that takes me in terms of build failures, hangs, etc.
Interesting: now that I use the right dependencies (due to `export EESSI_SOFTWARE_SUBDIR_OVERRIDE=aarch64/neoverse_v1`), the failures are suddenly consistent, instead of occasional. Maybe you could give that a try as well: set it after running `startprefix`, but before sourcing the initialization script. Also, at this point, you may want to unset `EESSI_SILENT`. That will cause the init script to print which architecture is selected (it should respect your override, but it's good to check).
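In other words, the order of operations inside the container would look something like this (the cvmfs path is the one used elsewhere in this thread):

```shell
# Inside the container, after running startprefix:
export EESSI_SOFTWARE_SUBDIR_OVERRIDE=aarch64/neoverse_v1  # force the correct arch
unset EESSI_SILENT   # so the init script reports which architecture it selected
source /cvmfs/pilot.eessi-hpc.org/versions/2023.06/init/bash
```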
I've run it about 10-15 times now. Each time, it fails with a numerical error like the one above. Now, finally, I've managed to reproduce the hanging 2 processes. Here's the backtrace:
(gdb) bt full
#0 0x000040002c61c604 in opal_timer_linux_get_cycles_sys_timer ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libopen-pal.so.40
No symbol table info available.
#1 0x000040002c5ccaec in opal_progress_events.isra ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libopen-pal.so.40
No symbol table info available.
#2 0x000040002c5ccc88 in opal_progress () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libopen-pal.so.40
No symbol table info available.
#3 0x000040002c22babc in ompi_request_default_wait () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#4 0x000040002c27e284 in ompi_coll_base_sendrecv_actual ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#5 0x000040002c27f40c in ompi_coll_base_allreduce_intra_recursivedoubling ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#6 0x000040002c27fad4 in ompi_coll_base_allreduce_intra_ring ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#7 0x000040002ea861cc in ompi_coll_tuned_allreduce_intra_dec_fixed ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/openmpi/mca_coll_tuned.so
No symbol table info available.
#8 0x000040002c23b4e8 in PMPI_Allreduce () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#9 0x000040002c0161d0 in fftwf_mpi_any_true ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#10 0x000040002c067648 in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#11 0x000040002c06781c in fftwf_mkplan_d ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#12 0x000040002c01ef0c in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#13 0x000040002c0670e8 in search0 () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#14 0x000040002c0673a4 in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#15 0x000040002c06781c in fftwf_mkplan_d ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#16 0x000040002c01e49c in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#17 0x000040002c0670e8 in search0 () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#18 0x000040002c0673a4 in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#19 0x000040002c0e83ac in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#20 0x000040002c0e85a0 in fftwf_mkapiplan ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#21 0x000040002c017aac in fftwf_mpi_plan_guru_r2r ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#22 0x000040002c017bcc in fftwf_mpi_plan_many_r2r ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#23 0x0000000000404928 in mkplan ()
No symbol table info available.
#24 0x0000000000405778 in setup ()
No symbol table info available.
#25 0x00000000004085e0 in verify ()
No symbol table info available.
#26 0x0000000000406498 in bench_main ()
No symbol table info available.
#27 0x000040002c346a7c in ?? () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6
No symbol table info available.
#28 0x000040002c346b4c in __libc_start_main () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6
No symbol table info available.
#29 0x0000000000402f30 in _start ()
No symbol table info available.
Hm, while trying to reproduce my hang (which I didn't succeed in yet), I noticed something: the automatic initialization script from EESSI thinks this node is a neoverse_n1. I seem to remember some chatter about this architecture not being detected properly, but thought we fixed that - maybe not. Anyway, it will build against dependencies optimized on neoverse_n1. I'm pretty sure our build bot overrides this automatic CPU architecture detection, but maybe @boegel can confirm... It would at least point to one difference between what our bot does, and what I get interactively.
Our bot indeed overrides the CPU auto-detection during building, because `archspec` is sometimes a bit too pedantic (see for example https://github.com/archspec/archspec-json/issues/38).
In `software.eessi.io` we've switched to our own pure bash `archdetect` mechanism, which is less pedantic, but that's not used during build either: the build bot just sets `$EESSI_SOFTWARE_SUBDIR_OVERRIDE` based on its configuration.
Seems like we (you) are making progress! I tried to add your override. Here is my eb config:
buildpath (E) = /tmp/try1/easybuild/build
containerpath (E) = /tmp/try1/easybuild/containers
debug (E) = True
experimental (E) = True
filter-deps (E) = Autoconf, Automake, Autotools, binutils, bzip2, DBus, flex, gettext, gperf, help2man, intltool, libreadline, libtool, Lua, M4, makeinfo, ncurses, util-linux, XZ, zlib, Yasm
filter-env-vars (E) = LD_LIBRARY_PATH
hooks (E) = /tmp/software-layer/eb_hooks.py
ignore-osdeps (E) = True
installpath (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing
module-extensions (E) = True
packagepath (E) = /tmp/try1/easybuild/packages
prefix (E) = /tmp/try1/easybuild
read-only-installdir (E) = True
repositorypath (E) = /tmp/try1/easybuild/ebfiles_repo
robot-paths (D) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/EasyBuild/4.8.1/easybuild/easyconfigs
rpath (E) = True
sourcepath (E) = /tmp/try1/easybuild/sources:
sysroot (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64
trace (E) = True
zip-logs (E) = bzip2
But I still don't get failures during testing.
I do think allreduce has the potential to be non-deterministic; however, I'm unsure whether the `ompi_coll_base_allreduce_intra_ring` implementation is or isn't deterministic.
I wonder, is there a way for me to continually run the test without rebuilding each time?
It is possible. What you could do is stop the EasyBuild installation after a certain point using the `--stop` argument. You can do that by editing the yaml file and making it look like this at the end:
- FFTW.MPI-3.3.10-gompi-2022a.eb:
    options:
      rebuild: True
      stop: 'build'
This should stop it after the build step (and before the test step). Then, you'd want to run
eb FFTW.MPI-3.3.10-gompi-2022a.eb --dump-env-script
This will dump a script `FFTW.MPI-3.3.10-gompi-2022a.env` that you can source to get the same environment that EasyBuild has during the build. Then, check one of your prior builds (done before you added the 'stop' in the yaml file) to see what command was executed by EasyBuild as its test step, and in which directory. The logs are pretty verbose, so it may be a bit of a puzzle to find, but at least they show all that information.
Then, source that `FFTW.MPI-3.3.10-gompi-2022a.env`, go to the directory in which EasyBuild normally runs its test step (or an equivalent dir: your tempdir might be different between your stopped build and the prior build you inspected the logs for, so the prefix might look a little different), and run the command that EasyBuild ran as its test step. That last command, you should be able to put in a loop.
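Putting those steps together, the session might look roughly like this (the builddir path is illustrative; take the real test command and directory from your prior build's log):

```shell
# Source the environment EasyBuild dumped for the stopped build, go to the
# MPI test dir, and loop the test step until it fails.
source FFTW.MPI-3.3.10-gompi-2022a.env
cd /tmp/try1/easybuild/build/FFTWMPI/3.3.10/gompi-2022a/fftw-3.3.10/mpi  # illustrative
n=0
while perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10 \
      --mpi "mpirun -np 4 $(pwd)/mpi-bench"; do
  n=$((n + 1))
  echo "pass $n OK"
done
echo "failed after $n successful passes"
```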
By the way, your `installpath` from the `eb --show-config` output shows that you are indeed using the `neoverse_v1` copy of the software stack (which should be the case since you use the override), so that's good.
I'm absolutely puzzled by why things are different for you than for us. Short of seeing if we could have you test things on our cluster, I don't know what else to try for you to reproduce the failure... :/ If that's something you would be up for, see if you can reach out to @boegel on the EESSI Slack in a DM (join here if you're not yet on that channel); he might be able to arrange it for you.
@boegel maybe you could also do the reverse: spin up a regular VM outside of our Magic Castle setup and see if you can reproduce the issue there? If not, it must be related to our cluster setup somehow...
Also a heads up: I'm going to be on quite a long leave, so won't be able to respond for the next month or so. Again, maybe @boegel can follow up if needed :)
Thank you for the testing insight and the slack invite. Enjoy the break. I'll talk to @boegel on slack and see what he thinks is a reasonable next step.
@lrbison When would you like to follow up on this?
I talked offline with Kenneth.
In the mean time, my pattern-matching neurons fired:
both https://github.com/FFTW/fftw3/issues/334#issuecomment-1820587375 and https://gitlab.com/eessi/support/-/issues/24#note_1734228961 have something in common:
Both are in mca_btl_smcuda_component_progress from the smcuda module, but I recall smcuda should really only be engaged when CUDA/ROCm/{accelerator} memory is used; otherwise we should be using the SM BTL. I'll follow up on that.
Another similarity is that although the fftw backtrace is just from a sendrecv, the hang was stopped during allreduce, and both the OpenFOAM and FFTW cases were doing ompi_coll_base_allreduce_intra_recursivedoubling. However, my gut tells me it's not the reduction at fault but rather the progress engine (partially because I know for a fact we are testing that allreduce function daily without issue).
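If smcuda involvement is the suspicion, one way to test that hypothesis is to exclude the component at run time and see whether the hangs and failures disappear (standard Open MPI MCA component-exclusion syntax; the rest of the command is the check invocation from earlier in this thread):

```shell
# Run the FFTW check with the smcuda BTL excluded, so Open MPI falls back
# to its other shared-memory transports.
perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10 \
  --mpi "mpirun --mca btl ^smcuda -np 4 $(pwd)/mpi-bench"
```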
Moving the rest of this discussion to https://gitlab.com/eessi/support/-/issues/41
The root cause was https://github.com/open-mpi/ompi/issues/12270, fixed in https://github.com/open-mpi/ompi/pull/12338, so this issue can be closed.
For Neoverse V1 users: if you can also try and report on the release-for-testing in #315, it would be useful to get SVE support upstream.
Closing as requested.
I've built FFTW on an ARM neoverse_v1 architecture. However, when running the test suite (`make check`) I get occasional failures. The strange thing is: they don't happen consistently, and they don't always happen in the same test. Two example (partial) outputs I've had: one with an error in the 3-CPU tests, and one with a failure in the 4-CPU part of the tests.
When run interactively, I seem to get these failures about 1 out of 10 times. I also experience the occasional hang (looks like a deadlock, but I'm not sure).
We also do this build in an automated (continuous deployment) environment, where it is built within a SLURM job. For some reason, there, it always seems to fail (or at least the failure rate is high enough that 5 attempts haven't led to a successful run).
My questions here: