Open antoine-morvan opened 1 year ago
Thanks Antoine. We'd need to dig into this to understand it more, but right away I can say that Fiat isn't the problem here. The problem is that the iterative algorithm used to compute the points and weights required for Gaussian quadrature in the Legendre transform failed to converge. That's what SUGAW does. Fiat appears in the backtrace only because it provides the abort handler which ecTrans calls (ABOR1).
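For readers unfamiliar with this kind of failure, here is a minimal Python sketch of a textbook Newton iteration for Gauss-Legendre points and weights with a hard cap on the number of steps. It is not the SUGAW implementation: the function names, the initial guess, and the tolerance are assumptions chosen for illustration, and only the cap of 20 mirrors the "ALLOWED : 20" line in the log.

```python
import math


def legendre_and_derivative(n, x):
    """Return (P_n(x), P_n'(x)) using the three-term recurrence (needs |x| < 1)."""
    if n == 0:
        return 1.0, 0.0
    p_prev, p = 1.0, x  # P_0 and P_1
    for k in range(2, n + 1):
        p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
    # Standard identity: (x^2 - 1) P_n'(x) = n (x P_n(x) - P_{n-1}(x))
    dp = n * (x * p - p_prev) / (x * x - 1.0)
    return p, dp


def gauss_legendre(n, max_iter=20, tol=1e-14):
    """Hypothetical helper: Gauss-Legendre nodes and weights on [-1, 1].

    Raises RuntimeError when a root needs more than max_iter Newton steps,
    loosely mimicking the "CONVERGENCE FAILED" abort (an assumption, not
    the actual ecTrans logic).
    """
    nodes, weights = [], []
    for i in range(1, n + 1):
        # Common asymptotic initial guess for the i-th root
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))
        for _ in range(max_iter):
            p, dp = legendre_and_derivative(n, x)
            dx = p / dp
            x -= dx
            if abs(dx) <= tol:
                break
        else:
            # The loop ran out of steps without meeting the tolerance
            raise RuntimeError(f"convergence failed: allowed {max_iter} iterations")
        _, dp = legendre_and_derivative(n, x)
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


if __name__ == "__main__":
    nodes, weights = gauss_legendre(5)
    print(nodes)
    print(sum(weights))  # the weights of any Gauss-Legendre rule sum to 2
```

In a healthy build the iteration needs only a handful of steps per root, so exhausting the cap usually points to bad arithmetic somewhere in the loop rather than to a cap that is too small.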
This could be annoying to debug because it means some arithmetic error is happening elsewhere. My guess is that something wrong is happening in this file with those compile options: https://github.com/ecmwf-ifs/ectrans/blob/main/src/trans/internal/cpledn_mod.F90. Are you able to compile with floating-point exception trapping enabled? E.g., with the Cray compiler this is off by default, and so FPEs can manifest as other kinds of errors. You have to add -Ktrap=fp to trap these exceptions.
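To illustrate how an untrapped floating-point exception can surface as a different kind of error, here is a small Python sketch. It is only an analogy, not a diagnosis of this issue: `newton_sqrt` is a hypothetical stand-in for any capped iteration, and because Python itself raises on operations such as `0.0 / 0.0`, a NaN is passed in directly to play the role of an invalid result produced upstream.

```python
import math


def newton_sqrt(a, max_iter=20, tol=1e-14):
    """Hypothetical capped Newton iteration for sqrt(a), a > 0.

    Returns (value, iterations_used) or raises RuntimeError when the
    step size never drops below tol within max_iter steps.
    """
    x = a
    for iteration in range(1, max_iter + 1):
        dx = (x * x - a) / (2.0 * x)
        x -= dx
        # Any comparison involving NaN is False, so a NaN step never "converges"
        if abs(dx) <= tol:
            return x, iteration
    raise RuntimeError(f"convergence failed: allowed {max_iter} iterations")


if __name__ == "__main__":
    print(newton_sqrt(2.0))
    try:
        newton_sqrt(float("nan"))  # NaN stands in for an untrapped FPE upstream
    except RuntimeError as err:
        print("NaN input surfaced as:", err)
```

The NaN case fails only at the iteration cap, far from where the invalid value was created, which is why trapping at the source can save time when it is available.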
I suppose you are referring to the -ftrapping-math flag from ACFL (https://developer.arm.com/documentation/101458/2304/Compiler-options/-ftrapping-math?lang=en), which should be enabled by default when using -O3.
I just launched a run forcing this option, and the output does not change:
Build:
build type : None
timestamp : 20230719150518
op. system : Linux-5.10.179-171.711.amzn2.aarch64 (linux.64)
processor : aarch64
c compiler : Clang 16.0.2
flags : -O3 -mcpu=native -DNDEBUG -ftrapping-math -pipe
fortran compiler: Flang 99.99.1
flags : -O3 -mcpu=native -DNDEBUG -ftrapping-math
Features:
MPI : 1
OMP : 1
MKL : 0
FFTW : 0
TRANSI : 1
Dependencies:
fiat version (1.1.2), git-sha1 bea406a
CMD: mpiexec -n 1 ectrans-benchmark-dp --meminfo --norms -n 20 -f 5 -l 40 --vordiv
CONVERGENCE FAILED IN SUGAW
ALLOWED : 20NECESSARY : 21
ABORT! 1 FAILURE IN SUGAW
ABORT_TRANS CALLED
FAILURE IN SUGAW
SDL_TRACEBACK [PROC=1,THRD=1] ...
[LinuxTraceBack] Backtrace(s) for program './bin/ectrans-benchmark-dp' : sigcontextptr=0xffffe023af80
[LinuxTraceBack] Backtrace (size = 10) with addr2line-cmd
[LinuxTraceBack] /usr/bin/addr2line -fs -e './bin/ectrans-benchmark-dp' 0xffff7fb801f4 0xffff7fbd140c 0xffff7fe15524 0xffff7fe4ba48 0xffff7fe4c6a4 0xffff7fe8699c 0xaaaacf644014 0xaaaacf642510 0xffff7f06bda4 0xaaaacf6423f4
[LinuxTraceBack] [00]: libfiat.so(LinuxTraceBack+0x194) [0xffff7fb801f4] : ??() at ??:0
[LinuxTraceBack] [01]: libfiat.so(sdl_mod_sdl_traceback_+0x16c) [0xffff7fbd140c] : ??() at ??:0
[LinuxTraceBack] [02]: libtrans_dp.so(abort_trans_mod_abort_trans_+0x214) [0xffff7fe15524] : ??() at ??:0
[LinuxTraceBack] [03]: libtrans_dp.so(sugaw_mod_sugaw_+0xe98) [0xffff7fe4ba48] : ??() at ??:0
[LinuxTraceBack] [04]: libtrans_dp.so(suleg_mod_suleg_+0x9d4) [0xffff7fe4c6a4] : ??() at ??:0
[LinuxTraceBack] [05]: libtrans_dp.so(setup_trans_+0x1cdc) [0xffff7fe8699c] : ??() at ??:0
[LinuxTraceBack] [06]: ectrans-benchmark-dp(+0x4014) [0xaaaacf644014] : ??() at ??:0
[LinuxTraceBack] [07]: ectrans-benchmark-dp(+0x2510) [0xaaaacf642510] : ??() at ??:0
[LinuxTraceBack] [08]: libc.so.6(__libc_start_main+0xe4) [0xffff7f06bda4] : ??() at ??:0
[LinuxTraceBack] [09]: ectrans-benchmark-dp(+0x23f4) [0xaaaacf6423f4] : ??() at ??:0
[LinuxTraceBack] End of backtrace(s)
SDL_TRACEBACK [PROC=1,THRD=1] ... DONE
[ip-10-0-7-69:06888] *** Process received signal ***
[ip-10-0-7-69:06888] Signal: Aborted (6)
[ip-10-0-7-69:06888] Signal code: (-6)
[ip-10-0-7-69:06888] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffff80161860]
[ip-10-0-7-69:06888] [ 1] /lib64/libpthread.so.0(raise+0xb0)[0xffff7f2064b0]
[ip-10-0-7-69:06888] [ 2] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/mcputrap/build/fiat_prefix/lib64/libfiat.so(sdl_mod_sdl_srlabort_+0x10)[0xffff7fbd1500]
[ip-10-0-7-69:06888] [ 3] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/mcputrap/build/ectrans_prefix/bin/../lib64/libtrans_dp.so(sugaw_mod_sugaw_+0xe98)[0xffff7fe4ba48]
[ip-10-0-7-69:06888] [ 4] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/mcputrap/build/ectrans_prefix/bin/../lib64/libtrans_dp.so(suleg_mod_suleg_+0x9d4)[0xffff7fe4c6a4]
[ip-10-0-7-69:06888] [ 5] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/mcputrap/build/ectrans_prefix/bin/../lib64/libtrans_dp.so(setup_trans_+0x1cdc)[0xffff7fe8699c]
[ip-10-0-7-69:06888] [ 6] ./bin/ectrans-benchmark-dp(+0x4014)[0xaaaacf644014]
[ip-10-0-7-69:06888] [ 7] ./bin/ectrans-benchmark-dp(+0x2510)[0xaaaacf642510]
[ip-10-0-7-69:06888] [ 8] /lib64/libc.so.6(__libc_start_main+0xe4)[0xffff7f06bda4]
[ip-10-0-7-69:06888] [ 9] ./bin/ectrans-benchmark-dp(+0x23f4)[0xaaaacf6423f4]
[ip-10-0-7-69:06888] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-10-0-7-69 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Right, so that's not the problem.
Firstly it's no surprise that you see this problem with both double and single precision, as this part of the code is always double precision regardless of the working precision JPRB.
Secondly I can suggest something to make the code run, but it won't necessarily be doing the right thing. Can you change this value to 21 and recompile/rerun? This will permit this one case of the iteration taking 21 steps. Maybe this will then work. It will be interesting to look at the error at the end of the program.
If that doesn't work, is it a problem if you just have to disable -mcpu=native temporarily? The problem is almost certainly in here or here. To debug it I would just try to compare raw values of the relevant arrays between a run with and without -mcpu=native, and see where they begin to differ. But this might be time consuming and I don't think I will have time to find an ARM machine and do it myself for the time being...
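The comparison described above could be sketched as follows in Python, assuming the arrays from both builds have already been dumped and loaded as flat lists of floats. How they are dumped from the Fortran side is left open, and `first_divergence` and its tolerance parameters are hypothetical names used only for illustration.

```python
import math


def first_divergence(reference, candidate, rel_tol=0.0, abs_tol=0.0):
    """Return (index, ref_value, cand_value) for the first differing element.

    Returns None when the sequences agree everywhere. With the default
    tolerances of 0.0 the comparison is exact; NaN is treated as matching
    only another NaN at the same position.
    """
    if len(reference) != len(candidate):
        raise ValueError("arrays must have the same length")
    for i, (a, b) in enumerate(zip(reference, candidate)):
        both_nan = math.isnan(a) and math.isnan(b)
        if not both_nan and not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            return i, a, b
    return None


if __name__ == "__main__":
    # Made-up values standing in for dumps without and with -mcpu=native
    without_native = [1.0, 0.5, 0.25, 0.125]
    with_native = [1.0, 0.5, 0.26, float("nan")]
    print(first_divergence(without_native, with_native))
```

Starting with an exact comparison and loosening `rel_tol` afterwards helps separate harmless last-bit differences from the point where the values genuinely go wrong.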
Here is the log with the value changed to 21:
mpiexec -n 1 ectrans-benchmark-dp --meminfo --norms -n 20 -f 5 -l 40 --vordiv
======= Start of runtime parameters =======
nsmax 79
grid O80
ndgl 160
nproc 1
nthread 4
nprgpns 1
nprgpew 1
nprtrw 1
nprtrv 1
ngptot 28480
ngptotg 28480
nfld 5
nlev 40
nproma 28480
ngpblks 1
nspec2 6480
nspec2g 6480
luseflt F
lvordiv T
lscders F
luvders F
======= End of runtime parameters =======
transform_test initialisation, on 1 tasks, took 0.12 sec
======= Start of spectral transforms =======
time step 1 took 1.1938 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 2 took 1.1164 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 3 took 1.0858 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 4 took 1.1284 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 5 took 1.1092 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 6 took 1.0529 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 7 took 1.1213 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 8 took 1.1463 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 9 took 1.1557 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 10 took 1.1630 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 11 took 1.1363 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 12 took 1.1598 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 13 took 1.1745 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 14 took 1.1437 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 15 took 1.1547 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 16 took 1.1305 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 17 took 1.1347 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 18 took 1.1759 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 19 took 1.1506 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step 20 took 1.1596 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
======= End of spectral transforms =======
max error zspvor(1:nlev,:) = -0.999E+03
max error zspdiv(1:nlev,:) = -0.999E+03
max error zspsc3a(1:nlev,:,1) = -0.999E+03
max error zspsc2(1:1,:) = -0.999E+03
max error combined = = -0.999E+03
======= Start of time step stats =======
Inverse transforms
------------------
avg (s): 0.5508
min (s): 0.5101
max (s): 0.6185
med (s): 0.5543
Direct transforms
-----------------
avg (s): 0.5889
min (s): 0.5428
max (s): 0.6061
med (s): 0.5934
Inverse-direct transforms
-------------------------
avg (s): 1.1397
min (s): 1.0529
max (s): 1.1938
med (s): 1.1437
loop (s): 22.8352
======= End of time step stats =======
===-=== START OF TIMING STATISTICS ===-===
STATS FOR ALL TASKS
NUM ROUTINE CALLS MEAN(ms) MAX(ms) FRAC(%) UNBAL(%)
0 PROGRAM - Total 1 22959.078 22959.078 100.00 0.00
1 SETUP_TRANS0 - Setup ecTrans 1 0.167 0.167 0.00 0.00
2 SETUP_TRANS - Setup ecTrans handle 1 56.551 56.551 0.25 0.00
3 TIME STEP - Time step 20 1141.762 1141.762 99.46 0.00
4 INV_TRANS - Inverse transform 20 550.749 550.749 47.98 0.00
5 DIR_TRANS - Direct transform 20 588.910 588.910 51.30 0.00
6 NORMS - Norm comp. (optional) 20 2.098 2.098 0.18 0.00
102 LTINV_CTL - Inv. Legendre transform 20 39.602 39.602 3.45 0.00
103 LTDIR_CTL - Dir. Legendre transform 20 42.603 42.603 3.71 0.00
106 FTDIR_CTL - Dir. Fourier transform 20 387.233 387.233 33.73 0.00
107 FTINV_CTL - Inv. Fourier transform 20 411.043 411.043 35.81 0.00
140 SULEG - Comp. of Leg. poly. 1 53.650 53.650 0.23 0.00
152 LTINV_CTL - M to L transposition 20 2.446 2.446 0.21 0.00
153 LTDIR_CTL - L to M transposition 20 11.656 11.656 1.02 0.00
157 FTINV_CTL - L to G transposition 20 93.652 93.652 8.16 0.00
158 FTDIR_CTL - G to L transposition 20 140.805 140.805 12.27 0.00
400 GSTATS - GSTATS itself 1497 0.001 0.001 0.00 0.00
TOTAL MEASURED IMBALANCE = 0.0 SECONDS, 0.0 PERCENT
TOTAL WALLCLOCK TIME 22.959 CPU TIME 22.959 VECTOR TIME 22.959
===-=== END OF TIMING STATISTICS ===-===
EC_MEMINFO@RNSORT: MINCID, MAXCID, NUMNODES = 0 0 1
## EC_MEMINFO
## EC_MEMINFO : MPI-version 3.1
## EC_MEMINFO : Start of MPI-library version
Open MPI v4.1.5, package: Open MPI amorvan@compute-dy-node-3 Distribution, ident: 4.1.5, repo rev: v4.1.5, Feb 23, 2023
## EC_MEMINFO : End of MPI-library version
## EC_MEMINFO : OpenMP-version 4.0.201307
## EC_MEMINFO : CPU-model :
## EC_MEMINFO : Hugepages : 2097152 bytes/page x 0 pages = 0 bytes
## EC_MEMINFO
## EC_MEMINFO
## EC_MEMINFO ********************************************************************************
## EC_MEMINFO *** Mapping of MPI & I/O-tasks to nodes and tasks' thread-to-core affinities ***
## EC_MEMINFO ********************************************************************************
## EC_MEMINFO
## EC_MEMINFO Running on 1 nodes (0-numa) with 1 compute + 0 I/O-tasks and 4+0 threads
## EC_MEMINFO
## EC_MEMINFO # NODE# NODENAME MPI# WORLD# GETPID I/O# MASTER REF# OMP# Core affinities
## EC_MEMINFO = ===== ======== ==== ====== ====== ==== ====== ==== ==== ===============
## EC_MEMINFO
## EC_MEMINFO 0 0 ip-10-0-7-69 0 0 548 [No] Yes 0 4 {0,0,0,0}
## EC_MEMINFO
## EC_MEMINFO | TC | MEMORY USED(MB) MEMORY FREE(MB) | %USED %HUGE | Energy Power LoadAvg
## EC_MEMINFO | Malloc| Inc Heap | | | (J) (W)
## EC_MEMINFO Node Name | Heap | RSS(sum) | Total |
## EC_MEMINFO | (sum) | Small Huge | Memfree+Cached |
## EC_MEMINFO 0 ip-10-0-7-69 0 293 0 2312 3816 4.6 0.0 0 0 2.94 Sm/p:
Obviously it's wrong, with a max error of -0.999E+03 =)
You can reproduce this on the A64FX; you just need to set up ACFL 23.04.1. It's free to set up and use nowadays.
I'll dump the arrays if I find some time :)
Hello,
FYI, I tried with the latest release of ACFL (24.04). I could not reproduce the issue with that version of the ARM compiler.
I'll let you try it out and close the issue if this works for you.
Best regards.
Hi Antoine, at the moment we do not have access to any machines with the ARM compiler (or at least, I don't personally). So we don't currently have a way to test this. I wish I had a Raspberry Pi right now... I think for now we can just close this issue.
If it's the ARM compiler, it's free, and you can install it in your $HOME.
If it's an ARM CPU, that's another matter :)
If you trust me enough, you can certainly close the issue ;)
@samhatfield
Hello,
I am playing with ecTrans on the Graviton3 system. Compiling with ACFL (Arm Compiler for Linux = armclang/armflang) caused the app to crash when using some performance flags. I confirmed that the issue happens on other systems with SVE, but not on systems without. Below is a table summarizing my experiments.
The hardware I tested on:
The software stack consists of:
And the command run is mpiexec -n 1 ./ectrans-benchmark-dp --meminfo --norms -n 20 -f 5 -l 40 --vordiv. Note that similar behavior occurs with single precision.
As we can see, the non-SVE Ampere Q8030 system seems unaffected by this issue, whereas both SVE systems exhibit similar behavior. We can also observe that removing the -mcpu=native flag leads to a successful run.
Typical output when crashing looks like this (here, a run on Graviton3 using the double-precision benchmark):
The build uses the following parameters (excerpt from this full script: https://gist.github.com/antoine-morvan/611c4d779fd704279bb0b938598fb597):
Then this benchmark causes the run to fail:
Looking at the backtrace it feels like the problem originates from fiat, but I did not investigate further.
Also, ARM is aware of this issue.
Feel free to ask for more details.
Best.