ecmwf-ifs / ectrans

Global spherical harmonics transforms library underpinning the IFS
Apache License 2.0

Crash when compiling with ACFL and '-O3 -mcpu=native' flags #28

Open antoine-morvan opened 1 year ago

antoine-morvan commented 1 year ago

@samhatfield

Hello,

I am testing ecTrans on a Graviton3 system. When compiled with ACFL (Arm Compiler for Linux, i.e. armclang/armflang) and certain performance flags, the application crashes. I confirmed that the issue also happens on other systems with SVE, but not on systems without SVE. The table below summarizes my experiments.

The hardware I tested on:

The software stack consists of:

And the command run is mpiexec -n 1 ./ectrans-benchmark-dp --meminfo --norms -n 20 -f 5 -l 40 --vordiv. Note that similar behavior occurs with single precision.

| System | Compiler | Flags | Build | Exec | Max Error |
|---|---|---|---|---|---|
| AmpereQ8030 | gcc-13.1.0 | perf | :white_check_mark: (-O3 -ffast-math -mcpu=native -g -fno-omit-frame-pointer -pipe) | :white_check_mark: | :white_check_mark: (0.292E-13) |
| | acfl-23.04.1 | perf | :white_check_mark: (-O3 -mcpu=native -ffp-model=fast -fsimdmath -g -fno-omit-frame-pointer -pipe) | :white_check_mark: | :white_check_mark: (0.232E-13) |
| Fuji_A64FX | gcc-13.1.0 | perf | :white_check_mark: (-O3 -ffast-math -mcpu=native -g -fno-omit-frame-pointer -pipe) | :white_check_mark: | :white_check_mark: (0.576E-13) |
| | acfl-23.04.1 | perf | :white_check_mark: (-O3 -mcpu=native -ffp-model=fast -fsimdmath -g -fno-omit-frame-pointer -pipe) | :x: | :zap: |
| | | nosimdmath | :white_check_mark: (-O3 -mcpu=native -ffp-model=fast -DNDEBUG -pipe) | :x: | :zap: |
| | | mcpu | :white_check_mark: (-O3 -mcpu=native -DNDEBUG -pipe) | :x: | :zap: |
| | | normal | :white_check_mark: (-O3 -DNDEBUG -pipe) | :white_check_mark: | :white_check_mark: (0.232E-13) |
| Graviton3 | gcc-13.1.0 | perf | :white_check_mark: (-O3 -ffast-math -mcpu=native -g -fno-omit-frame-pointer -pipe) | :white_check_mark: | :white_check_mark: (0.349E-13) |
| | acfl-23.04.1 | perf | :white_check_mark: (-O3 -mcpu=native -ffp-model=fast -fsimdmath -g -fno-omit-frame-pointer -pipe) | :x: | :zap: |
| | | nosimdmath | :white_check_mark: (-O3 -mcpu=native -ffp-model=fast -DNDEBUG -pipe) | :x: | :zap: |
| | | mcpu | :white_check_mark: (-O3 -mcpu=native -DNDEBUG -pipe) | :x: | :zap: |
| | | normal | :white_check_mark: (-O3 -DNDEBUG -pipe) | :white_check_mark: | :white_check_mark: (0.232E-13) |

As we can see, the non-SVE AmpereQ8030 system seems unaffected by this issue, whereas both SVE systems exhibit the same behavior. We can also observe that removing the -mcpu=native flag leads to a successful run.
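The narrowing-down above can be expressed as a small set computation: take the flags common to every failing ACFL build and remove any flag that also appears in a passing one. This is only an illustrative sketch; the `RESULTS` dictionary and the `suspect_flags` helper are hypothetical names, and the data is copied from the ACFL rows reported for the SVE systems.

```python
# Flag presets and outcomes for the ACFL builds on the SVE systems,
# transcribed from the table in this report (True = run succeeded).
RESULTS = {
    "perf": (["-O3", "-mcpu=native", "-ffp-model=fast", "-fsimdmath",
              "-g", "-fno-omit-frame-pointer", "-pipe"], False),
    "nosimdmath": (["-O3", "-mcpu=native", "-ffp-model=fast",
                    "-DNDEBUG", "-pipe"], False),
    "mcpu": (["-O3", "-mcpu=native", "-DNDEBUG", "-pipe"], False),
    "normal": (["-O3", "-DNDEBUG", "-pipe"], True),
}


def suspect_flags(results):
    """Return flags present in every failing build and in no passing build."""
    failing = [set(flags) for flags, ok in results.values() if not ok]
    passing = [set(flags) for flags, ok in results.values() if ok]
    if not failing:
        return set()
    common_to_failures = set.intersection(*failing)
    seen_in_passes = set().union(*passing)
    return common_to_failures - seen_in_passes


print(suspect_flags(RESULTS))  # prints {'-mcpu=native'}
```

This only singles out a flag; it says nothing about which optimization enabled by that flag is responsible.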

Typical output when crashing looks like this (here, a run on Graviton3 using the double-precision benchmark):

CMD: mpiexec -n 1 ectrans-benchmark-dp --meminfo --norms -n 20 -f 5 -l 40 --vordiv
 CONVERGENCE FAILED IN SUGAW 
 ALLOWED :   20NECESSARY :   21
 ABORT_TRANS CALLED
  FAILURE IN SUGAW 
 ABORT!    1  FAILURE IN SUGAW 
SDL_TRACEBACK [PROC=1,THRD=1] ...
[LinuxTraceBack] Backtrace(s) for program './bin/ectrans-benchmark-dp' : sigcontextptr=0xffffdbc0c8c0
[LinuxTraceBack] Backtrace (size = 10) with addr2line-cmd
[LinuxTraceBack] /usr/bin/addr2line -fs -e './bin/ectrans-benchmark-dp' 0xffff8a64f010 0xffff8a6a11ac 0xffff8a8e5dc4 0xffff8a91d4a8 0xffff8a91e1f0 0xffff8a95923c 0xaaaaaf7d4014 0xaaaaaf7d2510 0xffff89b3ada4 0xaaaaaf7d23f4
[LinuxTraceBack] [00]: libfiat.so(LinuxTraceBack+0x190) [0xffff8a64f010] : ??() at ??:0
[LinuxTraceBack] [01]: libfiat.so(sdl_mod_sdl_traceback_+0x16c) [0xffff8a6a11ac] : ??() at ??:0
[LinuxTraceBack] [02]: libtrans_dp.so(abort_trans_mod_abort_trans_+0x214) [0xffff8a8e5dc4] : ??() at ??:0
[LinuxTraceBack] [03]: libtrans_dp.so(sugaw_mod_sugaw_+0xe98) [0xffff8a91d4a8] : ??() at ??:0
[LinuxTraceBack] [04]: libtrans_dp.so(suleg_mod_suleg_+0xac0) [0xffff8a91e1f0] : ??() at ??:0
[LinuxTraceBack] [05]: libtrans_dp.so(setup_trans_+0x1cdc) [0xffff8a95923c] : ??() at ??:0
[LinuxTraceBack] [06]: ectrans-benchmark-dp(+0x4014) [0xaaaaaf7d4014] : ??() at ??:0
[LinuxTraceBack] [07]: ectrans-benchmark-dp(+0x2510) [0xaaaaaf7d2510] : ??() at ??:0
[LinuxTraceBack] [08]: libc.so.6(__libc_start_main+0xe4) [0xffff89b3ada4] : ??() at ??:0
[LinuxTraceBack] [09]: ectrans-benchmark-dp(+0x23f4) [0xaaaaaf7d23f4] : ??() at ??:0
[LinuxTraceBack] End of backtrace(s)
SDL_TRACEBACK [PROC=1,THRD=1] ... DONE
[ip-10-0-7-69:01665] *** Process received signal ***
[ip-10-0-7-69:01665] Signal: Aborted (6)
[ip-10-0-7-69:01665] Signal code:  (-6)
[ip-10-0-7-69:01665] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffff8ac30860]
[ip-10-0-7-69:01665] [ 1] /lib64/libpthread.so.0(raise+0xb0)[0xffff89cd54b0]
[ip-10-0-7-69:01665] [ 2] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/perf/build/fiat_prefix/lib64/libfiat.so(sdl_mod_sdl_srlabort_+0x10)[0xffff8a6a12a0]
[ip-10-0-7-69:01665] [ 3] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/perf/build/ectrans_prefix/bin/../lib64/libtrans_dp.so(sugaw_mod_sugaw_+0xe98)[0xffff8a91d4a8]
[ip-10-0-7-69:01665] [ 4] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/perf/build/ectrans_prefix/bin/../lib64/libtrans_dp.so(suleg_mod_suleg_+0xac0)[0xffff8a91e1f0]
[ip-10-0-7-69:01665] [ 5] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/perf/build/ectrans_prefix/bin/../lib64/libtrans_dp.so(setup_trans_+0x1cdc)[0xffff8a95923c]
[ip-10-0-7-69:01665] [ 6] ./bin/ectrans-benchmark-dp(+0x4014)[0xaaaaaf7d4014]
[ip-10-0-7-69:01665] [ 7] ./bin/ectrans-benchmark-dp(+0x2510)[0xaaaaaf7d2510]
[ip-10-0-7-69:01665] [ 8] /lib64/libc.so.6(__libc_start_main+0xe4)[0xffff89b3ada4]
[ip-10-0-7-69:01665] [ 9] ./bin/ectrans-benchmark-dp(+0x23f4)[0xaaaaaf7d23f4]
[ip-10-0-7-69:01665] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-10-0-7-69 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

The build uses the following parameters (excerpt from this full script: https://gist.github.com/antoine-morvan/611c4d779fd704279bb0b938598fb597):

normal)
    export CFLAGS="-O3 -DNDEBUG"
    export FCFLAGS="$CFLAGS"
    export CXXFLAGS="$CXXFLAGS"
    CMAKE_BUILD_TYPE="None"
    ;;
mcpu)
    export CFLAGS="-O3 -mcpu=native -DNDEBUG"
    export FCFLAGS="$CFLAGS"
    export CXXFLAGS="$CXXFLAGS"
    CMAKE_BUILD_TYPE="None"
    ;;

(cd ${fiat_BUILD} && cmake \
        -DCMAKE_BUILD_TYPE="$CMAKE_BUILD_TYPE" \
        -DCMAKE_Fortran_FLAGS="$FCFLAGS" \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_INSTALL_PREFIX=${fiat_ROOT} \
        -DENABLE_TESTS=OFF \
        ${fiat_SRC} && make -j && make install)

(cd ${ectrans_BUILD} && cmake \
        -DCMAKE_BUILD_TYPE="$CMAKE_BUILD_TYPE" \
        -DCMAKE_Fortran_FLAGS="$FCFLAGS" \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_INSTALL_PREFIX=${ectrans_ROOT} \
        -DENABLE_TESTS=ON \
        ${ectrans_SRC} && make -j && make install)

Running this benchmark then fails:

export OMP_NUM_THREADS=4
export MPI_NUM_RANKS=1

BINARY=ectrans-benchmark-dp
PROFILEARGS="--meminfo --norms"
ARGS="-n 20 -f 5 -l 40 --vordiv"
(cd ${ectrans_ROOT} \
    && mpirun -n $MPI_NUM_RANKS ./bin/$BINARY $PROFILEARGS $ARGS )

Looking at the backtrace, it seems the problem originates from fiat, but I did not investigate further.

Also, ARM is aware of this issue.

Feel free to ask for more details.

Best.

samhatfield commented 1 year ago

Thanks Antoine. We'd need to dig into this to understand it more, but right away I can say that Fiat isn't the problem here. The problem is that the iterative algorithm used to compute the points and weights required for Gaussian quadrature in the Legendre transform failed to converge. That's what SUGAW does. Fiat appears in the backtrace only because it provides the abort handler which ecTrans calls (ABOR1).
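To make the failure mode concrete, here is a minimal sketch of an iterative computation of Gaussian quadrature points and weights with a cap on the number of iterations. It is a textbook Newton iteration on the Legendre polynomial, not the actual SUGAW code (whether SUGAW uses exactly this scheme is an assumption); the function names are hypothetical, and the default cap of 20 simply mirrors the "ALLOWED : 20" line in the log.

```python
import math


def legendre_and_derivative(n, x):
    """Return P_n(x) and P_n'(x) using the three-term recurrence (n >= 1)."""
    p_prev, p = 1.0, x  # P_0 and P_1
    for k in range(2, n + 1):
        p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
    # (x^2 - 1) P_n'(x) = n (x P_n(x) - P_{n-1}(x)); valid for |x| < 1.
    dp = n * (x * p - p_prev) / (x * x - 1.0)
    return p, dp


def gauss_legendre(n, max_iter=20, tol=1e-14):
    """Compute Gauss-Legendre nodes and weights by Newton iteration.

    Raises RuntimeError if any root needs more than max_iter steps,
    which is the kind of condition reported as a convergence failure.
    """
    nodes, weights = [], []
    for i in range(1, n + 1):
        # Standard asymptotic first guess for the i-th root.
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))
        for _ in range(max_iter):
            p, dp = legendre_and_derivative(n, x)
            dx = p / dp
            x -= dx
            if abs(dx) <= tol:
                break
        else:
            raise RuntimeError(f"root {i}: no convergence in {max_iter} steps")
        _, dp = legendre_and_derivative(n, x)
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


nodes, weights = gauss_legendre(160)
print(abs(sum(weights) - 2.0) < 1e-10)  # weights should sum to 2
```

With correct arithmetic an iteration like this converges in a handful of steps, so hitting the cap suggests that an intermediate value (here, the polynomial or its derivative) is being computed wrongly rather than that the cap is too tight.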

This could be annoying to debug because it means some arithmetic error is happening elsewhere. My guess is that something wrong is happening in this file with those compile options: https://github.com/ecmwf-ifs/ectrans/blob/main/src/trans/internal/cpledn_mod.F90. Are you able to compile with floating-point exception trapping enabled? E.g., with the Cray compiler this is off by default, and so FPEs can manifest as other kinds of errors. You have to add -Ktrap=fp to trap these exceptions.

antoine-morvan commented 1 year ago

I suppose you are referring to the -ftrapping-math flag from ACFL (https://developer.arm.com/documentation/101458/2304/Compiler-options/-ftrapping-math?lang=en), which should be enabled by default when using -O3.

I just launched a run forcing this option, and the output does not change:

Build:
  build type      : None
  timestamp       : 20230719150518
  op. system      : Linux-5.10.179-171.711.amzn2.aarch64 (linux.64)
  processor       : aarch64
  c compiler      : Clang 16.0.2
    flags         : -O3 -mcpu=native -DNDEBUG -ftrapping-math -pipe
  fortran compiler: Flang 99.99.1
    flags         : -O3 -mcpu=native -DNDEBUG -ftrapping-math

Features:
  MPI             : 1
  OMP             : 1
  MKL             : 0
  FFTW            : 0
  TRANSI          : 1

Dependencies:
  fiat version (1.1.2), git-sha1 bea406a
CMD: mpiexec -n 1 ectrans-benchmark-dp --meminfo --norms -n 20 -f 5 -l 40 --vordiv
 CONVERGENCE FAILED IN SUGAW
 ALLOWED :   20NECESSARY :   21
 ABORT!    1  FAILURE IN SUGAW
 ABORT_TRANS CALLED
  FAILURE IN SUGAW
SDL_TRACEBACK [PROC=1,THRD=1] ...
[LinuxTraceBack] Backtrace(s) for program './bin/ectrans-benchmark-dp' : sigcontextptr=0xffffe023af80
[LinuxTraceBack] Backtrace (size = 10) with addr2line-cmd
[LinuxTraceBack] /usr/bin/addr2line -fs -e './bin/ectrans-benchmark-dp' 0xffff7fb801f4 0xffff7fbd140c 0xffff7fe15524 0xffff7fe4ba48 0xffff7fe4c6a4 0xffff7fe8699c 0xaaaacf644014 0xaaaacf642510 0xffff7f06bda4 0xaaaacf6423f4
[LinuxTraceBack] [00]: libfiat.so(LinuxTraceBack+0x194) [0xffff7fb801f4] : ??() at ??:0
[LinuxTraceBack] [01]: libfiat.so(sdl_mod_sdl_traceback_+0x16c) [0xffff7fbd140c] : ??() at ??:0
[LinuxTraceBack] [02]: libtrans_dp.so(abort_trans_mod_abort_trans_+0x214) [0xffff7fe15524] : ??() at ??:0
[LinuxTraceBack] [03]: libtrans_dp.so(sugaw_mod_sugaw_+0xe98) [0xffff7fe4ba48] : ??() at ??:0
[LinuxTraceBack] [04]: libtrans_dp.so(suleg_mod_suleg_+0x9d4) [0xffff7fe4c6a4] : ??() at ??:0
[LinuxTraceBack] [05]: libtrans_dp.so(setup_trans_+0x1cdc) [0xffff7fe8699c] : ??() at ??:0
[LinuxTraceBack] [06]: ectrans-benchmark-dp(+0x4014) [0xaaaacf644014] : ??() at ??:0
[LinuxTraceBack] [07]: ectrans-benchmark-dp(+0x2510) [0xaaaacf642510] : ??() at ??:0
[LinuxTraceBack] [08]: libc.so.6(__libc_start_main+0xe4) [0xffff7f06bda4] : ??() at ??:0
[LinuxTraceBack] [09]: ectrans-benchmark-dp(+0x23f4) [0xaaaacf6423f4] : ??() at ??:0
[LinuxTraceBack] End of backtrace(s)
SDL_TRACEBACK [PROC=1,THRD=1] ... DONE
[ip-10-0-7-69:06888] *** Process received signal ***
[ip-10-0-7-69:06888] Signal: Aborted (6)
[ip-10-0-7-69:06888] Signal code:  (-6)
[ip-10-0-7-69:06888] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffff80161860]
[ip-10-0-7-69:06888] [ 1] /lib64/libpthread.so.0(raise+0xb0)[0xffff7f2064b0]
[ip-10-0-7-69:06888] [ 2] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/mcputrap/build/fiat_prefix/lib64/libfiat.so(sdl_mod_sdl_srlabort_+0x10)[0xffff7fbd1500]
[ip-10-0-7-69:06888] [ 3] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/mcputrap/build/ectrans_prefix/bin/../lib64/libtrans_dp.so(sugaw_mod_sugaw_+0xe98)[0xffff7fe4ba48]
[ip-10-0-7-69:06888] [ 4] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/mcputrap/build/ectrans_prefix/bin/../lib64/libtrans_dp.so(suleg_mod_suleg_+0x9d4)[0xffff7fe4c6a4]
[ip-10-0-7-69:06888] [ 5] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/mcputrap/build/ectrans_prefix/bin/../lib64/libtrans_dp.so(setup_trans_+0x1cdc)[0xffff7fe8699c]
[ip-10-0-7-69:06888] [ 6] ./bin/ectrans-benchmark-dp(+0x4014)[0xaaaacf644014]
[ip-10-0-7-69:06888] [ 7] ./bin/ectrans-benchmark-dp(+0x2510)[0xaaaacf642510]
[ip-10-0-7-69:06888] [ 8] /lib64/libc.so.6(__libc_start_main+0xe4)[0xffff7f06bda4]
[ip-10-0-7-69:06888] [ 9] ./bin/ectrans-benchmark-dp(+0x23f4)[0xaaaacf6423f4]
[ip-10-0-7-69:06888] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-10-0-7-69 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

samhatfield commented 1 year ago

Right, so that's not the problem.

Firstly it's no surprise that you see this problem with both double and single precision, as this part of the code is always double precision regardless of the working precision JPRB.

Secondly I can suggest something to make the code run, but it won't necessarily be doing the right thing. Can you change this value to 21 and recompile/rerun? This will permit this one case of the iteration taking 21 steps. Maybe this will then work. It will be interesting to look at the error at the end of the program.

If that doesn't work, is it a problem if you just have to disable -mcpu=native temporarily? The problem is almost certainly in here or here. To debug it I would just try to compare raw values of the relevant arrays between a run with and without -mcpu=native, and see where they begin to differ. But this might be time consuming and I don't think I will have time to find an ARM machine and do it myself for the time being...
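The comparison suggested above could be sketched as follows, assuming the relevant arrays have been dumped from both builds as plain text with one value per line. The dump format and the names `parse_dump` and `first_divergence` are hypothetical; nothing like this exists in ecTrans itself.

```python
import math


def parse_dump(text):
    """Parse a text dump with one floating-point value per line."""
    return [float(line) for line in text.splitlines() if line.strip()]


def _differs(a, b, rel_tol):
    # NaN never compares close to anything, so handle it explicitly:
    # two NaNs count as equal, a NaN against a number counts as different.
    if math.isnan(a) or math.isnan(b):
        return not (math.isnan(a) and math.isnan(b))
    return not math.isclose(a, b, rel_tol=rel_tol)


def first_divergence(reference, candidate, rel_tol=1e-12):
    """Return (index, ref_value, cand_value) at the first mismatch, or None."""
    if len(reference) != len(candidate):
        raise ValueError("dumps have different lengths")
    for i, (a, b) in enumerate(zip(reference, candidate)):
        if _differs(a, b, rel_tol):
            return i, a, b
    return None


# Made-up values standing in for dumps from the two builds.
without_mcpu = parse_dump("1.0\n0.5\n0.25\n")
with_mcpu = parse_dump("1.0\n0.5\n0.2500001\n")
print(first_divergence(without_mcpu, with_mcpu))  # prints (2, 0.25, 0.2500001)
```

The relative tolerance is a judgment call: different vectorization can legitimately change the last few bits, so an exact comparison would flag harmless differences before the real one.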

antoine-morvan commented 1 year ago

> Right, so that's not the problem.
>
> Firstly it's no surprise that you see this problem with both double and single precision, as this part of the code is always double precision regardless of the working precision JPRB.
>
> Secondly I can suggest something to make the code run, but it won't necessarily be doing the right thing. Can you change this value to 21 and recompile/rerun? This will permit this one case of the iteration taking 21 steps. Maybe this will then work. It will be interesting to look at the error at the end of the program.

Here is the log:

 mpiexec -n 1 ectrans-benchmark-dp --meminfo --norms -n 20 -f 5 -l 40 --vordiv

======= Start of runtime parameters =======

nsmax     79
grid      O80
ndgl      160
nproc     1
nthread   4
nprgpns   1
nprgpew   1
nprtrw    1
nprtrv    1
ngptot    28480
ngptotg   28480
nfld      5
nlev      40
nproma    28480
ngpblks   1
nspec2    6480
nspec2g   6480
luseflt    F
lvordiv    T
lscders    F
luvders    F

======= End of runtime parameters =======

transform_test initialisation, on     1 tasks, took     0.12 sec

======= Start of spectral transforms  =======

time step      1 took  1.1938 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step      2 took  1.1164 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step      3 took  1.0858 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step      4 took  1.1284 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step      5 took  1.1092 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step      6 took  1.0529 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step      7 took  1.1213 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step      8 took  1.1463 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step      9 took  1.1557 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step     10 took  1.1630 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step     11 took  1.1363 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step     12 took  1.1598 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step     13 took  1.1745 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step     14 took  1.1437 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step     15 took  1.1547 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step     16 took  1.1305 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step     17 took  1.1347 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step     18 took  1.1759 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step     19 took  1.1506 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03
time step     20 took  1.1596 | zspvor max err=-0.999E+03 | zspdiv max err=-0.999E+03 | zspsc3a max err=-0.999E+03 | zspsc2 max err=-0.999E+03

======= End of spectral transforms  =======

max error zspvor(1:nlev,:)    = -0.999E+03
max error zspdiv(1:nlev,:)    = -0.999E+03
max error zspsc3a(1:nlev,:,1) = -0.999E+03
max error zspsc2(1:1,:)       = -0.999E+03

max error combined =          = -0.999E+03

======= Start of time step stats =======

Inverse transforms
------------------
avg  (s):   0.5508
min  (s):   0.5101
max  (s):   0.6185
med  (s):   0.5543

Direct transforms
-----------------
avg  (s):   0.5889
min  (s):   0.5428
max  (s):   0.6061
med  (s):   0.5934

Inverse-direct transforms
-------------------------
avg  (s):   1.1397
min  (s):   1.0529
max  (s):   1.1938
med  (s):   1.1437
loop (s):  22.8352

======= End of time step stats =======

===-=== START OF TIMING STATISTICS ===-===

STATS FOR ALL TASKS
 NUM ROUTINE                                     CALLS      MEAN(ms)       MAX(ms)   FRAC(%)  UNBAL(%)
   0 PROGRAM        - Total                          1     22959.078     22959.078    100.00      0.00
   1 SETUP_TRANS0   - Setup ecTrans                  1         0.167         0.167      0.00      0.00
   2 SETUP_TRANS    - Setup ecTrans handle           1        56.551        56.551      0.25      0.00
   3 TIME STEP      - Time step                     20      1141.762      1141.762     99.46      0.00
   4 INV_TRANS      - Inverse transform             20       550.749       550.749     47.98      0.00
   5 DIR_TRANS      - Direct transform              20       588.910       588.910     51.30      0.00
   6 NORMS          - Norm comp. (optional)         20         2.098         2.098      0.18      0.00
 102 LTINV_CTL      - Inv. Legendre transform       20        39.602        39.602      3.45      0.00
 103 LTDIR_CTL      - Dir. Legendre transform       20        42.603        42.603      3.71      0.00
 106 FTDIR_CTL      - Dir. Fourier transform        20       387.233       387.233     33.73      0.00
 107 FTINV_CTL      - Inv. Fourier transform        20       411.043       411.043     35.81      0.00
 140 SULEG          - Comp. of Leg. poly.            1        53.650        53.650      0.23      0.00
 152 LTINV_CTL      - M to L transposition          20         2.446         2.446      0.21      0.00
 153 LTDIR_CTL      - L to M transposition          20        11.656        11.656      1.02      0.00
 157 FTINV_CTL      - L to G transposition          20        93.652        93.652      8.16      0.00
 158 FTDIR_CTL      - G to L transposition          20       140.805       140.805     12.27      0.00
 400 GSTATS         - GSTATS itself               1497         0.001         0.001      0.00      0.00
TOTAL MEASURED IMBALANCE =       0.0 SECONDS,  0.0 PERCENT
TOTAL WALLCLOCK TIME      22.959 CPU TIME     22.959 VECTOR TIME      22.959

===-=== END   OF TIMING STATISTICS ===-===

 EC_MEMINFO@RNSORT: MINCID, MAXCID, NUMNODES =  0 0 1
## EC_MEMINFO
## EC_MEMINFO : MPI-version 3.1
## EC_MEMINFO : Start of MPI-library version
Open MPI v4.1.5, package: Open MPI amorvan@compute-dy-node-3 Distribution, ident: 4.1.5, repo rev: v4.1.5, Feb 23, 2023
## EC_MEMINFO : End of MPI-library version
## EC_MEMINFO : OpenMP-version 4.0.201307
## EC_MEMINFO : CPU-model :
## EC_MEMINFO : Hugepages : 2097152 bytes/page x 0 pages = 0 bytes
## EC_MEMINFO
## EC_MEMINFO
## EC_MEMINFO ********************************************************************************
## EC_MEMINFO *** Mapping of MPI & I/O-tasks to nodes and tasks' thread-to-core affinities ***
## EC_MEMINFO ********************************************************************************
## EC_MEMINFO
## EC_MEMINFO Running on 1 nodes (0-numa) with 1 compute + 0 I/O-tasks and 4+0 threads
## EC_MEMINFO
## EC_MEMINFO      # NODE#             NODENAME    MPI#  WORLD#  GETPID    I/O#  MASTER    REF#    OMP#  Core affinities
## EC_MEMINFO      = =====             ========    ====  ======  ======    ====  ======    ====    ====  ===============
## EC_MEMINFO
## EC_MEMINFO      0     0         ip-10-0-7-69       0       0     548    [No]     Yes       0       4  {0,0,0,0}
## EC_MEMINFO
## EC_MEMINFO                           | TC    | MEMORY USED(MB)  MEMORY FREE(MB) |  %USED %HUGE  | Energy  Power LoadAvg
## EC_MEMINFO                           | Malloc| Inc Heap        |                |               |    (J)    (W)
## EC_MEMINFO Node Name                 | Heap  | RSS(sum)        | Total          |
## EC_MEMINFO                           | (sum) | Small    Huge   | Memfree+Cached |
## EC_MEMINFO    0 ip-10-0-7-69               0     293       0       2312    3816      4.6   0.0         0      0    2.94  Sm/p:

Obviously the result is wrong, with a max error of -0.999E+03 =)

> If that doesn't work, is it a problem if you just have to disable -mcpu=native temporarily? The problem is almost certainly in here or here. To debug it I would just try to compare raw values of the relevant arrays between a run with and without -mcpu=native, and see where they begin to differ. But this might be time consuming and I don't think I will have time to find an ARM machine and do it myself for the time being...

You can reproduce this on the A64FX; you just need to set up ACFL 23.04.1. It's free to install and use nowadays.

I'll dump the arrays if I find some time :)

antoine-morvan commented 6 months ago

Hello,

FYI, I tried with the latest release of ACFL (24.04) and could not reproduce the issue with that version of the ARM compiler.

I'll let you try it out and close the issue if this works for you.

Best regards.

samhatfield commented 6 months ago

Hi Antoine, at the moment we do not have access to any machines with the ARM compiler (or at least, I don't personally). So we don't currently have a way to test this. I wish I had a Raspberry Pi right now... I think for now we can just close this issue.

antoine-morvan commented 6 months ago

If it's the ARM compiler, it's free, and you can install it in your $HOME.

If it's an ARM cpu, that's another matter :)

If you trust me enough, you can certainly close the issue ;)