ecmwf-ifs / ectrans

Global spherical harmonics transforms library underpinning the IFS
Apache License 2.0
15 stars 30 forks source link

Out of array boundaries accesses #108

Closed antoine-morvan closed 2 weeks ago

antoine-morvan commented 3 weeks ago

Hello,

I was investigating some segfault when ectrans is built Intel ifort, without the -heap-arrays 32 flags. After building both fiat & ectrans in debug mode, the benchmark utility crashes with the following error :

forrtl: severe (408): fort: (3): Subscript #1 of the array ZTSTEP has value 0 which is less than the lower bound of 1

Image              PC                Routine            Line        Source
ectrans-benchmark  000000000042B8E4  MAIN__                    859  ectrans-benchmark.F90
ectrans-benchmark  000000000040F48D  Unknown               Unknown  Unknown
libc-2.28.so       00007F497AC71D85  __libc_start_main     Unknown  Unknown
ectrans-benchmark  000000000040F3AE  Unknown               Unknown  Unknown
Command exited with non-zero status 152

This indicates that some accesses are made out of array boundaries.

To reproduce, simply give -DCMAKE_BUILD_TYPE=Debug to the CMake of fiat and ectrans. With an Intel suite, this leads to the following flags being used :

  c compiler      : Intel 2021.10.0.20230609
    flags         :  -O0 -g -traceback
  fortran compiler: Intel 2021.10.0.20230609
    flags         :  -fast-transcendentals -fp-model precise -fp-speculation=safe -O0 -g -traceback -heap-arrays 32 -check all

Note the -check allon the fortran line : https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/check.html

And the command line to start the benchmark to reproduce the error above :

export OMP_NUM_THREADS=1
mpirun -n 1 ectrans-benchmark-sp -n 1 -l 137 -t 319

Some earlier experiments, before adding the check flags, did show backtraces originating from other places, so I suscpect the one I report above is not the only one.

I tested this with

Best.

samhatfield commented 3 weeks ago

Thanks for this Antoine - I'll take a look next week.

samhatfield commented 2 weeks ago

Well done for spotting that bug @antoine-morvan. The way we were previously calculating the median was pretty dumb. The branch above should resolve this issue.