earth-system-radiation / rte-rrtmgp

RTE+RRTMGP is a set of codes for computing radiative fluxes in planetary atmospheres.
BSD 3-Clause "New" or "Revised" License

Introduce a workaround for Intel Fortran Compiler Classic 2021 and later #170

Closed by skosukhin 2 years ago

skosukhin commented 2 years ago

This introduces a workaround for the vectorization problem reported in #159.

RobertPincus commented 2 years ago

Thanks @skosukhin. In principle I am willing to accept this patch in the short term. Do you think it would be possible to create a small test that fails if this patch is not applied and passes if it is? You could add this test in a separate PR. This would be great because it would give us a way of knowing if/when the patch is no longer needed.

skosukhin commented 2 years ago

@RobertPincus the existing tests seem to be enough (see https://github.com/earth-system-radiation/rte-rrtmgp/issues/159#issuecomment-1067983937). I have managed to reproduce it on Levante@DKRZ with the current develop. If we revert 9124ecc and then run:

cd rte-rrtmgp
module load intel-oneapi-compilers python3
export FC=ifort
export FCFLAGS='-m64 -O3 -g -traceback -heap-arrays -assume realloc_lhs -extend-source 132'
export NCHOME=/sw/spack-levante/netcdf-c-4.8.1-2k3cmu
export NFHOME=/sw/spack-levante/netcdf-fortran-4.5.3-k6xq5g
export LD_LIBRARY_PATH="$NCHOME/lib:$NFHOME/lib:$LD_LIBRARY_PATH"
export RRTMGP_ROOT=$(pwd)
ulimit -s unlimited

make libs tests check 

I get the following:

make[1]: Entering directory '/home/m/m300488/rte-rrtmgp/examples/rfmip-clear-sky'
# Files need to have been generated/downloaded before
./rrtmgp_rfmip_lw 8 multiple_input4MIPs_radiation_RFMIP_UColorado-RFMIP-1-2_none.nc /home/m/m300488/rte-rrtmgp/rrtmgp/data/rrtmgp-data-lw-g256-2018-12-04.nc
 Usage: rrtmgp_rfmip_lw [block_size] [rfmip_file] [k-distribution_file] [forcing
 _index (1,2,3)] [physics_index (1,2)]
 Doing          225 blocks of size            8
 Calculation uses RFMIP gases: h2o carbon_dioxide o3 nitrous_oxide 
 carbon_monoxide methane oxygen nitrogen carbon_tetrachloride cfc11 cfc12 
 hcfc22 hfc143a hfc125 hfc23 hfc32 hfc134a cf4 no2 
[levante5:3241085:0:3241085] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffc8052d1bb8)
==== backtrace (tid:3241085) ====
 0 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
 1 0x000000000044de0d interpolate3d_byflav()  /home/m/m300488/rte-rrtmgp/build/../rrtmgp/kernels/mo_gas_optics_kernels.F90:725
 2 0x000000000044de0d gas_optical_depths_major()  /home/m/m300488/rte-rrtmgp/build/../rrtmgp/kernels/mo_gas_optics_kernels.F90:318
 3 0x000000000044de0d compute_tau_absorption()  /home/m/m300488/rte-rrtmgp/build/../rrtmgp/kernels/mo_gas_optics_kernels.F90:220
 4 0x00000000004381ad mo_gas_optics_rrtmgp_mp_compute_gas_taus_()  /home/m/m300488/rte-rrtmgp/build/../rrtmgp/mo_gas_optics_rrtmgp.F90:653
 5 0x0000000000431d78 mo_gas_optics_rrtmgp_mp_gas_optics_int_()  /home/m/m300488/rte-rrtmgp/build/../rrtmgp/mo_gas_optics_rrtmgp.F90:259
 6 0x000000000040de67 MAIN__()  /home/m/m300488/rte-rrtmgp/examples/rfmip-clear-sky/rrtmgp_rfmip_lw.F90:270
 7 0x000000000040bde2 main()  ???:0
 8 0x0000000000023493 __libc_start_main()  ???:0
 9 0x000000000040bcee _start()  ???:0
=================================
make[1]: *** [Makefile:66: tests] Segmentation fault (core dumped)
make[1]: Leaving directory '/home/m/m300488/rte-rrtmgp/examples/rfmip-clear-sky'

It works if we either reduce the optimization level from -O3 to -O1 or apply the patch from this PR.
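The -O3 versus -O1 comparison above could be scripted roughly as follows. This is only a sketch: `check_opt_levels` is a hypothetical helper that is not part of rte-rrtmgp, and it assumes that the rest of the environment (modules, `NCHOME`, `NFHOME`, `RRTMGP_ROOT`) is already set up as in the script above and that the command passed in rebuilds the code from scratch for each level.

```shell
#!/bin/sh
# Sketch only: check_opt_levels is a hypothetical helper, not part of rte-rrtmgp.
# Usage: check_opt_levels COMMAND OPT_FLAG...
# Runs COMMAND once per optimization flag and reports whether it succeeded.
# COMMAND must force a full rebuild; how to clean the tree is left to the caller.
check_opt_levels() {
    cmd=$1
    shift
    for opt in "$@"; do
        # Same flags as in the reproduction script, with only the
        # optimization level varied.
        flags="-m64 $opt -g -traceback -heap-arrays -assume realloc_lhs -extend-source 132"
        if FCFLAGS="$flags" sh -c "$cmd" >/dev/null 2>&1; then
            printf '%s: pass\n' "$opt"
        else
            # A segmentation fault in the tests makes the command exit non-zero.
            printf '%s: FAIL\n' "$opt"
        fi
    done
}
```

With the build command from this thread, a call would look like `check_opt_levels 'make libs tests check' -O1 -O3`. A FAIL at -O3 together with a pass at -O1 would match the behaviour reported here.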

RobertPincus commented 2 years ago

@skosukhin Do you think there's any chance of automating this test, or of generally automating continuous testing on Levante? At CSCS we have a solution where a background process runs on a login node and polls GitHub for code updates, though CSCS is also moving to a more robust approach based on containers. Does DKRZ offer CI on Levante? Could it be made to work for codes hosted on GitHub?
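A polling process of the kind described for CSCS could be sketched as below. This is an illustration under assumptions, not the actual CSCS setup: `remote_head` and `is_new_commit` are hypothetical helpers, and the state-file path and polling interval in the example are arbitrary.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical, not an existing CSCS or DKRZ tool.

# Print the commit hash at the tip of branch $2 on remote $1.
remote_head() {
    git ls-remote "$1" "refs/heads/$2" | cut -f1
}

# Return 0 (and remember the hash) if hash $2 differs from the one
# stored in state file $1; return 1 otherwise.
is_new_commit() {
    state_file=$1
    hash=$2
    if [ -z "$hash" ]; then
        return 1    # lookup failed; do not trigger a run
    fi
    if [ -f "$state_file" ] && [ "$(cat "$state_file")" = "$hash" ]; then
        return 1    # this commit has already been tested
    fi
    printf '%s\n' "$hash" > "$state_file"
    return 0
}

# Example polling loop (interval and state file are arbitrary choices):
# while sleep 600; do
#     head=$(remote_head https://github.com/earth-system-radiation/rte-rrtmgp.git develop)
#     if is_new_commit "$HOME/.rte-rrtmgp-last-tested" "$head"; then
#         echo "new commit $head: run the test script here"
#     fi
# done
```

Keeping the "is this commit new" decision separate from the network lookup means the loop does nothing when GitHub is unreachable, rather than re-running the tests.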

skosukhin commented 1 year ago

@RobertPincus If I take the current develop, remove the trailing & here and run the following script on Levante@DKRZ, everything seems to work fine.

cd rte-rrtmgp
module load intel-oneapi-compilers python3
export FC=ifort
export FCFLAGS='-m64 -O3 -g -traceback -heap-arrays -assume realloc_lhs -extend-source 132'
export NCHOME=/sw/spack-levante/netcdf-c-4.8.1-2k3cmu
export NFHOME=/sw/spack-levante/netcdf-fortran-4.5.3-k6xq5g
export LD_LIBRARY_PATH="$NCHOME/lib:$NFHOME/lib:$LD_LIBRARY_PATH"
export RRTMGP_ROOT=$(pwd)
export RRTMGP_DATA="$(pwd)/rrtmgp-data"
export FAILURE_THRESHOLD='7.e-4'
ulimit -s unlimited

git clone --branch develop --depth 1 git@github.com:earth-system-radiation/rrtmgp-data.git

make libs tests check

$FC --version

If, however, I revert 9124ecc, I get the following error:

make[1]: Entering directory '/path/to/rte-rrtmgp/examples/rfmip-clear-sky'
cp /path/to/rte-rrtmgp/rrtmgp-data/examples/rfmip-clear-sky/inputs/multiple_input4MIPs_radiation_RFMIP_UColorado-RFMIP-1-2_none.nc . 
cp /path/to/rte-rrtmgp/rrtmgp-data/examples/rfmip-clear-sky/inputs/rld_Efx_RTE-RRTMGP-181204_rad-irf_r1i1p1f1_gn.nc . 
cp /path/to/rte-rrtmgp/rrtmgp-data/examples/rfmip-clear-sky/inputs/rlu_Efx_RTE-RRTMGP-181204_rad-irf_r1i1p1f1_gn.nc . 
cp /path/to/rte-rrtmgp/rrtmgp-data/examples/rfmip-clear-sky/inputs/rsd_Efx_RTE-RRTMGP-181204_rad-irf_r1i1p1f1_gn.nc . 
cp /path/to/rte-rrtmgp/rrtmgp-data/examples/rfmip-clear-sky/inputs/rsu_Efx_RTE-RRTMGP-181204_rad-irf_r1i1p1f1_gn.nc . 
./rrtmgp_rfmip_lw 8 multiple_input4MIPs_radiation_RFMIP_UColorado-RFMIP-1-2_none.nc /path/to/rte-rrtmgp/rrtmgp-data/rrtmgp-gas-lw-g256.nc
 Usage: rrtmgp_rfmip_lw [block_size] [rfmip_file] [k-distribution_file] [forcing
 _index (1,2,3)] [physics_index (1,2)]
 Doing          225 blocks of size            8
 Calculation uses RFMIP gases: h2o carbon_dioxide o3 nitrous_oxide 
 carbon_monoxide methane oxygen nitrogen carbon_tetrachloride cfc11 cfc12 
 hcfc22 hfc143a hfc125 hfc23 hfc32 hfc134a cf4 no2 
[levante6:1363338:0:1363338] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffc805262e98)
==== backtrace (tid:1363338) ====
 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x00000000004501ad interpolate3d_byflav()  /path/to/rte-rrtmgp/build/../rrtmgp-kernels/mo_gas_optics_rrtmgp_kernels.F90:759
 2 0x00000000004501ad gas_optical_depths_major()  /path/to/rte-rrtmgp/build/../rrtmgp-kernels/mo_gas_optics_rrtmgp_kernels.F90:376
 3 0x00000000004501ad compute_tau_absorption()  /path/to/rte-rrtmgp/build/../rrtmgp-kernels/mo_gas_optics_rrtmgp_kernels.F90:278
 4 0x0000000000438b9d mo_gas_optics_rrtmgp_mp_compute_gas_taus_()  /path/to/rte-rrtmgp/build/../rrtmgp-frontend/mo_gas_optics_rrtmgp.F90:676
 5 0x0000000000432758 mo_gas_optics_rrtmgp_mp_gas_optics_int_()  /path/to/rte-rrtmgp/build/../rrtmgp-frontend/mo_gas_optics_rrtmgp.F90:265
 6 0x000000000040de67 MAIN__()  /path/to/rte-rrtmgp/examples/rfmip-clear-sky/rrtmgp_rfmip_lw.F90:246
 7 0x000000000040bde2 main()  ???:0
 8 0x000000000003acf3 __libc_start_main()  ???:0
 9 0x000000000040bcee _start()  ???:0
=================================
make[1]: *** [Makefile:56: tests] Segmentation fault
make[1]: Leaving directory '/path/to/rte-rrtmgp/examples/rfmip-clear-sky'
make: *** [Makefile:14: tests] Error 2
ifort (IFORT) 2021.5.0 20211109
Copyright (C) 1985-2021 Intel Corporation.  All rights reserved.

Does this answer your question?

Regarding CI at DKRZ, I haven't heard of any support for GitHub. We could probably try to organize something via a GitLab mirror hosted at DKRZ, but unfortunately I won't have time for that in the foreseeable future. That said, we might not need Levante to reproduce this. Maybe the issue can be reproduced inside the Intel container? I am coming back from vacation in a couple of weeks and will also try to remember to check whether newer compiler versions are still affected.
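To keep track of which compiler versions are affected, the version number could be pulled out of the `$FC --version` output shown above. A minimal sketch, assuming the first line has the same `ifort (IFORT) <version> <build date>` layout as in the log; `ifort_version` is a hypothetical helper:

```shell
#!/bin/sh
# Sketch only: ifort_version is a hypothetical helper.
# Reads `ifort --version` output on stdin and prints the version number,
# assuming a first line of the form "ifort (IFORT) 2021.5.0 20211109".
ifort_version() {
    sed -n 's/^ifort (IFORT) \([0-9][0-9.]*\) .*/\1/p'
}
```

It would be used as `$FC --version | ifort_version`, and the result could be logged next to the pass/fail status of `make libs tests check`.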