NOAA-GFDL / pace

Re-write of FV3GFS weather/climate model in Python
Apache License 2.0

large scale run failure with gt4py cpu backend #93

Open xyuan opened 2 weeks ago

xyuan commented 2 weeks ago

When we scale out the gt4py CPU backend using the pace/example test case up to 384 MPI ranks on Gaea, with or without OpenMP support, the run crashes with the following error:

/ncrc/home1/Xingqiu.Yuan/miniconda3/envs/py3119/lib/python3.11/site-packages/gridtools_cpp/data/include/gridtools/stencil/cpu_kfirst.hpp(78): error: no instance of overloaded function "gridtools::sid::shift" matches the argument list
    argument types are: (ptr_diff_t, gridtools::sid::default_stride, )
    sid::shift(
    ^
/ncrc/home1/Xingqiu.Yuan/miniconda3/envs/py3119/lib/python3.11/site-packages/gridtools_cpp/data/include/gridtools/stencil/frontend/../../sid/concept.hpp(658): note: this candidate was rejected because at least one template argument could not be deduced
    using conceptimpl::shift;

With OpenMP support, we get the following error:

/ncrc/home1/Xingqiu.Yuan/miniconda3/envs/py3119/lib/python3.11/site-packages/gridtools_cpp/data/include/gridtools/stencil/cpu_ifirst/loops.hpp(131): warning #16219: Some OpenMP processing was skipped to constrain compile time. Consider overriding limits (-qoverride-limits).
srun: error: c5n0890: task 90: Exited with exit code 1
srun: error: c5n1563: task 260: Killed
srun: error: c5n0890: task 111: Exited with exit code 1
srun: error: c5n1563: tasks 279,282: Killed
srun: error: c5n0890: task 98: Exited with exit code 1
srun: error: c5n1563: tasks 258,261,264,267,272,274,287,294,297: Killed
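
To make the failure pattern easier to read, the srun messages above can be grouped by node and exit status. The following is a minimal sketch, not part of pace or gt4py: the names `SRUN_ERROR`, `expand_tasks`, and `summarize` are hypothetical helpers, and the handling of task ranges such as `258-261` is an assumption about how srun may abbreviate task lists (only comma-separated lists appear in the log above).

```python
import re
from collections import defaultdict

# Matches lines such as "srun: error: c5n1563: tasks 279,282: Killed".
SRUN_ERROR = re.compile(
    r"srun: error: (?P<node>[^\s:]+): tasks? (?P<tasks>[\d,\-]+): (?P<status>.+)"
)

# The srun lines copied from the log above.
LOG = """\
srun: error: c5n0890: task 90: Exited with exit code 1
srun: error: c5n1563: task 260: Killed
srun: error: c5n0890: task 111: Exited with exit code 1
srun: error: c5n1563: tasks 279,282: Killed
srun: error: c5n0890: task 98: Exited with exit code 1
srun: error: c5n1563: tasks 258,261,264,267,272,274,287,294,297: Killed
"""


def expand_tasks(spec):
    """Turn a task list like "258,261" into a list of ints.

    Ranges such as "258-261" are also expanded (assumed syntax).
    """
    tasks = []
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            tasks.extend(range(int(low), int(high) + 1))
        else:
            tasks.append(int(part))
    return tasks


def summarize(log_text):
    """Group failed task IDs by (node, status)."""
    summary = defaultdict(list)
    for line in log_text.splitlines():
        match = SRUN_ERROR.search(line)
        if match:
            key = (match.group("node"), match.group("status").strip())
            summary[key].extend(expand_tasks(match.group("tasks")))
    return {key: sorted(tasks) for key, tasks in summary.items()}


if __name__ == "__main__":
    for (node, status), tasks in sorted(summarize(LOG).items()):
        print(f"{node}: {status}: {len(tasks)} task(s) {tasks}")
```

For this log, the grouping shows that every failure on c5n0890 is "Exited with exit code 1" and every failure on c5n1563 is "Killed".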

However, when the same case is run with the DaCe backend, it works fine.

System environment: the modules used for the test are listed below.

(base) Xingqiu.Yuan@gaea56:/gpfs/f5/gfdl_f/scratch/Xingqiu.Yuan/pace> module list

Currently Loaded Modules:
  1) craype-x86-rome
  2) craype-network-ofi
  3) perftools-base/23.03.0
  4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta
  5) craype/2.7.20
  6) cray-dsmml/0.2.2
  7) cray-mpich/8.1.25
  8) cray-libsci/23.02.1.1
  9) PrgEnv-intel/8.3.3
 10) cray-pmi/6.1.10
 11) darshan-runtime/3.4.0
 12) CmrsEnv/default
 13) TimeZoneEDT/default
 14) DefApps/default
 15) nccmp/1.9.0.1
 16) nco/5.0.1
 17) fre-nctools/2024.03
 18) gridcf-gct/6.2.20220524
 19) uberftp/2_8
 20) gcp/2.3
 21) hsm/1.3.0
 22) perlbrew/5.28.0
 23) fre/bronx-22
 24) cray-hdf5/1.12.2.3
 25) cray-netcdf/4.9.0.3
 26) intel-oneapi/2023.1.0
 27) boost/1.79.0

When we switch to the GCC compiler, we get a similar error.