RussTreadon-NOAA opened this issue 4 weeks ago
The MSU helpdesk replied. They have not experienced any slowdown in executable wall times following the Rocky 9 upgrade. There was one MSU user whose jobs ran slower, but this was traced to an errant environment variable.
The GSI build uses the following modulefiles to build gsi.x and enkf.x on Orion.
Were all these modules recompiled for Orion Rocky 9? Was there any change in build options? Do all the modules perform as efficiently on Rocky 9 as they did on Centos 7?
We're using the same set of versions/build options as before. The only difference that jumps out at me is the change in compiler and MPI. You could try the 'unified-env-intel-2023.2.4' environment (also under spack-stack-1.6.0 on Orion) and see if that makes any difference: /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-intel-2023.2.4/install/modulefiles/Core Do you have any way of getting insight into whether it could be filesystem vs. network fabric vs...? Such as any info on timing within jobs that could be compared with pre-Rocky runs?
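For reference, a minimal sketch of how the suggested environment could be inspected before rebuilding, assuming Lmod is in use; the module names queried are only examples:

```bash
# Point Lmod at the suggested spack-stack environment and check what it provides.
module use /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-intel-2023.2.4/install/modulefiles/Core
module spider stack-intel            # list the stack-intel versions available in this environment
module avail netcdf-fortran cmake    # confirm the versions the GSI modulefiles expect
```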
The GSI does not build using /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-intel-2023.2.4/install/modulefiles/Core. The GSI requires the gsi-addon-env to build. I confirmed this by trying unified-env-intel-2023.2.4. The module load failed with
orion-login-3:/work2/noaa/da/rtreadon/git/gsi/develop/modulefiles$ module load gsi_orion.intel.lua
Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as
requested: "stack-intel/2021.9.0", "netcdf-fortran/4.6.1", "cmake/3.23.1", "python/3.11.6"
Try: "module spider stack-intel/2021.9.0 netcdf-fortran/4.6.1 cmake/3.23.1 python/3.11.6" to see how to load the
module(s).
Executing this command requires loading "stack-intel/2021.9.0" which failed while processing the following
module(s):
Module fullname Module Filename
--------------- ---------------
gsi_orion.intel /work2/noaa/da/rtreadon/git/gsi/develop/modulefiles/gsi_orion.intel.lua

Executing this command requires loading "python/3.11.6" which failed while processing the following module(s):
Module fullname Module Filename
--------------- ---------------
gsi_orion.intel /work2/noaa/da/rtreadon/git/gsi/develop/modulefiles/gsi_orion.intel.lua

Executing this command requires loading "cmake/3.23.1" which failed while processing the following module(s):
Module fullname Module Filename
--------------- ---------------
gsi_orion.intel /work2/noaa/da/rtreadon/git/gsi/develop/modulefiles/gsi_orion.intel.lua

Executing this command requires loading "netcdf-fortran/4.6.1" which failed while processing the following module(s):
Module fullname Module Filename
--------------- ---------------
gsi_common /work2/noaa/da/rtreadon/git/gsi/develop/modulefiles/gsi_common.lua
gsi_orion.intel /work2/noaa/da/rtreadon/git/gsi/develop/modulefiles/gsi_orion.intel.lua
We did not profile gsi.x and enkf.x on Orion Centos 7. Given that gsi.x and enkf.x run efficiently on Hercules, we did not expect slower-running executables on Orion Rocky 9.
Watching stdout as gsi.x runs suggests that the slowdown occurs while reading either bufr observation dump files or netcdf background files (both read in parallel). This, however, is a qualitative statement. We do not have quantitative performance statistics.
Right, sorry, forgot about the addon environment. Please try /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-intel-2023.2.4/install/modulefiles/Core Note that the compiler and MPI versions will need to be updated to '2021.10.0'.
@AlexanderRichert-NOAA, the following modifications were made to a copy of gsi_orion.intel.lua:
@@ -1,11 +1,12 @@
help([[
]])
-prepend_path("MODULEPATH", "/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-roc
ky9/install/modulefiles/Core")
+--prepend_path("MODULEPATH", "/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-r
ocky9/install/modulefiles/Core")
+prepend_path("MODULEPATH", "/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-int
el-2023.2.4/install/modulefiles/Core")
local stack_python_ver=os.getenv("stack_python_ver") or "3.11.6"
-local stack_intel_ver=os.getenv("stack_intel_ver") or "2021.9.0"
-local stack_impi_ver=os.getenv("stack_impi_ver") or "2021.9.0"
+local stack_intel_ver=os.getenv("stack_intel_ver") or "2021.10.0"
+local stack_impi_ver=os.getenv("stack_impi_ver") or "2021.10.0"
local cmake_ver=os.getenv("cmake_ver") or "3.23.1"
local prod_util_ver=os.getenv("prod_util_ver") or "2.1.1"
However, module load gsi_orion.intel fails with
orion-login-4:/work/noaa/da/rtreadon/git/gsi/test_orion/modulefiles$ module load gsi_orion.intel
Lmod has detected the following error: Cannot load module "stack-intel/2021.10.0" without these
module(s) loaded:
icc/2023.2.4
While processing the following module(s):
Module fullname Module Filename
--------------- ---------------
stack-intel/2021.10.0 /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-intel-2023.2.4/install/modulefiles/Core/stack-intel/2021.10.0.lua
gsi_orion.intel /work/noaa/da/rtreadon/git/gsi/test_orion/modulefiles/gsi_orion.intel.lua
Adding load("icc/2023.2.4")
before the stack-intel
load did not help. module spider icc
returns Unable to find: "icc".
Added timers to the GSI source code. A comparison of Orion and Hercules gsi.x timings shows the greatest wall time difference when processing observations from bufr files:
Orion
timer stats: read_obs:finish read obs loop avg 142.585 min 66.6217 max 551.629
Hercules
timer stats: read_obs:finish read obs loop avg 37.0401 min 18.3169 max 144.775
Sorry about that, please try again. I forgot to add a line to the stack-intel module.
@AlexanderRichert-NOAA, got a bit further but still encountered an error. Below is what I see:
orion-login-2:/work/noaa/da/rtreadon/git/gsi/test_orion/ush$ ./build.sh
++++ readlink -f -n ./build.sh
+++ dirname /work/noaa/da/rtreadon/git/gsi/test_orion/ush/build.sh
++ cd /work/noaa/da/rtreadon/git/gsi/test_orion/ush/..
++ pwd -P
+ readonly DIR_ROOT=/work/noaa/da/rtreadon/git/gsi/test_orion
+ DIR_ROOT=/work/noaa/da/rtreadon/git/gsi/test_orion
+ BUILD_VERBOSE=YES
+ BUILD_TYPE=Release
+ CMAKE_OPTS=
+ COMPILER=intel
+ BUILD_DIR=/work/noaa/da/rtreadon/git/gsi/test_orion/build
+ INSTALL_PREFIX=/work/noaa/da/rtreadon/git/gsi/test_orion/install
+ GSI_MODE=Regional
+ ENKF_MODE=GFS
+ REGRESSION_TESTS=YES
+ source /work/noaa/da/rtreadon/git/gsi/test_orion/ush/detect_machine.sh
++ [[ -n '' ]]
++ case $(hostname -f) in
+++ hostname -f
++ MACHINE_ID=orion
++ [[ orion == \U\N\K\N\O\W\N ]]
++ MACHINE_ID=orion
++ [[ orion != \U\N\K\N\O\W\N ]]
++ return
+ set +x
Loading icc version 2023.2.4
Loading compiler-rt version 2023.2.4
Currently Loaded Modules:
1) compiler-rt/2023.2.4 12) sqlite/3.43.2 23) openblas/0.3.24 34) wrf-io/1.2.0
2) icc/2023.2.4 13) util-linux-uuid/2.38.1 24) py-setuptools/63.4.3 35) ncio/1.1.2
3) stack-intel/2021.10.0 14) python/3.11.6 25) py-numpy/1.23.4 36) crtm-fix/2.4.0.1_emc
4) mpi/2021.10.0 15) nghttp2/1.57.0 26) bufr/11.7.0 37) git-lfs/3.1.2
5) stack-intel-oneapi-mpi/2021.10.0 16) curl/8.4.0 27) bacio/2.4.1 38) crtm/2.4.0.1
6) zlib/1.2.13 17) cmake/3.23.1 28) w3emc/2.10.0 39) gsi-ncdiag/1.1.2
7) pigz/2.7 18) snappy/1.1.10 29) sp/2.5.0 40) gsi_common
8) zstd/1.5.2 19) c-blosc/1.21.5 30) ip/4.3.0 41) prod_util/2.1.1
9) tar/1.34 20) hdf5/1.14.0 31) sigio/2.3.2 42) intel-oneapi-mkl/2022.2.1
10) gettext/0.21.1 21) netcdf-c/4.9.2 32) sfcio/1.4.1 43) gsi_orion.intel
11) libxcrypt/4.4.35 22) netcdf-fortran/4.6.1 33) nemsio/2.5.4
+ CONTROLPATH=/work/noaa/da/rtreadon/git/gsi/test_orion/../develop/install/bin
+ CMAKE_OPTS+=' -DCMAKE_BUILD_TYPE=Release'
+ CMAKE_OPTS+=' -DCMAKE_INSTALL_PREFIX=/work/noaa/da/rtreadon/git/gsi/test_orion/install'
+ CMAKE_OPTS+=' -DGSI_MODE=Regional -DENKF_MODE=GFS'
+ [[ YES =~ [yYtT] ]]
+ CMAKE_OPTS+=' -DBUILD_REG_TESTING=ON -DCONTROLPATH=/work/noaa/da/rtreadon/git/gsi/test_orion/../develop/install/bin'
+ [[ YES =~ [yYtT] ]]
+ rm -rf /work/noaa/da/rtreadon/git/gsi/test_orion/build
+ mkdir -p /work/noaa/da/rtreadon/git/gsi/test_orion/build
+ cd /work/noaa/da/rtreadon/git/gsi/test_orion/build
+ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/work/noaa/da/rtreadon/git/gsi/test_orion/install -DGSI_MODE=Regional -DENKF_MODE=GFS -DBUILD_REG_TESTING=ON -DCONTROLPATH=/work/noaa/da/rtreadon/git/gsi/test_orion/../develop/install/bin /work/noaa/da/rtreadon/git/gsi/test_orion
-- The C compiler identification is Intel 2021.10.0.20230609
-- The Fortran compiler identification is Intel 2021.10.0.20230609
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /work/noaa/epic/role-epic/spack-stack/orion/compilers/oneapi/compiler/2023.2.4/linux/bin/intel64/icc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /work/noaa/epic/role-epic/spack-stack/orion/compilers/oneapi/compiler/2023.2.4/linux/bin/intel64/ifort - skipped
-- OPENMP ................. OFF
-- ENABLE_MKL ............. ON
-- BUILD_GSDCLOUD ......... OFF
-- BUILD_MGBF ............. ON
-- BUILD_GSI .............. ON
-- BUILD_ENKF ............. ON
-- BUILD_REG_TESTING ...... ON
-- Building MGBF library
-- Building GSI
-- GSI: OPENMP ................. OFF
-- GSI: ENABLE_MKL ............. ON
-- GSI: USE_GSDCLOUD ........... OFF
-- GSI: USE_MGBF ............... ON
-- GSI: GSI_MODE ............... Regional
-- Found MKL: /apps/spack-managed/gcc-11.3.1/intel-oneapi-mkl-2022.2.1-ifnryw36ggh5z5fxmdn2w3n5syxhvnr5/mkl/2022.2.1/lib/intel64/libmkl_intel_lp64.so;/apps/spack-managed/gcc-11.3.1/intel-oneapi-mkl-2022.2.1-ifnryw36ggh5z5fxmdn2w3n5syxhvnr5/mkl/2022.2.1/lib/intel64/libmkl_sequential.so;/apps/spack-managed/gcc-11.3.1/intel-oneapi-mkl-2022.2.1-ifnryw36ggh5z5fxmdn2w3n5syxhvnr5/mkl/2022.2.1/lib/intel64/libmkl_core.so
-- Checking for module 'mpi-c'
-- Package 'mpi-c', required by 'virtual:world', not found
-- Could NOT find MPI_C (missing: MPI_C_LIB_NAMES MPI_C_HEADER_DIR MPI_C_WORKS)
-- Checking for module 'mpi-fort'
-- Package 'mpi-fort', required by 'virtual:world', not found
-- Could NOT find MPI_Fortran (missing: MPI_Fortran_LIB_NAMES MPI_Fortran_F77_HEADER_DIR MPI_Fortran_MODULE_DIR MPI_Fortran_WORKS)
CMake Error at /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-intel-2023.2.4/install/intel/2021.10.0/cmake-3.23.1-ovxmp4p/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
Could NOT find MPI (missing: MPI_C_FOUND MPI_Fortran_FOUND)
Call Stack (most recent call first):
/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-intel-2023.2.4/install/intel/2021.10.0/cmake-3.23.1-ovxmp4p/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-intel-2023.2.4/install/intel/2021.10.0/cmake-3.23.1-ovxmp4p/share/cmake-3.23/Modules/FindMPI.cmake:1830 (find_package_handle_standard_args)
src/gsi/CMakeLists.txt:60 (find_package)
-- Configuring incomplete, errors occurred!
See also "/work/noaa/da/rtreadon/git/gsi/test_orion/build/CMakeFiles/CMakeOutput.log".
See also "/work/noaa/da/rtreadon/git/gsi/test_orion/build/CMakeFiles/CMakeError.log".
The modified module file is orion-login-2:/work/noaa/da/rtreadon/git/gsi/test_orion/modulefiles/gsi_orion.intel.lua. It contains the following local changes
@@ -1,11 +1,12 @@
help([[
]])
-prepend_path("MODULEPATH", "/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-rocky9/install/modulefiles/Core")
+--prepend_path("MODULEPATH", "/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-rocky9/install/modulefiles/Core")
+prepend_path("MODULEPATH", "/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-intel-2023.2.4/install/modulefiles/Co
re")
local stack_python_ver=os.getenv("stack_python_ver") or "3.11.6"
-local stack_intel_ver=os.getenv("stack_intel_ver") or "2021.9.0"
-local stack_impi_ver=os.getenv("stack_impi_ver") or "2021.9.0"
+local stack_intel_ver=os.getenv("stack_intel_ver") or "2021.10.0"
+local stack_impi_ver=os.getenv("stack_impi_ver") or "2021.10.0"
local cmake_ver=os.getenv("cmake_ver") or "3.23.1"
local prod_util_ver=os.getenv("prod_util_ver") or "2.1.1"
dclock calls added to read_obs.F90 indicate that reading bufr files on Orion takes 3 to 4x longer than reading the same bufr files on Hercules. Why would processing bufr files be 3 to 4x slower on Orion than on Hercules?
Below are wall times in seconds for the indicated bufr observation reads on Orion and Hercules.
Orion
21: timer stats: READ_CRIS 28.6474
22: timer stats: READ_CRIS 28.6474
26: timer stats: READ_CRIS 28.6474
23: timer stats: READ_CRIS 28.6474
29: timer stats: READ_CRIS 28.6474
16: timer stats: READ_CRIS 28.6468
31: timer stats: READ_CRIS 28.6474
17: timer stats: READ_CRIS 28.6474
24: timer stats: READ_CRIS 28.6474
18: timer stats: READ_CRIS 28.6474
25: timer stats: READ_CRIS 28.6474
19: timer stats: READ_CRIS 28.6474
27: timer stats: READ_CRIS 28.6474
20: timer stats: READ_CRIS 28.6476
28: timer stats: READ_CRIS 28.6474
30: timer stats: READ_CRIS 28.6474
63: timer stats: READ_IASI 54.4328
62: timer stats: READ_IASI 54.4329
55: timer stats: READ_IASI 54.4332
61: timer stats: READ_IASI 54.4330
59: timer stats: READ_IASI 54.4332
60: timer stats: READ_IASI 54.4330
49: timer stats: READ_IASI 54.4333
50: timer stats: READ_IASI 54.4332
51: timer stats: READ_IASI 54.4333
52: timer stats: READ_IASI 54.4333
53: timer stats: READ_IASI 54.4333
54: timer stats: READ_IASI 54.4332
56: timer stats: READ_IASI 54.4332
57: timer stats: READ_IASI 54.4332
58: timer stats: READ_IASI 54.4331
48: timer stats: READ_IASI 54.4341
15: timer stats: READ_IASI 57.7201
14: timer stats: READ_IASI 57.7204
13: timer stats: READ_IASI 57.7207
12: timer stats: READ_IASI 57.7209
10: timer stats: READ_IASI 57.7245
11: timer stats: READ_IASI 57.7212
8: timer stats: READ_IASI 57.7249
9: timer stats: READ_IASI 57.7249
1: timer stats: READ_IASI 57.7212
2: timer stats: READ_IASI 57.7213
3: timer stats: READ_IASI 57.7241
4: timer stats: READ_IASI 57.7241
5: timer stats: READ_IASI 57.7245
6: timer stats: READ_IASI 57.7250
7: timer stats: READ_IASI 57.7241
0: timer stats: READ_IASI 57.7223
23: timer stats: READ_AVHRR 40.0085
22: timer stats: READ_AVHRR 40.0086
19: timer stats: READ_AVHRR 40.0086
21: timer stats: READ_AVHRR 40.0086
18: timer stats: READ_AVHRR 40.0086
20: timer stats: READ_AVHRR 40.0086
17: timer stats: READ_AVHRR 40.0086
16: timer stats: READ_AVHRR 40.0086
31: timer stats: READ_AVHRR 45.7324
29: timer stats: READ_AVHRR 45.7325
26: timer stats: READ_AVHRR 45.7326
27: timer stats: READ_AVHRR 45.7325
30: timer stats: READ_AVHRR 45.7325
28: timer stats: READ_AVHRR 45.7327
25: timer stats: READ_AVHRR 45.7328
24: timer stats: READ_AVHRR 45.7328
47: timer stats: READ_CRIS 87.6466
46: timer stats: READ_CRIS 87.6467
33: timer stats: READ_CRIS 87.6470
45: timer stats: READ_CRIS 87.6468
34: timer stats: READ_CRIS 87.6469
44: timer stats: READ_CRIS 87.6468
35: timer stats: READ_CRIS 87.6469
43: timer stats: READ_CRIS 87.6468
42: timer stats: READ_CRIS 87.6470
41: timer stats: READ_CRIS 87.6469
40: timer stats: READ_CRIS 87.6470
38: timer stats: READ_CRIS 87.6470
39: timer stats: READ_CRIS 87.6470
36: timer stats: READ_CRIS 87.6470
37: timer stats: READ_CRIS 87.6472
32: timer stats: READ_CRIS 87.6475
95: timer stats: READ_VIIRS 119.945
94: timer stats: READ_VIIRS 119.945
93: timer stats: READ_VIIRS 119.946
91: timer stats: READ_VIIRS 119.946
90: timer stats: READ_VIIRS 119.946
89: timer stats: READ_VIIRS 119.946
81: timer stats: READ_VIIRS 119.946
86: timer stats: READ_VIIRS 119.946
82: timer stats: READ_VIIRS 119.946
92: timer stats: READ_VIIRS 119.946
83: timer stats: READ_VIIRS 119.946
85: timer stats: READ_VIIRS 119.946
88: timer stats: READ_VIIRS 119.946
87: timer stats: READ_VIIRS 119.946
84: timer stats: READ_VIIRS 119.946
80: timer stats: READ_VIIRS 119.946
93: timer stats: READ_OZONE 0.353938
79: timer stats: READ_VIIRS 121.121
78: timer stats: READ_VIIRS 121.122
76: timer stats: READ_VIIRS 121.122
75: timer stats: READ_VIIRS 121.122
77: timer stats: READ_VIIRS 121.122
74: timer stats: READ_VIIRS 121.122
71: timer stats: READ_VIIRS 121.122
73: timer stats: READ_VIIRS 121.122
72: timer stats: READ_VIIRS 121.122
67: timer stats: READ_VIIRS 121.122
68: timer stats: READ_VIIRS 121.122
66: timer stats: READ_VIIRS 121.122
69: timer stats: READ_VIIRS 121.122
70: timer stats: READ_VIIRS 121.122
65: timer stats: READ_VIIRS 121.122
64: timer stats: READ_VIIRS 121.123
85: timer stats: READ_OZONE 1.62620
89: timer stats: READ_OZONE 1.98980
94: timer stats: READ_OZONE 2.11263
81: timer stats: READ_BUFRTOVS 2.68223
75: timer stats: READ_PREPBUFR 4.38559
79: timer stats: READ_NSSTBUFR 5.04728
47: timer stats: READ_AVHRR 40.6250
46: timer stats: READ_AVHRR 40.6249
45: timer stats: READ_AVHRR 40.6250
44: timer stats: READ_AVHRR 40.6249
42: timer stats: READ_AVHRR 40.6249
43: timer stats: READ_AVHRR 40.6249
41: timer stats: READ_AVHRR 40.6248
40: timer stats: READ_AVHRR 40.6250
82: timer stats: READ_BUFRTOVS 8.67870
86: timer stats: READ_BUFRTOVS 8.98850
65: timer stats: READ_BUFRTOVS 8.09661
64: timer stats: READ_BUFRTOVS 8.09578
69: timer stats: READ_BUFRTOVS 8.43694
68: timer stats: READ_BUFRTOVS 8.43661
67: timer stats: READ_BUFRTOVS 8.61997
66: timer stats: READ_BUFRTOVS 8.60470
80: timer stats: READ_GPS 9.92920
36: timer stats: READ_AVHRR 43.3248
39: timer stats: READ_AVHRR 43.3251
35: timer stats: READ_AVHRR 43.3250
38: timer stats: READ_AVHRR 43.3252
33: timer stats: READ_AVHRR 43.3251
37: timer stats: READ_AVHRR 43.3253
34: timer stats: READ_AVHRR 43.3251
32: timer stats: READ_AVHRR 43.3242
88: timer stats: READ_BUFRTOVS 11.3951
74: timer stats: READ_PREPBUFR 11.5022
77: timer stats: READ_PREPBUFR 13.0966
72: timer stats: READ_PREPBUFR 13.3955
15: timer stats: READ_CRIS 85.2315
14: timer stats: READ_CRIS 85.2314
12: timer stats: READ_CRIS 85.2309
13: timer stats: READ_CRIS 85.2311
6: timer stats: READ_CRIS 85.2307
7: timer stats: READ_CRIS 85.2307
8: timer stats: READ_CRIS 85.2307
9: timer stats: READ_CRIS 85.2307
10: timer stats: READ_CRIS 85.2307
11: timer stats: READ_CRIS 85.2307
1: timer stats: READ_CRIS 85.2307
2: timer stats: READ_CRIS 85.2307
3: timer stats: READ_CRIS 85.2307
4: timer stats: READ_CRIS 85.2307
5: timer stats: READ_CRIS 85.2307
0: timer stats: READ_CRIS 85.2297
1: timer stats: READ_OZONE 0.433676
73: timer stats: READ_PREPBUFR 23.1917
70: timer stats: READ_PREPBUFR 23.2368
71: timer stats: READ_PREPBUFR 23.2690
7: timer stats: READ_OZONE 1.97127
3: timer stats: READ_BUFRTOVS 9.23574
76: timer stats: READ_PREPBUFR 36.9617
62: timer stats: READ_ABI 111.597
58: timer stats: READ_ABI 111.597
60: timer stats: READ_ABI 111.598
59: timer stats: READ_ABI 111.597
63: timer stats: READ_ABI 111.598
57: timer stats: READ_ABI 111.597
61: timer stats: READ_ABI 111.598
56: timer stats: READ_ABI 111.597
50: timer stats: READ_ABI 116.369
51: timer stats: READ_ABI 116.369
52: timer stats: READ_ABI 116.369
53: timer stats: READ_ABI 116.369
54: timer stats: READ_ABI 116.369
55: timer stats: READ_ABI 116.369
49: timer stats: READ_ABI 116.369
48: timer stats: READ_ABI 116.366
4: timer stats: READ_AHI 30.5457
2: timer stats: READ_OZONE 35.9112
0: timer stats: READ_OZONE 35.9427
6: timer stats: READ_OZONE 36.9671
91: timer stats: READ_ATMS 115.133
90: timer stats: READ_ATMS 116.143
5: timer stats: READ_ATMS 115.214
84: timer stats: READ_SSMIS 155.442
83: timer stats: READ_SSMIS 172.926
92: timer stats: READ_GMI 318.892
78: timer stats: READ_SATWND 435.596
Hercules
25: timer stats: READ_CRIS 7.89926
18: timer stats: READ_CRIS 7.89897
24: timer stats: READ_CRIS 7.89927
20: timer stats: READ_CRIS 7.89895
26: timer stats: READ_CRIS 7.89927
23: timer stats: READ_CRIS 7.89895
27: timer stats: READ_CRIS 7.89926
16: timer stats: READ_CRIS 7.89836
28: timer stats: READ_CRIS 7.89924
17: timer stats: READ_CRIS 7.89899
29: timer stats: READ_CRIS 7.89933
19: timer stats: READ_CRIS 7.89900
30: timer stats: READ_CRIS 7.89930
21: timer stats: READ_CRIS 7.89901
31: timer stats: READ_CRIS 7.89931
22: timer stats: READ_CRIS 7.89900
63: timer stats: READ_IASI 14.2766
58: timer stats: READ_IASI 14.2768
60: timer stats: READ_IASI 14.2767
59: timer stats: READ_IASI 14.2769
61: timer stats: READ_IASI 14.2767
56: timer stats: READ_IASI 14.2770
62: timer stats: READ_IASI 14.2766
53: timer stats: READ_IASI 14.2770
54: timer stats: READ_IASI 14.2770
55: timer stats: READ_IASI 14.2770
57: timer stats: READ_IASI 14.2769
49: timer stats: READ_IASI 14.2771
50: timer stats: READ_IASI 14.2770
51: timer stats: READ_IASI 14.2771
52: timer stats: READ_IASI 14.2771
48: timer stats: READ_IASI 14.2615
15: timer stats: READ_IASI 15.1522
14: timer stats: READ_IASI 15.1524
13: timer stats: READ_IASI 15.1525
12: timer stats: READ_IASI 15.1524
10: timer stats: READ_IASI 15.1559
11: timer stats: READ_IASI 15.1526
1: timer stats: READ_IASI 15.1526
2: timer stats: READ_IASI 15.1527
3: timer stats: READ_IASI 15.1552
4: timer stats: READ_IASI 15.1559
5: timer stats: READ_IASI 15.1554
6: timer stats: READ_IASI 15.1552
7: timer stats: READ_IASI 15.1555
8: timer stats: READ_IASI 15.1561
9: timer stats: READ_IASI 15.1555
0: timer stats: READ_IASI 15.1533
23: timer stats: READ_AVHRR 10.2355
19: timer stats: READ_AVHRR 10.2356
21: timer stats: READ_AVHRR 10.2355
22: timer stats: READ_AVHRR 10.2355
17: timer stats: READ_AVHRR 10.2356
18: timer stats: READ_AVHRR 10.2356
20: timer stats: READ_AVHRR 10.2356
16: timer stats: READ_AVHRR 10.2356
31: timer stats: READ_AVHRR 11.8628
26: timer stats: READ_AVHRR 11.8629
27: timer stats: READ_AVHRR 11.8629
28: timer stats: READ_AVHRR 11.8629
29: timer stats: READ_AVHRR 11.8629
30: timer stats: READ_AVHRR 11.8628
25: timer stats: READ_AVHRR 11.8630
24: timer stats: READ_AVHRR 11.8553
47: timer stats: READ_CRIS 23.1123
34: timer stats: READ_CRIS 23.1128
44: timer stats: READ_CRIS 23.1125
35: timer stats: READ_CRIS 23.1128
45: timer stats: READ_CRIS 23.1125
33: timer stats: READ_CRIS 23.1130
46: timer stats: READ_CRIS 23.1124
41: timer stats: READ_CRIS 23.1127
42: timer stats: READ_CRIS 23.1126
43: timer stats: READ_CRIS 23.1125
36: timer stats: READ_CRIS 23.1127
37: timer stats: READ_CRIS 23.1127
38: timer stats: READ_CRIS 23.1128
39: timer stats: READ_CRIS 23.1127
40: timer stats: READ_CRIS 23.1127
32: timer stats: READ_CRIS 23.1135
79: timer stats: READ_VIIRS 30.1726
76: timer stats: READ_VIIRS 30.1728
65: timer stats: READ_VIIRS 30.1726
77: timer stats: READ_VIIRS 30.1727
71: timer stats: READ_VIIRS 30.1727
78: timer stats: READ_VIIRS 30.1727
75: timer stats: READ_VIIRS 30.1727
66: timer stats: READ_VIIRS 30.1727
72: timer stats: READ_VIIRS 30.1729
67: timer stats: READ_VIIRS 30.1726
74: timer stats: READ_VIIRS 30.1729
68: timer stats: READ_VIIRS 30.1726
73: timer stats: READ_VIIRS 30.1730
69: timer stats: READ_VIIRS 30.1727
70: timer stats: READ_VIIRS 30.1727
64: timer stats: READ_VIIRS 30.1731
92: timer stats: READ_VIIRS 30.5467
89: timer stats: READ_VIIRS 30.5468
90: timer stats: READ_VIIRS 30.5468
91: timer stats: READ_VIIRS 30.5468
93: timer stats: READ_VIIRS 30.5468
94: timer stats: READ_VIIRS 30.5467
95: timer stats: READ_VIIRS 30.5469
84: timer stats: READ_VIIRS 30.5470
85: timer stats: READ_VIIRS 30.5469
86: timer stats: READ_VIIRS 30.5469
87: timer stats: READ_VIIRS 30.5469
88: timer stats: READ_VIIRS 30.5468
81: timer stats: READ_VIIRS 30.5471
82: timer stats: READ_VIIRS 30.5470
83: timer stats: READ_VIIRS 30.5470
80: timer stats: READ_VIIRS 30.5475
93: timer stats: READ_OZONE 0.115166
94: timer stats: READ_OZONE 0.517453
85: timer stats: READ_OZONE 0.587360
89: timer stats: READ_OZONE 0.649517
81: timer stats: READ_BUFRTOVS 0.829050
75: timer stats: READ_PREPBUFR 1.47294
79: timer stats: READ_NSSTBUFR 1.90115
65: timer stats: READ_BUFRTOVS 2.64807
64: timer stats: READ_BUFRTOVS 2.64672
69: timer stats: READ_BUFRTOVS 2.66557
68: timer stats: READ_BUFRTOVS 2.66498
67: timer stats: READ_BUFRTOVS 2.81424
66: timer stats: READ_BUFRTOVS 2.79706
82: timer stats: READ_BUFRTOVS 2.49398
86: timer stats: READ_BUFRTOVS 2.60691
44: timer stats: READ_AVHRR 10.2207
45: timer stats: READ_AVHRR 10.2207
46: timer stats: READ_AVHRR 10.2207
47: timer stats: READ_AVHRR 10.2208
41: timer stats: READ_AVHRR 10.2206
42: timer stats: READ_AVHRR 10.2206
43: timer stats: READ_AVHRR 10.2206
40: timer stats: READ_AVHRR 10.2208
36: timer stats: READ_AVHRR 11.0368
35: timer stats: READ_AVHRR 11.0369
38: timer stats: READ_AVHRR 11.0368
34: timer stats: READ_AVHRR 11.0372
39: timer stats: READ_AVHRR 11.0368
37: timer stats: READ_AVHRR 11.0369
33: timer stats: READ_AVHRR 11.0373
32: timer stats: READ_AVHRR 11.0363
80: timer stats: READ_GPS 3.62201
88: timer stats: READ_BUFRTOVS 3.62867
74: timer stats: READ_PREPBUFR 4.07518
72: timer stats: READ_PREPBUFR 4.60147
77: timer stats: READ_PREPBUFR 5.14285
15: timer stats: READ_CRIS 21.6790
13: timer stats: READ_CRIS 21.6790
6: timer stats: READ_CRIS 21.6788
14: timer stats: READ_CRIS 21.6790
7: timer stats: READ_CRIS 21.6788
12: timer stats: READ_CRIS 21.6789
8: timer stats: READ_CRIS 21.6789
9: timer stats: READ_CRIS 21.6788
10: timer stats: READ_CRIS 21.6789
11: timer stats: READ_CRIS 21.6789
1: timer stats: READ_CRIS 21.6788
2: timer stats: READ_CRIS 21.6788
3: timer stats: READ_CRIS 21.6788
4: timer stats: READ_CRIS 21.6788
5: timer stats: READ_CRIS 21.6788
0: timer stats: READ_CRIS 21.6780
1: timer stats: READ_OZONE 0.129847
7: timer stats: READ_OZONE 0.651767
73: timer stats: READ_PREPBUFR 8.52165
71: timer stats: READ_PREPBUFR 8.64337
70: timer stats: READ_PREPBUFR 8.68001
3: timer stats: READ_BUFRTOVS 2.54108
60: timer stats: READ_ABI 26.8622
59: timer stats: READ_ABI 26.8622
62: timer stats: READ_ABI 26.8623
57: timer stats: READ_ABI 26.8621
61: timer stats: READ_ABI 26.8623
58: timer stats: READ_ABI 26.8622
63: timer stats: READ_ABI 26.8623
56: timer stats: READ_ABI 26.8621
49: timer stats: READ_ABI 27.4099
50: timer stats: READ_ABI 27.4099
51: timer stats: READ_ABI 27.4099
52: timer stats: READ_ABI 27.4099
53: timer stats: READ_ABI 27.4099
54: timer stats: READ_ABI 27.4100
55: timer stats: READ_ABI 27.4100
48: timer stats: READ_ABI 27.4071
4: timer stats: READ_AHI 6.75684
76: timer stats: READ_PREPBUFR 14.2980
0: timer stats: READ_OZONE 10.9982
2: timer stats: READ_OZONE 11.0451
6: timer stats: READ_OZONE 11.4480
91: timer stats: READ_ATMS 33.2561
90: timer stats: READ_ATMS 33.5066
84: timer stats: READ_SSMIS 36.9816
5: timer stats: READ_ATMS 33.2350
83: timer stats: READ_SSMIS 44.8104
92: timer stats: READ_GMI 76.9536
78: timer stats: READ_SATWND 116.393
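For reference, a minimal sketch of how the per-reader timings above could be summarized from the two stdout logs; the log file names are placeholders:

```bash
# Report the maximum wall time per BUFR reader from each gsi.x stdout log.
# Lines look like: "78: timer stats: READ_SATWND 435.596"
for log in orion_stdout hercules_stdout; do
  echo "== $log =="
  grep "timer stats: READ_" "$log" |
    awk '{ if ($NF > max[$(NF-1)]) max[$(NF-1)] = $NF }
         END { for (r in max) printf "%-14s %10.3f\n", r, max[r] }' | sort
done
```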
As a test, the same GSI code was compiled on Hera and Dogwood (WCOSS2) and gsi.x was run using the global_4denvar ctest. Processing of the satellite wind bufr file (read_satwnd) took the most time of all the bufr readers. A comparison of timings across machines shows that Hercules, Hera, and Dogwood are comparable; Orion is the outlier.
| machine  | total read_obs time (s) | read_satwnd time (s) | total gsi.x wall time (s) |
|----------|-------------------------|----------------------|---------------------------|
| dogwood  | 169.988                 | 120.264              | 403.800888                |
| hera     | 194.612                 | 137.553              | 466.024682                |
| hercules | 164.571                 | 116.393              | 444.405860                |
| orion    | 608.372                 | 435.489              | 964.591370                |
The same comment has been added to GSI issue #771
@jbathegit, @jack-woollen, @katefriedman-noaa, @DavidHuber-NOAA : have any of you noticed an increase in the time it takes to read bufr files on Orion following the Rocky 9 upgrade?
gsi.x is seeing significant increases in the time it takes to read global bufr dump files on Orion after the Rocky 9 upgrade. FYI, gsi.x is built using bufr/11.7.0 on all machines.
See GSI issue #771 for additional details.
bufr/12.0.1 test
Recompiled gsi.x on Orion using bufr/12.0.1 and reran the global_4denvar ctest. The gsi.x wall time was 1143.031401 seconds, even slower than the bufr/11.7.0 build. read_obs took 771.891 seconds (read_satwnd accounts for 562.736 s).
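For reference, a minimal sketch of how a different bufr version could be selected for the rebuild, assuming the GSI modulefiles honor an environment-variable override via the os.getenv(...) pattern shown in the diffs above; the variable name bufr_ver is an assumption, not confirmed here:

```bash
# Select a different BUFR version before rebuilding (bufr_ver is assumed to be
# the override variable, following the os.getenv(...) pattern in the modulefiles).
export bufr_ver=12.0.1
cd /work/noaa/da/rtreadon/git/gsi/test_orion/ush   # path from the build transcript above
./build.sh
```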
debufr test
Used the bufr utility debufr to decode 5 gdas bufr dump files from the global_4denvar ctests. debufr was run on WCOSS2 (Dogwood), Hera, Hercules, and Orion. debufr timings are 3x to 4x greater on Orion than on the other machines. NCEPLIBS-bufr issue #608 was opened to document this behavior.
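For reference, a minimal sketch of the debufr timing comparison, assuming the bufr module places debufr on PATH; the dump file name is a placeholder:

```bash
# Time decoding of one gdas dump file; repeat the same command on
# Orion, Hercules, Hera, and Dogwood for comparison.
# (Assumes MODULEPATH already points at the spack-stack environment, as set up above.)
module load bufr/11.7.0
time debufr gdas.t00z.satwnd.tm00.bufr_d
```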
@AlexanderRichert-NOAA suggested compiling the GSI and ENKF on Hercules, then running the ctests on Orion with those executables. After compiling, gsi_common.lua was modified to point to the Hercules BUFR installation, then the ctests were run. Runtimes were largely the same as reported by @RussTreadon-NOAA:
runtime.global_4denvar_hiproc_contrl.txt: The total amount of wall time = 1394.165947
runtime.global_4denvar_hiproc_updat.txt: The total amount of wall time = 751.449358
runtime.global_4denvar_loproc_contrl.txt: The total amount of wall time = 1059.320163
runtime.global_4denvar_loproc_updat.txt: The total amount of wall time = 986.416513
runtime.global_enkf_hiproc_contrl.txt: The total amount of wall time = 159.291133
runtime.global_enkf_hiproc_updat.txt: The total amount of wall time = 160.180497
runtime.global_enkf_loproc_contrl.txt: The total amount of wall time = 204.069604
runtime.global_enkf_loproc_updat.txt: The total amount of wall time = 186.185513
runtime.hafs_3denvar_hybens_hiproc_contrl.txt: The total amount of wall time = 635.210342
runtime.hafs_3denvar_hybens_hiproc_updat.txt: The total amount of wall time = 625.259733
runtime.hafs_3denvar_hybens_loproc_contrl.txt: The total amount of wall time = 719.252038
runtime.hafs_3denvar_hybens_loproc_updat.txt: The total amount of wall time = 664.469968
runtime.hafs_4denvar_glbens_hiproc_contrl.txt: The total amount of wall time = 686.031241
runtime.hafs_4denvar_glbens_hiproc_updat.txt: The total amount of wall time = 677.697564
runtime.hafs_4denvar_glbens_loproc_contrl.txt: The total amount of wall time = 804.612061
runtime.hafs_4denvar_glbens_loproc_updat.txt: The total amount of wall time = 738.414877
runtime.rtma_hiproc_contrl.txt: The total amount of wall time = 354.441046
runtime.rtma_hiproc_updat.txt: The total amount of wall time = 351.002411
runtime.rtma_loproc_contrl.txt: The total amount of wall time = 362.241694
runtime.rtma_loproc_updat.txt: The total amount of wall time = 366.629561
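For reference, a minimal sketch of rerunning a subset of these regression tests from the build directory, assuming the tree was configured with -DBUILD_REG_TESTING=ON as in the build transcript above:

```bash
# Rerun only the global_4denvar regression tests and print output on failure.
cd /work/noaa/da/rtreadon/git/gsi/test_orion/build
ctest -R global_4denvar --output-on-failure
```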
I asked around, and JCSDA is not seeing specific file I/O cases of severe slowdown in applications. But someone mentioned that it is important to use srun instead of mpiexec, since srun gets more attention from the Orion IT folks with regard to tuning and optimization. I figure everyone is probably aware of this, so apologies for repeating information, but I just wanted to make sure this point is made.
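For reference, a minimal sketch of launching with srun instead of mpiexec in a Slurm batch script; the job sizes and file names are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=gsi_test
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=24
#SBATCH --time=01:00:00

# Launch the executable with srun rather than mpiexec, per the suggestion above.
srun -n "$SLURM_NTASKS" ./gsi.x > stdout 2>&1
```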
Thanks for checking @srherbener
Update: JCSDA has now seen SOCA running much slower than it should on Orion. Here are excerpts from a Slack discussion today:
Travis: has anyone noticed orion going slower than it should? I haven't used it since the upgrade, but I'm seeing today the marine skylab 3dvar going a lot slower than it should. I'm not sure if it's orion being orion, or if a regression slipped into soca lately.
Cory mentions this issue
Travis: yep, that tracks... soca var is taking 2x as long as it should
Travis added: yeah, it's more than file I/O though. The 3dvar is slowing down to a crawl inside the minimizer where no I/O is being done
Is that with GNU only (UFS and JEDI), or is that also with Intel?
> Is that with GNU only (UFS and JEDI), or is that also with Intel?

@travissluka can you help clarify? Thanks!
GNU (I didn't try intel), and JEDI only
Would be useful to know if the GSI slowdown reported above for Intel also happens with GNU, and vice versa if the JEDI slowdown also happens with Intel. From what it sounds like, yes. That would be another good data point that the problem is somewhere in the underlying fabric or slurm integration with it.
Another data point: I'm trying on hercules/intel and it's going just as slow (maybe even slower, actually) for SOCA. File I/O doesn't seem to be a particular problem; it seems to be MPI related. If I throw more nodes at it, the variational solver goes even slower (inverse scalability!).
I believe I found the issue with the Intel/GCC compilers on Orion. They are bitwise identical to those on Hercules. The intel-oneapi-compilers/2023.1.0.lua module file for Orion indicates that the Intel compilers are built for Icelake infrastructure, but Orion's CPUs are Skylake. This may create potentially illegal or less-than-optimal instructions in certain circumstances. I have notified the Orion team and am awaiting a response, but will follow up here when I hear back.
Since the gcc compiler on Orion (/usr/bin/gcc) is identical to Hercules, I believe it may share the same problem. To test this, I installed Intel 2023.1.0 (for Skylake infrastructure) via spack and compiled BUFR with it. The debufr application still runs just as slow this way as with the Icelake installation of Intel compilers. I believe, but cannot prove yet, that this is because Intel is referencing Icelake GNU libraries. To prove it, I think I would need to set up a container on Orion with its own GCC and GNU libraries. Not having much experience with containers, it would be nice to have some help setting this up.
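For reference, a minimal sketch of checks that could help confirm the architecture mismatch described above; the grep patterns are illustrative:

```bash
# Compare the host CPU family with the target the compiler module was built for.
lscpu | grep -i 'model name'      # Orion nodes report Skylake-era CPUs, per the comment above
module show intel-oneapi-compilers/2023.1.0 2>&1 | grep -iE 'icelake|skylake|march|target'
```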
Rebuilt GSI develop at 3e27bb8 on Orion following the 7/24/2024 maintenance. No appreciable change in gsi.x wall time. The GSI still runs slower on Orion Rocky 9 than it did on Orion Centos 7.
Describe the bug
gsi.x and enkf.x wall times significantly increase following the Orion Rocky 9 upgrade. See this comment in GSI issue #754 for details.

To Reproduce
Steps to reproduce the behavior:

Expected behavior
We expect Orion Rocky 9 gsi.x and enkf.x wall times to be comparable with what we observed when building and running on Orion Centos 7.

System: Orion

Additional context
The modulefiles used to build gsi.x and enkf.x are gsi_orion.intel.lua and gsi_common.lua. gsi_orion.intel.lua sets ...

Ticket RDHPCS#2024062754000098 has been opened with the Orion helpdesk. Perhaps system settings changed with the Orion Rocky 9 upgrade.