NOAA-EMC / hpc-stack

Create a software stack for HPC's
GNU Lesser General Public License v2.1

Updated build scripts, config files, and stack yaml files #518

Closed natalie-perlin closed 1 year ago

natalie-perlin commented 1 year ago

Updates to build scripts in the libs/ directory, config files for several platforms, and stack yaml files with higher versions of the libraries to build.


The following updates were made:

  1. ./libs/build_udunits.sh - updated the URL for downloading the source code; it is now URL=https://downloads.unidata.ucar.edu/udunits/$version/$software.tar.gz

  2. ./libs/build_nceplibs.sh

     2.1. Allow building the crtm/2.4.0 library with MPI by passing the CMAKE_Fortran_COMPILER=${FC} variable during the cmake stage. This requires a module template for the MPI version of the crtm.lua modulefile (created).
     2.2. Updated the download location of the fix files for both v2.3.0 and v2.4.0; it is now http://ftp.ssec.wisc.edu/pub/s4/CRTM/$crtm_tarball
     2.3. Updated the recipe to unpack the fix-file tarballs for either v2.3.0 or v2.4.0 and install them in the location of the corresponding crtm library version.
     2.4. Building the higher version ip/4.0.0 requires the sp module to be loaded.

  3. ./libs/build_netcdf.sh - added two options during the configure stage that are needed to build netcdf versions higher than v4.9.0:

     --disable-libxml2 \
     --disable-byterange \

  4. ./libs/build_mapl.sh - added an option during the cmake stage to build v2.34.0 and higher: -DBUILD_WITH_FARGPARSE=NO \

  5. ./build_met.sh - added libraries that were missing during linking: export LIBS="-lhdf5_hl -lhdf5 -lz -ldl"

  6. In the ./config/ directory, updated config_hera.sh, config_jet.sh, config_macos_gnu.sh, config_gaea.sh, config_orion.sh, config_cheyenne_intel.sh, and config_cheyenne_gnu.sh to match the configurations used to build EPIC-maintained stacks on the corresponding platforms. Added config_noaacloud.sh to build the stack on NOAA cloud providers (additional libraries, such as git, git-lfs, and m4, may need to be installed separately to allow the builds). Location of the hpc-stack on AWS provided by ParallelWorks: /contrib/EPIC/hpc-stack/src-intel-2021.3.0/ (source and build logs) and /contrib/EPIC/hpc-stack/intel-2021.3.0/ (installation location).

  7. In the ./stack/ directory, updated stack_custom.yaml, stack_noaa.yaml, and stack_macos.yaml with higher versions of the libraries that are predominantly used by UFS-SRW and UFS-WM.

  8. Documentation correction: ./docs/source/hpc-install.rst, mac-install.rst. The use of the term "native" in the compiler/python question during modules setup is now explained correctly.

  9. A modulefile template crtm.lua for the MPI version: ./mpi/compilerName/compilerVersion/mpiName/mpiVersion/crtm/crtm.lua

  10. Updated ./libs/build_hdf5.sh: removed the configure option --enable-static-exec.
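
The version-dependent options in the netcdf and mapl items above could be gated roughly as follows. This is only a sketch, not the code from the PR: the helper names (version_ge, version_gt, netcdf_extra_flags, mapl_extra_flags) are hypothetical, the comparison relies on `sort -V` being available, and it assumes "higher than v4.9.0" means strictly greater for netcdf while the mapl option applies from v2.34.0 onward.

```shell
#!/usr/bin/env bash

# Return 0 if version $1 >= version $2 (assumes a sort that supports -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Return 0 if version $1 > version $2.
version_gt() {
  [ "$1" != "$2" ] && version_ge "$1" "$2"
}

# Hypothetical helper: extra ./configure flags for a given netcdf-c version.
netcdf_extra_flags() {
  if version_gt "$1" "4.9.0"; then
    echo "--disable-libxml2 --disable-byterange"
  fi
}

# Hypothetical helper: extra cmake flags for a given mapl version.
mapl_extra_flags() {
  if version_ge "$1" "2.34.0"; then
    echo "-DBUILD_WITH_FARGPARSE=NO"
  fi
}
```

A build script could then append `$(netcdf_extra_flags "$version")` to its configure command, so that older versions are configured exactly as before.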

natalie-perlin commented 1 year ago

Looks good. One minor typo that should be corrected in a future pr.

Corrected here.

aerorahul commented 1 year ago

@natalie-perlin Can you please ensure that the tests are passing? It seems like there is an issue in building the CRTM from the GH action logs.

Hang-Lei-NOAA commented 1 year ago

The changes about findnetcdf in the build_nceplibs.sh may need to be modified.

DavidHuber-NOAA commented 1 year ago

@natalie-perlin Do you mind if I piggy back on this? I can update the S4 configuration file and add myself as the installer in the README with a PR into your branch.

natalie-perlin commented 1 year ago

@natalie-perlin Do you mind if I piggy back on this? I can update the S4 configuration file and add myself as the installer in the README with a PR into your branch.

Sure, thank you!

natalie-perlin commented 1 year ago

The changes about findnetcdf in the build_nceplibs.sh may need to be modified.

Any suggestions on what may need modification?

I've tested a build of crtm/2.4.0 without the modifications to build_nceplibs.sh and FindNetCDF.cmake. On most platforms (Hera, Cheyenne, Jet, MacOS), but not on Gaea, the build completed without the modification to ./crtm-v2.4.0/cmake/FindNetCDF.cmake that comments out line 186, find_package(MPI REQUIRED). The issue on Gaea appears to be related to the cmake version expecting certain variables to be set when Fortran MPI is used, namely MPI_Fortran_LIB_NAMES, MPI_Fortran_F77_HEADER_DIR, MPI_Fortran_MODULE_DIR, and MPI_Fortran_WORKS. They are not set by the Cray environment of cray-mpich on Gaea. Some relevant info from the log file /lustre/f2/dev/role.epic/contrib/hpc-stack/src-intel-test3/log/crtm.log:

Currently Loaded Modules:
  1) modules/3.2.11.4                                 19) alps/6.6.59-7.0.2.1_3.85__g872a8d62.ari
  2) CmrsEnv                                          20) atp/2.1.3
  3) TimeZoneEDT                                      21) rca/2.2.20-7.0.2.1_2.93__g8e3fb5b.ari
  4) globus-toolkit/6.0.17                            22) perftools-base/7.1.3
  5) darshan/3.2.1                                    23) PrgEnv-intel/6.0.5
  6) DefApps                                          24) craype-broadwell
  7) craype-network-aries                             25) cmake/3.20.1
  8) eproxy/2.0.24-7.0.2.1_2.37__g8e04b33.ari         26) miniconda3/4.12.0
  9) craype/2.6.3                                     27) git/2.31.1
 10) cray-libsci/19.06.1                              28) git-lfs/2.11.0
 11) udreg/2.3.2-7.0.2.1_2.52__g8175d3d.ari           29) hpc/1.2.0
 12) ugni/6.0.14.0-7.0.2.1_3.77__ge78e5b0.ari         30) intel/2021.3.0
 13) pmi/5.0.15                                       31) hpc-intel/2021.3.0
 14) dmapp/7.1.1-7.0.2.1_2.98__g38cf134.ari           32) cray-mpich/7.7.11
 15) gni-headers/5.0.12.0-7.0.2.1_2.34__g3b1768f.ari  33) hpc-cray-mpich/7.7.11
 16) xpmem/2.2.20-7.0.2.1_2.72__g87eb960.ari          34) hdf5/1.10.6
 17) job/2.2.4-7.0.2.1_2.86__g36b56f4.ari             35) netcdf/4.7.4
 18) dvs/2.12_2.2.177-7.0.2.1_13.5__g0b75e43d

 ...
 + cd crtm-v2.4.0
+ [[ -d build ]]
+ mkdir -p build
+ cd build
+ cmake .. -DCMAKE_INSTALL_PREFIX=/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-test3/intel-2021.3.0/crtm/2.4.0 -DENABLE_TESTS=OFF -DOPENMP=OFF
-- The Fortran compiler identification is Intel 20.2.3.20210609
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /opt/intel/oneapi/compiler/2021.3.0/linux/bin/intel64/ifort - skipped
-- Checking whether /opt/intel/oneapi/compiler/2021.3.0/linux/bin/intel64/ifort supports Fortran 90
-- Checking whether /opt/intel/oneapi/compiler/2021.3.0/linux/bin/intel64/ifort supports Fortran 90 - yes
-- Setting build type to 'Release' as none was specified.
-- Found OpenMP_Fortran: -qopenmp (found version "5.0") 
-- Found OpenMP: TRUE (found version "5.0") found components: Fortran 
-- Could NOT find MPI_Fortran (missing: MPI_Fortran_LIB_NAMES MPI_Fortran_F77_HEADER_DIR MPI_Fortran_MODULE_DIR MPI_Fortran_WORKS) 
CMake Error at /ncrc/sw/gaea-cle7/uasw/ncrc/envs/20200417/opt/linux-sles15-x86_64/gcc-7.5.0/cmake-3.20.1-w7tkahac22qulhbcbi6io54u5dfr36zs/share/cmake-3.20/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find MPI (missing: MPI_Fortran_FOUND)
Call Stack (most recent call first):
  /ncrc/sw/gaea-cle7/uasw/ncrc/envs/20200417/opt/linux-sles15-x86_64/gcc-7.5.0/cmake-3.20.1-w7tkahac22qulhbcbi6io54u5dfr36zs/share/cmake-3.20/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  /ncrc/sw/gaea-cle7/uasw/ncrc/envs/20200417/opt/linux-sles15-x86_64/gcc-7.5.0/cmake-3.20.1-w7tkahac22qulhbcbi6io54u5dfr36zs/share/cmake-3.20/Modules/FindMPI.cmake:1742 (find_package_handle_standard_args)
  cmake/FindNetCDF.cmake:186 (find_package)
  CMakeLists.txt:28 (find_package)

-- Configuring incomplete, errors occurred!
See also "/lustre/f2/dev/role.epic/contrib/hpc-stack/src-intel-test3/pkg/crtm-v2.4.0/build/CMakeFiles/CMakeOutput.log".

How could I view the GH logs to compare the issues reported there?
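
For reference, the FindNetCDF.cmake modification discussed in this comment (commenting out the find_package(MPI REQUIRED) call) could be applied with a small shell helper along these lines. This is a sketch under assumptions: the function name comment_out_find_mpi is hypothetical, and it matches the call by pattern rather than by line number 186, so it does not depend on the exact layout of the file.

```shell
#!/usr/bin/env bash

# Hypothetical helper: comment out any uncommented
# `find_package(MPI REQUIRED)` line in the given CMake file.
# A backup copy is kept next to the file with a .bak suffix.
comment_out_find_mpi() {
  local file="$1"
  sed -i.bak -E 's/^([[:space:]]*)(find_package\(MPI REQUIRED\))/\1# \2/' "$file"
}
```

Running it a second time leaves the file unchanged, because the already commented line no longer matches the pattern.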

aerorahul commented 1 year ago

How could I view the GH logs to compare the issues reported there?

You can navigate to the "Actions" tab on the hpc-stack repo page and find the action that failed, or navigate to the "Checks" tab on the PR. From there, you can find the action that failed and download the log artifact.

natalie-perlin commented 1 year ago

@aerorahul The Intel build failure in https://github.com/NOAA-EMC/hpc-stack/actions/runs/4782048542 shows a "No space left on device" error and warning:

 build
Unhandled exception. System.IO.IOException: No space left on device : '/home/runner/runners/2.303.0/_diag/Worker_20230424-030433-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at System.Diagnostics.TraceSource.Flush()
   at GitHub.Runner.Common.TraceManager.Dispose(Boolean disposing)
   at GitHub.Runner.Common.TraceManager.Dispose()
   at GitHub.Runner.Common.HostContext.Dispose(Boolean disposing)
   at GitHub.Runner.Common.HostContext.Dispose()
   at GitHub.Runner.Worker.Program.Main(String[] args)
System.IO.IOException: No space left on device : '/home/runner/runners/2.303.0/_diag/Worker_20230424-030433-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
System.IO.IOException: No space left on device : '/home/runner/runners/2.303.0/_diag/Worker_20230424-030433-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Common.Tracing.Error(Exception exception)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
build
You are running out of disk space. The runner will stop working when the machine runs out of disk space. Free space left: 74 MB
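
A guard like the one below could make a low-disk condition visible before a long build starts. It is a minimal sketch: the helper names free_mb and require_free_mb are hypothetical, and the threshold a build would actually need is not known from this thread.

```shell
#!/usr/bin/env bash

# Print the free space, in whole MB, of the filesystem holding path $1
# (defaults to the current directory). Uses POSIX `df -P` output, where
# column 4 of the second line is the available space in 1K blocks.
free_mb() {
  df -Pk "${1:-.}" | awk 'NR==2 { print int($4 / 1024) }'
}

# Return non-zero if fewer than $2 MB are free under path $1.
require_free_mb() {
  local path="$1" need="$2" have
  have="$(free_mb "$path")"
  if [ "$have" -lt "$need" ]; then
    echo "Only ${have} MB free under ${path}; need ${need} MB" >&2
    return 1
  fi
}
```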

DavidHuber-NOAA commented 1 year ago

@natalie-perlin I have submitted a PR to your branch.

aerorahul commented 1 year ago

@natalie-perlin This is the root cause of the failure. I peeled this from https://github.com/NOAA-EMC/hpc-stack/actions/runs/4791351281/jobs/8521648314#step:7:18785

CMake Error at cmake/FindNetCDF.cmake:225 (message):
  Unable to properly find NetCDF.  Found static libraries at:
  /home/runner/work/hpc-stack/hpc-stack/mpich/pkg/crtm-v2.4.0/NetCDF_C_LIBRARY-NOTFOUND
  but could not run nc-config:
Call Stack (most recent call first):
  CMakeLists.txt:28 (find_package)

CMake Error at cmake/FindNetCDF.cmake:225 (message):
  Unable to properly find NetCDF.  Found static libraries at:
  /home/runner/work/hpc-stack/hpc-stack/mpich/pkg/crtm-v2.4.0/NetCDF_Fortran_LIBRARY-NOTFOUND
  but could not run nc-config:
Call Stack (most recent call first):
  CMakeLists.txt:28 (find_package)

CMake Error at /usr/local/share/cmake-3.26/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find NetCDF (missing: NetCDF_INCLUDE_DIRS NetCDF_LIBRARIES
  Fortran)
Call Stack (most recent call first):
  /usr/local/share/cmake-3.26/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
  cmake/FindNetCDF.cmake:291 (find_package_handle_standard_args)
  CMakeLists.txt:28 (find_package)

-- Configuring incomplete, errors occurred!
BUILD FAIL!  Lib: crtm Error:1

natalie-perlin commented 1 year ago

@aerorahul - Yes, thank you. I planned to diagnose how these variables (NetCDF_INCLUDE_DIRS, NetCDF_LIBRARIES) are set on other platforms, and maybe set them explicitly during cmake. Another option is to set these variables in the netcdf modulefile (which could be a better approach).
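
Since the CMake error reports that nc-config could not be run, a first diagnostic step might be to confirm that nc-config is on the PATH and see what it reports. The sketch below is an illustration only: check_nc_config is a hypothetical helper, and it does not show how CRTM's FindNetCDF.cmake consumes these values.

```shell
#!/usr/bin/env bash

# Hypothetical helper: verify that nc-config is reachable and print the
# include directory and link flags it reports.
check_nc_config() {
  if ! command -v nc-config >/dev/null 2>&1; then
    echo "nc-config not found on PATH" >&2
    return 1
  fi
  echo "includedir: $(nc-config --includedir)"
  echo "libs: $(nc-config --libs)"
}
```

If the helper fails right after the netcdf module is loaded, the problem most likely lies in the netcdf installation or modulefile rather than in the crtm build itself.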

aerorahul commented 1 year ago

@natalie-perlin I inspected your build log. It appears that the hdf5 build fails silently. See the logs starting at line 10784.

/usr/bin/ld: ../../../src/.libs/libhdf5.a(H5PLint.o): in function `H5PL__open':
H5PLint.c:(.text+0x533): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: /usr/lib/x86_64-linux-gnu/libmpich.a(mpl_sockaddr.o): in function `MPL_get_sockaddr':
(.text+0xca): warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: /usr/lib/x86_64-linux-gnu/librt.a(aio_misc.o): in function `handle_fildes_io':
(.text+0x11e): undefined reference to `pthread_getschedparam'
/usr/bin/ld: (.text+0x162): undefined reference to `pthread_setschedparam'
/usr/bin/ld: /usr/lib/x86_64-linux-gnu/librt.a(aio_misc.o): in function `__aio_enqueue_request':
(.text+0x887): undefined reference to `pthread_getschedparam'
/usr/bin/ld: (.text+0xb56): undefined reference to `pthread_attr_setstacksize'
collect2: error: ld returned 1 exit status
make[3]: *** [Makefile:863: h5diff] Error 1
make[3]: Leaving directory '/home/runner/work/hpc-stack/hpc-stack/mpich/pkg/hdf5-1_10_6/build/tools/src/h5diff'
make[2]: *** [Makefile:813: all-recursive] Error 1
make[2]: Leaving directory '/home/runner/work/hpc-stack/hpc-stack/mpich/pkg/hdf5-1_10_6/build/tools/src'
make[1]: *** [Makefile:813: all-recursive] Error 1
make[1]: Leaving directory '/home/runner/work/hpc-stack/hpc-stack/mpich/pkg/hdf5-1_10_6/build/tools'
make: *** [Makefile:660: all-recursive] Error 1
call /home/runner/work/hpc-stack/hpc-stack/mpich/libs/build_hdf5.sh for hdf5 build , log is /home/runner/work/hpc-stack/hpc-stack/mpich/log/hdf5.log 
BUILD SUCCESS! Lib: hdf5

Following that, netcdf also fails silently. See the logs starting at line 10977.

checking whether the C compiler works... no
configure: error: in `/home/runner/work/hpc-stack/hpc-stack/mpich/pkg/netcdf-c-4.7.4/build':
configure: error: C compiler cannot create executables
See `config.log' for more details
call /home/runner/work/hpc-stack/hpc-stack/mpich/libs/build_netcdf.sh for netcdf build , log is /home/runner/work/hpc-stack/hpc-stack/mpich/log/netcdf.log 
BUILD SUCCESS! Lib: netcdf
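
Silent failures like the two above could be caught by scanning each library's log for error markers regardless of the reported status. The following is a sketch, assuming the markers seen in these logs are representative; log_has_errors is a hypothetical helper and is not part of hpc-stack.

```shell
#!/usr/bin/env bash

# Hypothetical helper: return 0 if the log file $1 contains any of the
# failure markers seen in the logs above (configure errors, make errors,
# or a failed link step), and non-zero otherwise.
log_has_errors() {
  grep -Eq 'configure: error:|^make(\[[0-9]+\])?: \*\*\* .* Error [0-9]+|ld returned [0-9]+ exit status' "$1"
}
```

A caller could run `log_has_errors "$logfile"` after each library build and treat a match as a failure even when the build script itself exited with status 0.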

natalie-perlin commented 1 year ago

@aerorahul @Hang-Lei-NOAA - I've made a couple more changes

The stack builds successfully with these changes on Orion, Gaea, Hera, MacOS, and ParallelWorks AWS (tested manually). The GitHub checks all seem to pass now. Please let me know if it's OK to merge it.

aerorahul commented 1 year ago

@natalie-perlin I have triggered the Intel build. As soon as it completes, the PR can be merged. The disk-space issue is a red herring, I think.

natalie-perlin commented 1 year ago

@aerorahul - the Intel build was skipped.

aerorahul commented 1 year ago

I am going to ignore the failure and address it separately.