NOAA-EMC / hpc-stack

Create a software stack for HPC's
GNU Lesser General Public License v2.1

Add Scotch library with intel and gnu compiler on Hera and Orion #501

Open aliabdolali opened 2 years ago

aliabdolali commented 2 years ago

Please describe the package or library you would like to add to hpc-stack.

The Scotch distribution is a set of programs and libraries which implement the static mapping and sparse matrix reordering algorithms developed within the Scotch project. We would like to utilize the graph partitioning capability of Scotch in the WW3 model. Here is the link to the Scotch repository: https://gitlab.inria.fr/scotch/scotch

What applications at NOAA will be using this package or library?

UFS-WEATHER-MODEL:

Is there already a package or library in hpc-stack that provides this, or related, functionality?

NA

Additional context

I have tested the compilation of SCOTCH with the GNU and Intel compilers on both Hera and Orion. Compiling with Intel is a bit tricky, so I have added step-by-step instructions here:

Hera

cd scotch
module purge
module load cmake/3.20.1
module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
module load netcdf/4.9.0
module load gnu

mkdir build
cd build

cmake -DCMAKE_Fortran_COMPILER=ifort -DCMAKE_C_COMPILER=icc -DCMAKE_INSTALL_PREFIX=<path to scotch>/install -DTHREADS=OFF -DCMAKE_BUILD_TYPE=Release .. |& tee cmake.out-rr

make VERBOSE=1 |& tee make.out-rr
make install
make scotch
make ptscotch

I compiled scotch following this instruction on Hera in /scratch2/COASTAL/coastal/save/Ali.Abdolali/hpc-stack/scotch/install

Orion


cd scotch
module purge

mkdir -p $HOME/modulefiles/gcc
cp /apps/modulefiles/core/gcc/10.2.0 $HOME/modulefiles/gcc/

Edit the file $HOME/modulefiles/gcc/10.2.0:
Comment out the 'family "compiler"' line
Comment out the 'prepend-path MODULEPATH' line

module use $HOME/modulefiles
module load gcc/10.2.0
module load cmake/3.22.1
module use /apps/contrib/NCEP/libs/hpc-stack/modulefiles/stack
module load hpc/1.1.0
module load hpc-intel/2018.4
module load hpc-impi/2018.4
module load netcdf/4.7.4
module load jasper/2.0.25
module load zlib/1.2.11
module load png/1.6.35
module load hdf5/1.10.6
module load bacio/2.4.1
module load g2/3.4.2
module load w3emc/2.9.2
module load esmf/8_2_0

mkdir build
cd build

cmake -DCMAKE_Fortran_COMPILER=ifort -DCMAKE_C_COMPILER=icc -DCMAKE_INSTALL_PREFIX=<path to scotch>/install -DTHREADS=OFF -DCMAKE_BUILD_TYPE=Release .. |& tee cmake.out-rr

make VERBOSE=1 |& tee make.out-rr
make install
make scotch
make ptscotch

I compiled scotch following this instruction on Orion in /work/noaa/marine/ali.abdolali/Source/hpc-stack/scotch/install
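
The modulefile edit in the Orion steps can also be scripted. The following is a minimal sketch, assuming the copied modulefile uses Tcl syntax (where '#' starts a comment) and that the two lines start with 'family' and 'prepend-path ... MODULEPATH' as quoted in the steps. The function name comment_out_module_lines is made up for illustration. If the file turns out to be a Lua modulefile, the comment prefix would need to be '--' instead.

```shell
# comment_out_module_lines FILE
# Prefix the 'family' line and the 'prepend-path MODULEPATH' line with '#'.
# Assumes a Tcl-syntax modulefile; the function name is illustrative only.
comment_out_module_lines() {
  file="$1"
  tmp="$(mktemp)" || return 1
  sed -e '/^[[:space:]]*family[[:space:]]/s/^/#/' \
      -e '/^[[:space:]]*prepend-path[[:space:]][[:space:]]*MODULEPATH/s/^/#/' \
      "$file" > "$tmp" || return 1
  # Write back in place so the file keeps its original permissions.
  cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example (path taken from the steps above):
#   comment_out_module_lines "$HOME/modulefiles/gcc/10.2.0"
```

Other prepend-path lines (for PATH, LD_LIBRARY_PATH and so on) are left untouched, so the module still puts the newer gcc on the search paths.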

The Intel compiler depends on GNU for some of its functionality and needs a reasonably recent version of the GNU compiler to work correctly. The default version of GNU that ships with most distributions is quite old, so if WCOSS2 has a more recent version as the default, it should simply work. Based on the GNU module loaded on Hera, it seems to work fine with gcc (Spack GCC) 9.2.0. If WCOSS2 has an older version of GNU, check whether a module with a more recent GNU compiler is available.
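
To see which gcc the Intel compiler would pick up from the environment, a quick check along these lines may help. This is only a sketch: 9.2.0 is the version reported to work on Hera above and is used here as an assumed threshold, not a documented minimum, and the helper name version_ge is made up for illustration.

```shell
# version_ge A B: succeeds when version A >= version B.
# Relies on 'sort -V' (version sort), available in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed threshold: the gcc version reported to work on Hera in this thread.
known_good="9.2.0"

if command -v gcc >/dev/null 2>&1; then
  # -dumpfullversion exists in GCC 7 and later; older GCC only has -dumpversion.
  found="$(gcc -dumpfullversion 2>/dev/null || gcc -dumpversion)"
  if version_ge "$found" "$known_good"; then
    echo "gcc $found is at least $known_good"
  else
    echo "gcc $found is older than $known_good; look for a newer GNU module"
  fi
else
  echo "no gcc found in PATH"
fi
```

Run it after the module loads, since the result depends on which gcc is first in PATH at that point.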

Will This Package be Needed in an Operational Application? Yes

WCOSS System Software Request Checklist

General questions:

Supervisor or sponsor of the requester: @AvichalMehra-NOAA The license of the package is approved by NCO.

Software name and version, specific URL to the software SCOTCH https://gitlab.inria.fr/scotch/scotch

Software type - New or Upgrade existing? New

Justification ( List NPS models using the software)

Completion Time requested

Software License including all Dependency Software Licenses 

Support contact(s) who must have a WCOSS account

Dependency Software list

Installation instructions

Test and verification instructions

Technical Review Checklist for open source software - Review the source code to answer the following questions

Licenses for the requested software and its dependencies

Licenses -  confirm the software Licenses are acceptable

Maturity

Acceptable - Stable, production, or equivalent

Self-contained

No external http, https,  ftp, or other URI exists except that in comments

No binary files in the package unless they are in the approved list

No publicly disclosed cybersecurity vulnerabilities and exposures 

Searching https://cve.mitre.org/cve/

Security High Level Checklists 

Is it prohibited by DHS/DOC/NOAA/NWS

Is it provided by a trusted source? Trusted sources include other NWS, NOAA, or DOC, agencies, or other Federal agencies that operate at a FISMA high or equivalent level. Additionally, trusted sources could be third-party agencies through which there is an existing SLA on file (such as RedHat). 

Is software support offered (is it being updated and patched). Yes, the main developers agreed to support the software.

If maintained by a private entity, does the entity operate in a foreign country, especially a prohibited foreign country (China, Russia, Iran, North Korea, etc.). 

Is there sufficient documentation to support maintenance Yes

Are there known vulnerabilities or weaknesses No

Is there a need for privileged processes 

Are there software dependencies, are those dependencies approved or do they have any security concerns 

Are there any other concerns related to SA, SI, and SC control families

Hang-Lei-NOAA commented 2 years ago

EPIC has handled the installations on Hera and Orion. Jong is the contact person.

aliabdolali commented 2 years ago

Hi @jkbk2004 do you have any status updates or a timeline? @AvichalMehra-NOAA

jkbk2004 commented 2 years ago

@aliabdolali Thanks for the reminder! On the EPIC side, I think we can follow up by early next week (say Monday). There are a few lib update checklists. Does that work on your side?

aliabdolali commented 2 years ago

@jkbk2004 Yes, it works for us. Thanks in advance.

jkbk2004 commented 2 years ago

@natalie-perlin @ulmononian We need to follow up on the scotch installation.

aliabdolali commented 2 years ago

Please describe the package or library you would like to add to hpc-stack.

The Scotch distribution is a set of programs and libraries which implement the static mapping and sparse matrix reordering algorithms developed within the Scotch project. We would like to utilize the graph partitioning capability of Scotch in the WW3 model. Here is the link to the Scotch repository: https://gitlab.inria.fr/scotch/scotch

What applications at NOAA will be using this package or library?

UFS-WEATHER-MODEL:

  • GFSv17
  • GEFSv13 UFS-COASTAL
  • GLWUv3

Is there already a package or library in hpc-stack that provides this, or related, functionality?

NA

Additional context

I have tested the compilation of SCOTCH with the GNU and Intel compilers on both Hera and Orion. Compiling with Intel is a bit tricky, so I have added step-by-step instructions here:

Hera

cd scotch
module purge
module load cmake/3.20.1
module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
module load netcdf/4.9.0
module load gnu

mkdir build
cd build

cmake -DCMAKE_Fortran_COMPILER=ifort -DCMAKE_C_COMPILER=icc -DCMAKE_INSTALL_PREFIX=<path to scotch>/install -DCMAKE_BUILD_TYPE=Release .. |& tee cmake.out-rr

make VERBOSE=1 |& tee make.out-rr
make install
make scotch
make ptscotch

I compiled scotch following this instruction on Hera in /scratch2/COASTAL/coastal/save/Ali.Abdolali/hpc-stack/scotch/install

Orion


cd scotch
module purge

mkdir -p $HOME/modulefiles/gcc
cp /apps/modulefiles/core/gcc/10.2.0 $HOME/modulefiles/gcc/

Edit the file $HOME/modulefiles/gcc/10.2.0:
Comment out the 'family "compiler"' line
Comment out the 'prepend-path MODULEPATH' line

module use $HOME/modulefiles
module load gcc/10.2.0
module load cmake/3.22.1
module use /apps/contrib/NCEP/libs/hpc-stack/modulefiles/stack
module load hpc/1.1.0
module load hpc-intel/2018.4
module load hpc-impi/2018.4
module load netcdf/4.7.4
module load jasper/2.0.25
module load zlib/1.2.11
module load png/1.6.35
module load hdf5/1.10.6
module load bacio/2.4.1
module load g2/3.4.2
module load w3emc/2.9.2
module load esmf/8_2_0

mkdir build
cd build

cmake -DCMAKE_Fortran_COMPILER=ifort -DCMAKE_C_COMPILER=icc -DCMAKE_INSTALL_PREFIX=<path to scotch>/install -DCMAKE_BUILD_TYPE=Release .. |& tee cmake.out-rr

make VERBOSE=1 |& tee make.out-rr
make install
make scotch
make ptscotch

I compiled scotch following this instruction on Orion in /work/noaa/marine/ali.abdolali/Source/hpc-stack/scotch/install

The Intel compiler depends on GNU for some of its functionality and needs a reasonably recent version of the GNU compiler to work correctly. The default version of GNU that ships with most distributions is quite old, so if WCOSS2 has a more recent version as the default, it should simply work. Based on the GNU module loaded on Hera, it seems to work fine with gcc (Spack GCC) 9.2.0. If WCOSS2 has an older version of GNU, check whether a module with a more recent GNU compiler is available.

Will This Package be Needed in an Operational Application? Yes

I updated the instructions on Hera and Orion.

ulmononian commented 2 years ago

@aliabdolali thank you for providing your installation instructions. i am currently working to add this to the hpc-stack on hera (@natalie-perlin will be handling orion), and will update you once it is added.

to aid in determining which stack to add this to, do you require netcdf-4.9.0 or higher to run scotch w/ your applications?

thank you!

JessicaMeixner-NOAA commented 1 year ago

@aliabdolali is the expert, but from everything I have read and tested, we do not require netcdf-4.9.0 or higher.

@MatthewMasarik-NOAA and I are testing the install instructions above to see if we need the lines:

make scotch
make ptscotch

after the make install, as they do not appear to be needed according to the SCOTCH install instructions here: https://gitlab.inria.fr/scotch/scotch/-/blob/master/INSTALL.txt#L60-151

MatthewMasarik-NOAA commented 1 year ago

I revised the install instructions to drop the make scotch and make ptscotch lines mentioned above, plus an edit for the hpc-stack role.epic module file path. These were used to successfully build and run the WW3 regression tests on hera.

Hera SCOTCH Install

# https://gitlab.inria.fr/scotch/scotch.git

cd scotch

module purge
module load cmake/3.20.1
module load intel/2022.1.2
module load impi/2022.1.2
module use  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
module load hdf5/1.10.6
module load netcdf/4.7.4
module load gnu/9.2.0

mkdir build && cd build
cmake -DCMAKE_Fortran_COMPILER=ifort            \
      -DCMAKE_C_COMPILER=icc                    \
      -DCMAKE_INSTALL_PREFIX=<path-to>/install  \
      -DCMAKE_BUILD_TYPE=Release ..             |& tee cmake.out
make  VERBOSE=1                                 |& tee make.out
make  install
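
One way to check that make install alone produced both libraries is to look for them under the install prefix. The following is a sketch only: the library names libscotch and libptscotch, the lib/lib64 layout, and the helper names has_lib and check_scotch_install are assumptions for illustration, not something confirmed in this thread.

```shell
# has_lib PREFIX NAME: succeeds if NAME.a or NAME.so exists under
# PREFIX/lib or PREFIX/lib64. Function name and layout are assumptions.
has_lib() {
  prefix="$1"
  name="$2"
  for dir in "$prefix/lib" "$prefix/lib64"; do
    for ext in a so; do
      [ -f "$dir/$name.$ext" ] && return 0
    done
  done
  return 1
}

# check_scotch_install PREFIX: report whether both libraries were installed.
check_scotch_install() {
  status=0
  for lib in libscotch libptscotch; do
    if has_lib "$1" "$lib"; then
      echo "found $lib"
    else
      echo "missing $lib"
      status=1
    fi
  done
  return "$status"
}

# Example, using the same placeholder as the cmake line above:
#   check_scotch_install <path-to>/install
```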

ulmononian commented 1 year ago

@JessicaMeixner-NOAA @aliabdolali @MatthewMasarik-NOAA has anyone on the team had success installing scotch on cheyenne?

also: is version 7.0.1 still your preference? i noticed 7.0.3 is out w/ full cmake installation functionality.

jkbk2004 commented 1 year ago

@DeniseWorthen can you jump in?

MatthewMasarik-NOAA commented 1 year ago

@ulmononian regarding version, we would like 7.0.3 installed on hpc.

Neither @JessicaMeixner-NOAA, @aliabdolali, nor I have access to Cheyenne, so for us, no.

DeniseWorthen commented 1 year ago

@jkbk2004 As long as it compiles and runs, I have no preferences.

JessicaMeixner-NOAA commented 1 year ago

I will note that we've been investigating which build flags to use, since they might be contributing to scaling problems we are seeing with SCOTCH. See a github issue here: https://github.com/NOAA-EMC/WW3/issues/879

Also I have a ticket open with the orion helpdesk to update the intel/gnu issue so that you do not have to modify the gnu module file to download.

Also, I don't know whether spack-stack already has the 7.0.3 version.

ulmononian commented 1 year ago

@JessicaMeixner-NOAA @aliabdolali @DeniseWorthen thanks for the updates and information.

i've tried installing 7.0.1 on cheyenne (again have to use a gnu module file hack) but it is having issues in the make step with parsing using bison. i'm not sure what the min. required version is, but cheyenne only has 2.7. i tried installing the newest bison but that was not successful.

i will try 7.0.3 using the cmake system...

ulmononian commented 1 year ago

@JessicaMeixner-NOAA i will follow up with the scotch spack issue and see where it stands. we may be able to get 7.0.3 in the preferred versions list at least in the emc fork for now. will report back.

JessicaMeixner-NOAA commented 1 year ago

I just looked at the SCOTCH library and found this for bison/flex: https://gitlab.inria.fr/scotch/scotch/-/blob/master/INSTALL.txt#L13-20

I'm not sure what "most recent" really is. I have an issue open on the scotch repository asking about the required version of gnu (for the header) and will add a request to find out about flex and bison minimum versions as well.

JessicaMeixner-NOAA commented 1 year ago

I know for orion we needed to turn off pthreads to run successfully (-DTHREADS="OFF"), and turning off MPI multiple threads (-DMPI_THREAD_MULTIPLE="OFF") was also helpful. For now this seems unrelated to your issues, but I thought I'd mention it all the same.
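
For reference, here is how the two options could be added to the configure line used earlier in this thread. This is a sketch that only prints the command rather than running cmake; the option names are the ones given above, while the function name scotch_configure_cmd and the example prefix are made up for illustration.

```shell
# scotch_configure_cmd PREFIX: print a cmake configure command with both
# threading options turned off. The function name is illustrative only.
scotch_configure_cmd() {
  prefix="$1"
  printf 'cmake -DCMAKE_Fortran_COMPILER=ifort -DCMAKE_C_COMPILER=icc '
  printf -- '-DCMAKE_INSTALL_PREFIX=%s ' "$prefix"
  printf -- '-DTHREADS=OFF -DMPI_THREAD_MULTIPLE=OFF '
  printf -- '-DCMAKE_BUILD_TYPE=Release ..\n'
}

# Hypothetical install prefix, for illustration only.
scotch_configure_cmd "$HOME/scotch/install"
```

The printed line is meant to be run from inside the build directory, as in the earlier instructions.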

JessicaMeixner-NOAA commented 1 year ago

Here's my scotch issue asking for minimum versions: https://gitlab.inria.fr/scotch/scotch/-/issues/21

ulmononian commented 1 year ago

thanks for those tips and also for inquiring about the minimum versions w/ the scotch team, @JessicaMeixner-NOAA.

some updates from my side on cheyenne: i was able to successfully install scotch@7.0.3 here /glade/scratch/bcameron/scotch/install by first doing:

  1. installing bison@3.8.2 in my user-space (as the cheyenne system bison@2.7 is too old -- learned by trial-and-error).
  2. following the gnu modulefile tweak for orion described by @aliabdolali, but adapted to cheyenne's gnu@10.1.0

@DeniseWorthen are you going to be testing scotch on cheyenne? if so, i will add it to the intel/2022.1 hpc-stack there.

given that each machine so far has required some ad-hoc approaches for installing scotch, i am hopeful that it can be supported more smoothly using spack and the spack-stack. i will be posting updates on that effort here NOAA-EMC/spack-stack issue #465.

ulmononian commented 1 year ago

FWIW, scotch@7.0.3 is now installed as part of cheyenne's intel/2022.1 hpc-stack. load with:

module use /glade/work/epicufsrt/contrib/hpc-stack/intel2022.1/modulefiles/stack/
module load hpc/1.2.0
module load hpc-intel/2022.1
module load hpc-mpt/2.25
module load scotch/7.0.3

DeniseWorthen commented 1 year ago

@ulmononian Yes, I hope to be able to test on cheyenne, where the queue waits and job turnaround are much better for debugging purposes.

DeniseWorthen commented 1 year ago

@ulmononian Thanks for the scotch install. It will be a few days before I can give it a road test. I'll let you know if I have issues.

JessicaMeixner-NOAA commented 1 year ago

@ulmononian I have heard from the SCOTCH developers and the minimum version of bison needed is 3.4 --- see response here: https://gitlab.inria.fr/scotch/scotch/-/issues/21
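
A quick way to compare the system bison against that minimum is sketched below. It assumes the usual first line of 'bison --version' output ("bison (GNU Bison) X.Y.Z") and a 'sort' that supports -V; the helper names are made up for illustration.

```shell
# version_ge A B: succeeds when version A >= version B (needs 'sort -V').
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# bison_version_from_banner: read 'bison --version' output on stdin and print
# the version number, assuming the first line ends with the version.
bison_version_from_banner() {
  awk 'NR == 1 { print $NF }'
}

min_bison="3.4"   # minimum reported by the SCOTCH developers in this thread

if command -v bison >/dev/null 2>&1; then
  found="$(bison --version | bison_version_from_banner)"
  if version_ge "$found" "$min_bison"; then
    echo "bison $found meets the minimum $min_bison"
  else
    echo "bison $found is older than $min_bison"
  fi
else
  echo "no bison found in PATH"
fi
```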

There is also a response about the minimum version of gnu --- which can be found here: https://gitlab.inria.fr/scotch/scotch/-/issues/19#note_808510 and relevant parts are repeated here:

To my understanding, this is not an issue of gnu or intel compiler versions per se, but of which C standard version they are referring to when processing the source code files.

Indeed, PRIu64 and its likes were inserted in C99 : https://en.cppreference.com/w/c/types/integer

Hence, in order to compile Scotch without errors, one has to make sure the compiler accepts the C99 standard, e.g., by using the "-std=c99" flag in gcc. Include files should already have the proper information, but what matters is to provide the adequate #define's to make this information accessible to the compilers.
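
To check whether a given compiler handles PRIu64 under -std=c99, a small probe like the one below can be used. It is a sketch under assumptions: the function name check_c99 and the default compiler name cc are made up for illustration, and the test program is not taken from the Scotch sources.

```shell
# check_c99 [COMPILER]: compile and run a tiny program that uses PRIu64
# with -std=c99. The function name and the default 'cc' are assumptions.
check_c99() {
  compiler="${1:-cc}"
  workdir="$(mktemp -d)" || return 1
  cat > "$workdir/c99check.c" <<'EOF'
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t value = 42;
    printf("%" PRIu64 "\n", value);
    return 0;
}
EOF
  "$compiler" -std=c99 -o "$workdir/c99check" "$workdir/c99check.c" \
    && "$workdir/c99check"
  result=$?
  rm -rf "$workdir"
  return "$result"
}

# Example:
#   check_c99 gcc
#   check_c99 icc
```

If the probe passes only with the flag, one possible way to hand it to a CMake build is -DCMAKE_C_FLAGS=-std=c99, though whether Scotch's build needs that on a given machine is not established here.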

JessicaMeixner-NOAA commented 1 year ago

On orion, a new module has been created so that we can use a later gcc with intel. To load it:

module load contrib/0.1
module load noaa-gcc/10.2.0

ulmononian commented 1 year ago

this is very helpful. i wonder if the same thing could happen on cheyenne. @DeniseWorthen did you have time to test ww3 against the scotch install there?

DeniseWorthen commented 1 year ago

@ulmononian I'm sorry, I have not had a chance to test this yet.

DeniseWorthen commented 1 year ago

@ulmononian I did build and run on cheyenne using the scotch lib using my test setup. All I did was add

module load scotch/7.0.3

and then my usual compile.sh command. I'm using intel.

MatthewMasarik-NOAA commented 1 year ago

Hi, I wanted to report a seg fault when testing the hpc-stack module scotch/7.0.3.

I'm testing using the WW3 standalone regression tests. I've found that I can build WW3 successfully, but when I try to run it I get a seg fault during model initialization.

After doing the module loads, here is the output of module list:

Currently Loaded Modules:                                                                                                                                            
  1) cmake/3.20.1         7) libpng/1.6.37  13) g2/3.4.5                                                                                                             
  2) hpc/1.2.0            8) zlib/1.2.11    14) w3emc/2.9.2                                                                                                          
  3) intel/2022.1.2       9) jasper/2.0.25  15) esmf/8.3.0b09                                                                                                        
  4) hpc-intel/2022.1.2  10) hdf5/1.10.6    16) scotch/7.0.3                                                                                                         
  5) impi/2022.1.2       11) netcdf/4.7.4                                                                                                                            
  6) hpc-impi/2022.1.2   12) bacio/2.4.1 

I have set the required environment variable SCOTCH_PATH as

export SCOTCH_PATH=/scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/scotch/7.0.3

And here is the part of the log file where the model initializes and then crashes:

       Type 4 : Restart files                                                                                                                                        
      -----------------------------------------                                                                                                                      
            From     : 2015/12/14 02:00:00 UTC                                                                                                                       
            To       : 2015/12/15 00:00:00 UTC                                                                                                                       
            Interval :            01:00:00                                                                                                                           

            output dates out of run dates : Track point output deactivated                                                                                           
            output dates out of run dates : Nesting data deactivated                                                                                                 
            output dates out of run dates : Partitioned wave field data deactivated                                                                                  
            output dates out of run dates : Restart files second request deactivated                                                                                 
       Wave model ...                                                                                                                                                
[h22c29:135973:0:136050] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1c)                                                         
Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_recv.h at line 1257: FALSE                                                       
[h22c29:135978:0:135978] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x90)                                                         
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x2baa13a7dbcc]                                                                         
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x2baa13457df1]                                                                           
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x8febe9) [0x2baa13773be9]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x8fe3a9) [0x2baa137733a9]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x21ff55) [0x2baa13094f55]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x21fa56) [0x2baa13094a56]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x898e8e) [0x2baa1370de8e]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x406dea) [0x2baa1327bdea]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0xf690d) [0x2baa12f6b90d]                                                                                        
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x19e75c) [0x2baa1301375c]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x1717ec) [0x2baa12fe67ec]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b37f5) [0x2baa131287f5]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPI_Allgather+0x706) [0x2baa12f6d206]                                                                             
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/code_mgmt/work/scotch_testing/ww3-hpc-stack/regtests/ww3_tp2.17/work_b/exe/ww3_shel() [0x6fd4a7]                     
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/code_mgmt/work/scotch_testing/ww3-hpc-stack/regtests/ww3_tp2.17/work_b/exe/ww3_shel() [0x6fdad1]                     
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/code_mgmt/work/scotch_testing/ww3-hpc-stack/regtests/ww3_tp2.17/work_b/exe/ww3_shel() [0x6ee987]                     
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/code_mgmt/work/scotch_testing/ww3-hpc-stack/regtests/ww3_tp2.17/work_b/exe/ww3_shel() [0x7041e3]                     
/lib64/libpthread.so.0(+0x7ea5) [0x2baa148b9ea5]                                                                                                                     
/lib64/libc.so.6(clone+0x6d) [0x2baa14bccb0d]                                                                                                                        
Abort(1) on node 18: Internal error                                                                                                                                  
[h22c29:135977:0:135977] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2baaf7442288)                                               
srun: error: h22c29: task 18: Exited with exit code 1                                                                                                                
srun: launch/slurm: _step_signal: Terminating StepId=44032152.2                                                                                                      
slurmstepd: error: *** STEP 44032152.2 ON h22c29 CANCELLED AT 2023-04-20T20:45:36 ***                                                                                
forrtl: error (78): process killed (SIGTERM)

I can confirm that, with the WW3 develop branch, the same regtests run successfully when they use

export SCOTCH_PATH=/scratch1/NCEPDEV/climate/Matthew.Masarik/waves/opt/hpc-stack/scotch-v7.0.3/install

DeniseWorthen commented 1 year ago

Is this Intel or GNU? I also found a problem that only showed up when I tried GNU; it isn't related to SCOTCH. All my development work was done w/ Intel debug. This may or may not be the same issue, but this is the fix:

diff --git a/model/src/w3iorsmd.F90 b/model/src/w3iorsmd.F90
index 8d720413..f08e3f5a 100644
--- a/model/src/w3iorsmd.F90
+++ b/model/src/w3iorsmd.F90
@@ -880,10 +880,12 @@ CONTAINS
         IF ( IAPROC .EQ. NAPRST ) THEN
           !
 #ifdef W3_MPI
-          ALLOCATE ( STAT2(MPI_STATUS_SIZE,NRQRS) )
-          CALL MPI_WAITALL                               &
-               ( NRQRS, IRQRS , STAT2, IERR_MPI )
-          DEALLOCATE ( STAT2 )
+          if(associated(irqrs)) then
+            ALLOCATE ( STAT2(MPI_STATUS_SIZE,NRQRS) )
+            CALL MPI_WAITALL                               &
+                 ( NRQRS, IRQRS , STAT2, IERR_MPI )
+            DEALLOCATE ( STAT2 )
+          end if
 #endif

MatthewMasarik-NOAA commented 1 year ago

This is for hera/intel. I should have mentioned that.

That is very interesting. I don't know how it fits into the puzzle right now, because I can run the same code with a different SCOTCH install.

ulmononian commented 1 year ago

Hi, I wanted to report a seg fault when testing the hpc-stack module scotch/7.0.3.

I'm testing using the WW3 standalone regression tests. I've found that I can build WW3 successfully, but when I try to run it I get a seg fault during model initialization.

After doing the module loads, here the output of module list

Currently Loaded Modules:                                                                                                                                            
  1) cmake/3.20.1         7) libpng/1.6.37  13) g2/3.4.5                                                                                                             
  2) hpc/1.2.0            8) zlib/1.2.11    14) w3emc/2.9.2                                                                                                          
  3) intel/2022.1.2       9) jasper/2.0.25  15) esmf/8.3.0b09                                                                                                        
  4) hpc-intel/2022.1.2  10) hdf5/1.10.6    16) scotch/7.0.3                                                                                                         
  5) impi/2022.1.2       11) netcdf/4.7.4                                                                                                                            
  6) hpc-impi/2022.1.2   12) bacio/2.4.1 

I have set the needed environment parameter SCOTCH_PATH as

export SCOTCH_PATH=/scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/scotch/7.0.3

And here is the part of the log file when the model is initializing, then crashes

       Type 4 : Restart files                                                                                                                                        
      -----------------------------------------                                                                                                                      
            From     : 2015/12/14 02:00:00 UTC                                                                                                                       
            To       : 2015/12/15 00:00:00 UTC                                                                                                                       
            Interval :            01:00:00                                                                                                                           

            output dates out of run dates : Track point output deactivated                                                                                           
            output dates out of run dates : Nesting data deactivated                                                                                                 
            output dates out of run dates : Partitioned wave field data deactivated                                                                                  
            output dates out of run dates : Restart files second request deactivated                                                                                 
       Wave model ...                                                                                                                                                
[h22c29:135973:0:136050] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1c)                                                         
Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_recv.h at line 1257: FALSE                                                       
[h22c29:135978:0:135978] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x90)                                                         
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x2baa13a7dbcc]                                                                         
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x2baa13457df1]                                                                           
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x8febe9) [0x2baa13773be9]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x8fe3a9) [0x2baa137733a9]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x21ff55) [0x2baa13094f55]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x21fa56) [0x2baa13094a56]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x898e8e) [0x2baa1370de8e]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x406dea) [0x2baa1327bdea]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0xf690d) [0x2baa12f6b90d]                                                                                        
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x19e75c) [0x2baa1301375c]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x1717ec) [0x2baa12fe67ec]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b37f5) [0x2baa131287f5]                                                                                       
/apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPI_Allgather+0x706) [0x2baa12f6d206]                                                                             
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/code_mgmt/work/scotch_testing/ww3-hpc-stack/regtests/ww3_tp2.17/work_b/exe/ww3_shel() [0x6fd4a7]                     
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/code_mgmt/work/scotch_testing/ww3-hpc-stack/regtests/ww3_tp2.17/work_b/exe/ww3_shel() [0x6fdad1]                     
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/code_mgmt/work/scotch_testing/ww3-hpc-stack/regtests/ww3_tp2.17/work_b/exe/ww3_shel() [0x6ee987]                     
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/code_mgmt/work/scotch_testing/ww3-hpc-stack/regtests/ww3_tp2.17/work_b/exe/ww3_shel() [0x7041e3]                     
/lib64/libpthread.so.0(+0x7ea5) [0x2baa148b9ea5]                                                                                                                     
/lib64/libc.so.6(clone+0x6d) [0x2baa14bccb0d]                                                                                                                        
Abort(1) on node 18: Internal error                                                                                                                                  
[h22c29:135977:0:135977] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2baaf7442288)                                               
srun: error: h22c29: task 18: Exited with exit code 1                                                                                                                
srun: launch/slurm: _step_signal: Terminating StepId=44032152.2                                                                                                      
slurmstepd: error: *** STEP 44032152.2 ON h22c29 CANCELLED AT 2023-04-20T20:45:36 ***                                                                                
forrtl: error (78): process killed (SIGTERM)

I can confirm that the same regtests are successful when running the WW3 develop branch, which uses

export SCOTCH_PATH=/scratch1/NCEPDEV/climate/Matthew.Masarik/waves/opt/hpc-stack/scotch-v7.0.3/install

thanks for this information. can i ask how you built 7.0.3 on hera? perhaps i need to modify the build script in some way.

MatthewMasarik-NOAA commented 1 year ago

@ulmononian, yes certainly. I'll get back early Fri with instructions.

MatthewMasarik-NOAA commented 1 year ago

@ulmononian, there isn't much new here, but for completeness this is how I built SCOTCH on hera.

git clone https://gitlab.inria.fr/scotch/scotch.git
cd scotch
git checkout v7.0.3

module purge
module load cmake/3.20.1
module load intel/2022.1.2
module load impi/2022.1.2
module use  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
module load gnu/9.2.0

mkdir build && cd build
cmake -DCMAKE_Fortran_COMPILER=ifort            \
      -DCMAKE_C_COMPILER=icc                    \
      -DCMAKE_INSTALL_PREFIX=<path-to>/install  \
      -DCMAKE_BUILD_TYPE=Release                \
      -DTHREADS=OFF ..                          |& tee cmake.out
make  VERBOSE=1                                 |& tee make.out
make  install

ulmononian commented 1 year ago

@MatthewMasarik-NOAA thanks for those build instructions. based on your use of -DTHREADS="OFF" here (on hera), and @JessicaMeixner-NOAA's note that using both -DTHREADS="OFF" and -DMPI_THREAD_MULTIPLE="OFF" was done on orion, i want to confirm here what cmake flags the ww3 group needs for the build on all machines. i did not use either of these threading flags for the cheyenne build that @DeniseWorthen tested; however, i am not sure if the WW3 standalone RTs or WM rts were part of that testing. simply put, if we could just confirm which flags are required, the installations can be corrected & then put on other machines.

on a related note, @natalie-perlin will be taking over the scotch work for the hpc-stack from this point forward. i'm happy to help where i can and will stay tuned to the discussion, but i will be focusing on the spack-stack scotch installations. the pr to add scotch to spack-stack is here: https://github.com/NOAA-EMC/spack-stack/pull/550. it would be great if anyone from ww3 could test the spack-stack UE w/ scotch on hera or orion. feel free to contact me or comment over at that PR if interested.

MatthewMasarik-NOAA commented 1 year ago

Hi @ulmononian, @natalie-perlin, please use the following cmake flags to build SCOTCH for WW3 use on RDHPCS machines (hera, orion). cheyenne should be similar, though I don't have access to that machine to confirm.

      -DCMAKE_Fortran_COMPILER=mpiifort
      -DCMAKE_C_COMPILER=mpiicc
      -DCMAKE_CXX_COMPILER=mpiicc
      -DCMAKE_BUILD_TYPE=Release
      -DTHREADS=OFF
      -DMPI_THREAD_MULTIPLE=OFF
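
For reference, the flags above can be assembled into a single cmake invocation. This is a minimal sketch rather than the hpc-stack build script: the function name scotch_cmake_args and the SCOTCH_PREFIX variable are hypothetical, the flags are copied exactly as listed above, and the command is only printed (a dry run), not executed.

```shell
# scotch_cmake_args PREFIX: prints one cmake flag per line.
# The six flags are copied verbatim from the list above; the install
# prefix is added in the same way as in the earlier build instructions.
scotch_cmake_args() {
    printf '%s\n' \
        "-DCMAKE_Fortran_COMPILER=mpiifort" \
        "-DCMAKE_C_COMPILER=mpiicc" \
        "-DCMAKE_CXX_COMPILER=mpiicc" \
        "-DCMAKE_BUILD_TYPE=Release" \
        "-DTHREADS=OFF" \
        "-DMPI_THREAD_MULTIPLE=OFF" \
        "-DCMAKE_INSTALL_PREFIX=$1"
}

# Dry run: show the command that would be issued from the build directory.
# SCOTCH_PREFIX is a hypothetical variable; it defaults to ./install here.
echo "cmake $(scotch_cmake_args "${SCOTCH_PREFIX:-$PWD/install}" | tr '\n' ' ').."
```

Printing the command first makes it easy to compare against the configuration recorded in the build logs before anything is compiled.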

Also, I have been testing the spack-stack SCOTCH install on hera. I post my current status on that at noaa-emc/spack-stack/pull/550

natalie-perlin commented 1 year ago

@MatthewMasarik-NOAA - Could you please give a little more information on the environment you have when building the scotch library? In particular, output from the following queries, where each line below is a separate query:

module list
which mpiifort
mpiifort -show
which mpiicc
mpiicc -show
which gcc
which gxx   # or  which g++
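
If it is easier, the queries above could be collected in one pass. The following is a rough sketch: the function name report_env is made up for illustration, and it uses `which g++` for the last query. Commands that are missing are reported instead of stopping the loop.

```shell
# report_env: runs each query and labels its output with a "### " header.
report_env() {
    for query in "module list" "which mpiifort" "mpiifort -show" \
                 "which mpiicc" "mpiicc -show" "which gcc" "which g++"; do
        echo "### $query"
        # 'module' is usually a shell function, so evaluate in the current shell;
        # stderr is merged because 'module list' typically writes there.
        eval "$query" 2>&1 || echo "(query failed or command not found)"
        echo
    done
}

report_env
```

Redirecting the output, for example `report_env > scotch_build_env.txt`, gives a single file that can be pasted into the issue.
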
MatthewMasarik-NOAA commented 1 year ago

Hi @natalie-perlin, sure thing.

module list

Currently Loaded Modules:
  1) cmake/3.20.1     4) hpc-intel/2022.1.2   7) gnu/9.2.0
  2) hpc/1.2.0        5) impi/2022.1.2
  3) intel/2022.1.2   6) hpc-impi/2022.1.2

which mpiifort
/apps/oneapi/mpi/2021.5.1/bin/mpiifort

mpiifort -show
ifort -I"/apps/oneapi/mpi/2021.5.1//include" -I"/apps/oneapi/mpi/2021.5.1/include" -L"/apps/oneapi/mpi/2021.5.1/lib/release" -L"/apps/oneapi/mpi/2021.5.1/lib" -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker "/apps/oneapi/mpi/2021.5.1/lib/release" -Xlinker -rpath -Xlinker "/apps/oneapi/mpi/2021.5.1/lib" -lmpifort -lmpi -ldl -lrt -lpthread

which mpiicc
/apps/oneapi/mpi/2021.5.1/bin/mpiicc

mpiicc -show
icc -I"/apps/oneapi/mpi/2021.5.1/include" -L"/apps/oneapi/mpi/2021.5.1/lib/release" -L"/apps/oneapi/mpi/2021.5.1/lib" -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker "/apps/oneapi/mpi/2021.5.1/lib/release" -Xlinker -rpath -Xlinker "/apps/oneapi/mpi/2021.5.1/lib" -lmpifort -lmpi -ldl -lrt -lpthread

which gcc
/apps/gnu/gcc-9.2.0/bin/gcc

which gxx   # or  which g++
/apps/gnu/gcc-9.2.0/bin/g++
natalie-perlin commented 1 year ago

@MatthewMasarik-NOAA @JessicaMeixner-NOAA - Updated/installed scotch 7.0.3 on Hera and Orion, with intel/2022.1.2 compilers. Please test these installations.

This is the configuration from the actual log files:


cmake VERBOSE=1 -DCMAKE_Fortran_COMPILER=mpiifort  
-DCMAKE_C_COMPILER=mpiicc -DMAKE_CXX_COMPILER=mpiicpc 
-DCMAKE_INSTALL_PREFIX=/work/noaa/epic-ps/role-epic-ps/hpc-stack/libs/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/scotch/7.0.3 -DCMAKE_BUILD_TYPE=Release -DTHREADS=OFF -DMPI_THREAD_MULTIPLE=OFF ..

A few additional notes: (FYI, @jkbk2004, @ulmononian)

1) Bison version on Hera and Orion is 3.0.4. The minimum version, as @JessicaMeixner-NOAA pointed out in https://gitlab.inria.fr/scotch/scotch/-/issues/21, is 3.4. If bison/3.4 is a prerequisite, it may need to be added to the hpc-stack-built packages and become part of the software modules under the stack.

2) The way scotch is built now in the intel-2022.1.2 stacks, it requires two different compiler modules that are generally supposed to conflict with one another: you are supposed to use either the gnu/gcc compiler suite or intel. On Hera and Orion this was achieved by modifying the gnu compiler modulefiles so that the gnu compilers are no longer labeled as family "compiler"; the Intel module is then not aware that another compiler suite is loaded and does not flag a conflict. This approach was a quick fix and a workaround, but it does not seem ideal for further installations on other systems, nor does it offer a clear way for a general user to build the scotch library as part of hpc-stack. The hpc-stack was meant to have one designated compiler suite and one mpich/mpi library, so that the user only needs to focus on one compiler + one mpi working properly. When attempting to build without gnu, the error that pops up seems to be related to formatting statements in the common_integer.c routine. So I wonder if the quicker and simpler solution would be to find a format fix that works with both gnu (if compiled with the gnu suite) and intel compilers, as opposed to requiring two compiler suites and a workaround to prevent the loading conflict.

       Please let me know your thoughts or comments!
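
For the Bison point in note 1, a check along these lines could be run on each machine before building. It is a sketch under assumptions: version_ge is a hypothetical helper, it relies on `sort -V` (available in GNU coreutils), and it assumes the version is the last field on the first line of `bison --version`.

```shell
# version_ge A B: succeeds when version A is greater than or equal to B.
# Relies on version-aware sorting from `sort -V`.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="3.4"
# Assumed format of the first line: "bison (GNU Bison) X.Y.Z"
found=$(bison --version 2>/dev/null | head -n 1 | awk '{print $NF}')

if [ -z "$found" ]; then
    echo "bison not found on PATH"
elif version_ge "$found" "$required"; then
    echo "bison $found satisfies the $required minimum"
else
    echo "bison $found is older than the $required minimum"
fi
```

With the 3.0.4 reported for Hera and Orion, the comparison takes the last branch, since 3.0.4 sorts before 3.4.
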
MatthewMasarik-NOAA commented 1 year ago

@natalie-perlin Sure thing, I'll start testing these.

MatthewMasarik-NOAA commented 1 year ago

@natalie-perlin, for orion, there is a typo (ahelp vs. help) in the scotch lua file that prevents it from loading:

/work/noaa/epic-ps/role-epic-ps/hpc-stack/libs/intel-2022.1.2/modulefiles/mpi/intel/2022.1.2/impi/2022.1.2/scotch/7.0.3.lua

Thank you for the finding! Fixed both in the actual modulefile for scotch and in the template.

No such issue on Hera, checked.

MatthewMasarik-NOAA commented 1 year ago

Great, thank you for fixing those. I'll test it out

MatthewMasarik-NOAA commented 1 year ago

@natalie-perlin these new installs on hera and orion both pass standalone WW3 testing.

MatthewMasarik-NOAA commented 1 year ago

Hi @natalie-perlin. I have some good news: SCOTCH v7.0.4 was just released, containing bug fixes for the scaling issue we saw as well as for an issue exposed when using gnu + openmpi, found by @AlexanderRichert-NOAA. Please find this new version here: https://gitlab.inria.fr/scotch/scotch/-/releases/v7.0.4

We are looking forward to having this updated version installed. Is there anything else I can provide to start the installations?

MatthewMasarik-NOAA commented 1 year ago

@natalie-perlin I wanted to amend my comment from yesterday, and see if we could pause this process for a few days?

With the new version released just yesterday I need some time to produce new build instructions. We have been using the safest SCOTCH compile options while debugging, though now we may be able to get some performance gains using different SCOTCH options. I'm testing different builds now, then I can pass you the complete instructions we decide on. My intention is to have those instructions ready by Monday.

MatthewMasarik-NOAA commented 1 year ago

FYI @natalie-perlin: for the SCOTCH v7.0.4 installs I created a new Install issue. Since this issue was for the initial Package Addition, and scotch/7.0.3 has been added to hpc-stack and installed on both orion and hera, this may be complete?

I've posted the install request for scotch/7.0.4 at #526.

natalie-perlin commented 1 year ago

Hi @MatthewMasarik-NOAA - yes, it has been added and tested successfully on Orion, Hera (and other current hpc-stack locations). I hope the issue can be closed now.

ulmononian commented 1 year ago

just fyi @MatthewMasarik-NOAA @natalie-perlin: scotch 7.0.4 will be included with spack-stack 1.5.0.

MatthewMasarik-NOAA commented 1 year ago

@natalie-perlin Yes, from my perspective this can be closed now.

MatthewMasarik-NOAA commented 1 year ago

@ulmononian Awesome! That's great news. I will follow up on the corresponding spack-stack install issue on Monday.