DOI-USGS / COAWST

COAWST modeling system git repository

Error message of PMPI_Send, while running WRF+WW3+ROMS #269

Closed SharknadoBear closed 2 weeks ago

SharknadoBear commented 3 weeks ago

Hi John:

I am running a 3-way coupled configuration of COAWST 3.8 (WRF+WW3+ROMS) with SCRIP grid interpolation. I can run WW3+ROMS, or WRF alone, successfully, but I hit the following fatal error during the 3-way coupling:

MPICH ERROR [Rank 2] [job id 4271175.0] [Mon Jun 10 09:58:18 2024] [x1007c0s0b0n1] - Abort(134864900) (rank 2 in comm 0): Fatal error in PMPI_Send: Invalid tag, error stack:
PMPI_Send(163): MPI_Send(buf=0x7fffa128f578, count=1, MPI_INTEGER, dest=0, tag=1702126957, MPI_COMM_WORLD) failed
PMPI_Send(101): Invalid tag, value is 1702126957

slurmstepd: error: *** STEP 4271175.0 ON x1007c0s0b0n1 CANCELLED AT 2024-06-10T09:58:18 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: x1007c0s0b0n1: tasks 0-2: Killed

Do you have any idea what might cause such an error? I have enclosed my COAWST output and COAWST error files. The test here uses 1 core per model; I also tested with more cores and more memory on the HPC I use, and the same error occurs.

Also, I noticed two odd things in the COAWST.output file. First, in the first few lines it reports that it cannot open the wrf input file, even though it does open namelist.input and writes out 'namelist.output'. Second, the coupling step statement for ATMgrid vs. OCNgrid or ATMgrid vs. WAVgrid is not printed. See below:

Coupled Input File name = coupling_WFIP3.in

 READ MODEL INPUTS - Unable to open wrf input file.

 READ MODEL INPUTS - Unable to open wrf input file.

 READ MODEL INPUTS - Unable to open wrf input file.

 MyRank =            0  

 Coupled Input File name = coupling_WFIP3.in

 Model Coupling: 

       Ocean Model MPI nodes: 00000 - 00000

       Waves Model MPI nodes: 00001 - 00001

       Atmos Model MPI nodes: 00002 - 00002

       WAVgrid 01 dt=  40.0 -to- OCNgrid 01 dt=   7.5, CplInt:   600.0 Steps: 00015

       OCNgrid 01 dt=   7.5 -to- WAVgrid 01 dt=  40.0, CplInt:   600.0 Steps: 00080
--------------------------------------------------------------------------------
 Model Input Parameters:  ROMS/TOMS version 4.1  
                          Monday - June 10, 2024 -  9:57:55 AM
--------------------------------------------------------------------------------

Maybe it is a problem with my input files? I have also included my namelist.input/ocean.in/coupling.in/ww3_shel.inp/ww3_grid.inp and my compiling header file for your reference.

COAWST.output.txt COAWST.ERROR.txt ww3_grid.inp.txt ww3_shel.inp.txt coupling_WFIP3.in.txt namelist.input.txt ocean_WFIP3.in.txt wfip3.h.txt

Thank you for your help!

jcwarner-usgs commented 3 weeks ago

something strange is going on. are you having trouble with disk space?
If it sees the Atmos Model MPI nodes, then it should write out the ATMgrid stuff. Can you recompile and make sure you are pointing to the correct coawstM? -j

SharknadoBear commented 3 weeks ago

No, I am quite sure I am not having trouble with disk space, because I have other ongoing runs dumping to the same project folder without any trouble.

I don't think I am pointing to the wrong coawstM... but good suggestion. I now suspect there is a compilation problem, because on our HPC I could not get a distributed-memory WRF-only run to work (although a shared-memory WRF-only run does work). The 3-way coupled coawstM is distributed memory; I selected option 15 for it, and some other options such as 66 also compile on our machine, but those coawstM executables produce the same error.

I will recompile coawstM in a moment, and this time I will save the compile log for you; maybe that will be helpful as well.

jcwarner-usgs commented 3 weeks ago

you need to use the DIST = MPI WRF settings. don't use Shared.

SharknadoBear commented 3 weeks ago

Here is my compile record; I used the dmpar option for WRF.

compile.txt

SharknadoBear commented 3 weeks ago

Now I want to try compiling WRF with mpiifort (it seems to work better in our HPC environment; as you can see above, ROMS and WW3 are compiled with mpiifort, but WRF is not), but I don't know how to change that. Is it in /WRF/configure? Could you remind me how to change it for WRF?

jcwarner-usgs commented 3 weeks ago

i think WRF/arch/configure is a place to modify the individual builds.

SharknadoBear commented 2 weeks ago

Hi John:

I think I figured it out. It is an ATM_name read-in problem: Aname is not read in properly, so WRF is never started at all. For some reason, my build cannot recognize the comment lines properly in the 'coupling.in' file. So if I input it like this:

! Enter names of Atm, Wav, and Ocn input files.
! The Wav program needs multiple input files, one for each grid.

   ATM_name = namelist.input                                                                  ! atmospheric model
!  WAV_name = Projects/Sandy/swan_WFIP3.in \
!             Projects/Sandy/swan_WFIP3_ref3.in                                               ! wave model
   WAV_name = ww3_grid.inp
   OCN_name = /projects/owrs/yhuang168/COAWST/COAWST-main/Projects/WFIP3/CURRENT_WAVE/ocean_WFIP3_JUN.in ! ocean model
   HYD_name = hydro.namelist                                                                  ! hydro model

COAWST will think that ATM_name is commented out.

I need to remove the commented-out WAV_name lines and the other trailing comments, and do this:

! Enter names of Atm, Wav, and Ocn input files.
! The Wav program needs multiple input files, one for each grid.

   ATM_name = namelist.input
   WAV_name = ww3_grid.inp
   OCN_name = ocean_WFIP3.in
   HYD_name = hydro.namelist

! Sparse matrix interpolation weights files. You have 2 options:
! Enter "1" for option 1, or "2" for option 2, and then list the
! weight file(s) for that option.

Then WRF can start properly.

Again, thank you for your help! I think this problem can be closed.

jcwarner-usgs commented 2 weeks ago

Wow! that is strange. have not seen that one yet. just make sure you have no Tabs. that can throw the line decoder off. Glad you figured it out. i am going to close the issue. -j