NOAA-EMC / WW3

WAVEWATCH III

ww3_multi hangs when creating restart file with IOSTYP=2 or =3 #290

Closed mickaelaccensi closed 3 years ago

mickaelaccensi commented 3 years ago

The bug appears with IOSTYP=2 or IOSTYP=3; it works fine with IOSTYP=1.

Tracking where it keeps waiting, some processors seem to be stuck in w3wavemd at CALL MPI_WAITALL, which is reached because NRQSG2 is positive,

and the dedicated output processor is stuck in w3iorsmd at:

!/MPI                        CALL MPI_WAITALL                         &
!/MPI                           ( 1, IRQRSS(IB), STAT1, IERR_MPI )

I'll look for a regtest that reproduces the bug.

mickaelaccensi commented 3 years ago

This happens only when all of the following hold: IOSTYP=2 or =3, PSHARE = T, and an output restart file is requested.

To reproduce it with a regtest, add a restart output in mww3_test_04/input/ww3_multi_grdset_d.nml:

&OUTPUT_DATE_NML
  ALLDATE%FIELD          = '19680606 000000' '1200' '19680608 000000'
  ALLDATE%POINT          = '19680606 000000' '3600' '19680608 000000'
  ALLDATE%RESTART        = '19680606 020000' '3600' '19680606 020000'
/

Then run the regtest:

./bin/run_test -o both -N -f -S -T -c datarmor_intel_debug -s PR1_MPI -w work_PR1_MPI_d -m grdset_d -f -p $MPI_LAUNCH -n 28 ../model mww3_test_04

It never finishes.

When you kill it, the traceback shows where it is blocked:
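Because the run never returns, it can help to put a wall-clock limit on it while reproducing the problem. The following is only a minimal sketch, assuming a bash shell and GNU coreutils `timeout`; `run_with_limit` is a hypothetical helper name and is not part of the WW3 scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of WW3): run a command under a wall-clock
# limit and flag it as hung if the limit expires.
# Assumes GNU coreutils `timeout`, which exits with status 124 on expiry.
run_with_limit() {
  local limit="$1"   # limit in seconds
  shift
  local rc=0
  timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "HANG: no exit within ${limit}s: $*" >&2
  fi
  return "$rc"
}
```

To use it, prefix the regtest command with `run_with_limit` and a limit in seconds, keeping the rest of the command unchanged; a suitable limit depends on how long the test normally takes on the machine. Depending on the MPI launcher, leftover ranks may still need to be killed by hand after the limit expires.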

ww3_multi          00000000012BC688  w3iorsmd_mp_w3ior         542  w3iorsmd.F90
ww3_multi          0000000000DAA809  w3wavemd_mp_w3wav        1413  w3wavemd.F90
ww3_multi          00000000008BA7CD  wmwavemd_mp_wmwav         871  wmwavemd.F90
ww3_multi          0000000000405363  MAIN__                    150  ww3_multi.F90

Line 542 in w3iorsmd.F90 is the MPI_WAITALL call:

                    IF ( IAPROC .EQ. NAPRST ) THEN
!
                        IH     = 1 + NRQ * (IB-1)
                        CALL MPI_WAITALL                         &
                           ( NRQ, IRQRSS(IH), STAT1, IERR_MPI )

The bug was introduced by commit e756361.

@ukmo-ccbunney, @ukmo-juan-castillo, @ukmo-ansaulter, could you fix this bug?

mickaelaccensi commented 3 years ago

Hi guys,

Could you please look at this bug? I'm not able to upgrade my forecast system to the latest version of ww3 because of it. Thanks.

ukmo-ccbunney commented 3 years ago

Hi @mickaelaccensi, apologies for the delay - I am really struggling for time at the moment! I can confirm that this hangs for me too with the GNU compiler when writing out the restart file. I'll chat with @ukmo-juan-castillo today and see if we can get to the bottom of it. Chris.

ukmo-juan-castillo commented 3 years ago

Sorry for the delay, I was quite busy last week. I will start working on this now and give it top priority. I think I know where the problem is and it should be easy to fix.

ukmo-juan-castillo commented 3 years ago

I ran some tests and it looks like this bug was present before the new coupling changes were merged. In any case, as these particular lines of code were on my list of things to look at for the optimization issue, I am trying to fix the problem.

I narrowed the problem down to the communication handles, which are somehow being overwritten. This points to an 'out of bounds' error or similar. When I tried to compile in debug mode I got several errors. I reckon that fixing those errors will probably fix the problem.

JessicaMeixner-NOAA commented 3 years ago

I just noticed that the test run_test -s PR1_MPI -w work_PR1_MPI_e -m grdset_e -f -p mpirun -n 4 ../model mww3_test_03 hangs. It hangs with 4 tasks but not with other numbers of tasks. The code in the PR https://github.com/ukmo-waves/WW3/pull/18 solves this problem.

ukmo-juan-castillo commented 3 years ago

@ukmo-ccbunney found that this bug fix also affects the oasis regtests. After careful examination, I have found a more satisfactory solution that solves the problems in both the 'multi' and the 'oasis' regtests. This bugfix will affect these configurations, and in particular it will change the restart file of multi configurations.

The changes will be made in the staging branch and tested there.