Closed: mickaelaccensi closed this issue 3 years ago
This happens only when IOSTYP=2 or IOSTYP=3, PSHARE=T, and an output restart file is requested.
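For reference, and assuming the usual layout of the ww3_multi namelist (where these switches sit in the &DOMAIN_NML block), the failing combination corresponds to something like:
&DOMAIN_NML
  DOMAIN%IOSTYP = 2   ! 2 or 3: dedicated output process(es); 1 works fine
  DOMAIN%PSHARE = T   ! shared output processes, as in the failing setup
/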
To reproduce it with a regtest, add a restart output to mww3_test_04/input/ww3_multi_grdset_d.nml:
&OUTPUT_DATE_NML
ALLDATE%FIELD = '19680606 000000' '1200' '19680608 000000'
ALLDATE%POINT = '19680606 000000' '3600' '19680608 000000'
ALLDATE%RESTART = '19680606 020000' '3600' '19680606 020000'
/
then run the regtest:
./bin/run_test -o both -N -f -S -T -c datarmor_intel_debug -s PR1_MPI -w work_PR1_MPI_d -m grdset_d -f -p $MPI_LAUNCH -n 28 ../model mww3_test_04
It never finishes.
When you kill it, here are the lines where it is stuck:
ww3_multi 00000000012BC688 w3iorsmd_mp_w3ior 542 w3iorsmd.F90
ww3_multi 0000000000DAA809 w3wavemd_mp_w3wav 1413 w3wavemd.F90
ww3_multi 00000000008BA7CD wmwavemd_mp_wmwav 871 wmwavemd.F90
ww3_multi 0000000000405363 MAIN__ 150 ww3_multi.F90
Line 542 in w3iorsmd.F90 is the MPI_WAITALL call:
IF ( IAPROC .EQ. NAPRST ) THEN
!
IH = 1 + NRQ * (IB-1)
CALL MPI_WAITALL &
( NRQ, IRQRSS(IH), STAT1, IERR_MPI )
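For context, this is the point where the restart-writing process waits, block by block, for non-blocking receives it posted earlier; IH = 1 + NRQ*(IB-1) selects the slice of IRQRSS that belongs to block IB. Below is a minimal, self-contained sketch of that pattern. It is not WW3 code: the names NAPRST, NRQ, IH, IRQRSS, STAT1 and IERR_MPI are borrowed from the snippet above, everything else is illustrative. If the handles stored in IRQRSS get overwritten, or NRQ no longer matches the requests actually posted, the MPI_WAITALL never returns, which is the hang seen here.

      PROGRAM RESTART_GATHER_SKETCH
!     Minimal sketch (not WW3 code) of the w3iorsmd pattern: every
!     compute rank sends NB blocks of data to the restart-writing rank
!     NAPRST; the writer posts all receives up front and then waits for
!     them one block at a time, using IH = 1 + NRQ*(IB-1) to select the
!     slice of IRQRSS that belongs to block IB.  If the handles in
!     IRQRSS are overwritten, or NRQ does not match the requests that
!     were actually posted, the MPI_WAITALL below never returns.
      USE MPI
      IMPLICIT NONE
      INTEGER, PARAMETER   :: NAPRST = 0, NB = 3, LREC = 10
      INTEGER              :: IAPROC, NPROC, IERR_MPI, IP, IB, IH, NRQ
      INTEGER, ALLOCATABLE :: IRQRSS(:), STAT1(:,:)
      REAL, ALLOCATABLE    :: BUF(:,:,:), SBUF(:)
!
      CALL MPI_INIT ( IERR_MPI )
      CALL MPI_COMM_RANK ( MPI_COMM_WORLD, IAPROC, IERR_MPI )
      CALL MPI_COMM_SIZE ( MPI_COMM_WORLD, NPROC, IERR_MPI )
      IF ( NPROC .LT. 2 ) THEN
          CALL MPI_FINALIZE ( IERR_MPI )
          STOP 'run this sketch on at least two MPI ranks'
      END IF
!
      IF ( IAPROC .EQ. NAPRST ) THEN
          NRQ = NPROC - 1                   ! one request per sending rank
          ALLOCATE ( IRQRSS(NRQ*NB), STAT1(MPI_STATUS_SIZE,NRQ),       &
                     BUF(LREC,NRQ,NB) )
!         Post all receives; block IB from rank IP is matched by tag IB.
          DO IB = 1, NB
            DO IP = 1, NRQ
              CALL MPI_IRECV ( BUF(1,IP,IB), LREC, MPI_REAL, IP, IB,   &
                   MPI_COMM_WORLD, IRQRSS(IP+NRQ*(IB-1)), IERR_MPI )
            END DO
          END DO
!         Wait block by block, as in w3iorsmd.F90.
          DO IB = 1, NB
            IH = 1 + NRQ * (IB-1)
            CALL MPI_WAITALL ( NRQ, IRQRSS(IH), STAT1, IERR_MPI )
          END DO
      ELSE
          ALLOCATE ( SBUF(LREC) )
          DO IB = 1, NB
            SBUF = REAL(IAPROC*IB)
            CALL MPI_SEND ( SBUF, LREC, MPI_REAL, NAPRST, IB,          &
                            MPI_COMM_WORLD, IERR_MPI )
          END DO
      END IF
!
      CALL MPI_FINALIZE ( IERR_MPI )
      END PROGRAM RESTART_GATHER_SKETCH

Run on a few ranks (e.g. mpirun -n 4) this completes immediately; the header comment marks the two assumptions that, when broken, reproduce the hang.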
The bug was introduced by commit e756361.
@ukmo-ccbunney, @ukmo-juan-castillo, @ukmo-ansaulter, could you correct this bug?
Hi guys,
could you please look at this bug? I'm not able to upgrade my forecast system to the latest version of WW3 because of it. Thanks.
Hi @mickaelaccensi Apologies for the delay - I am really struggling for time at the moment! I can confirm that this is hanging for me too on the GNU compiler when writing out the restart file. I'll chat with @ukmo-juan-castillo today and see if we can get to the bottom of it. Chris.
Sorry for the delay, I have been quite busy this last week. I will start working on this now and give it top priority. I think I know where the problem is, and it should be easy to fix.
I ran some tests and it looks like this bug was already present before the new coupling changes were merged. In any case, as these particular lines of code were on my list of things to look at for the optimization issue, I am trying to fix the problem.
I narrowed the problem down to the communication request handles, which are somehow being overwritten. This points to an out-of-bounds error or something similar. When I tried to compile in debug mode I got several errors; I reckon that fixing those errors will probably fix the problem.
So I just noticed that the test run_test -s PR1_MPI -w work_PR1_MPI_e -m grdset_e -f -p mpirun -n 4 ../model mww3_test_03 hangs. It does not hang with other numbers of tasks, but with 4 tasks it will hang. The code in PR https://github.com/ukmo-waves/WW3/pull/18 solves this problem.
@ukmo-ccbunney found that this bug fix also affects the OASIS regtests. After careful examination, I have found a more satisfactory solution that solves the problems in both the 'multi' and the 'oasis' regtests. This bugfix will affect these configurations, and in particular it will change the restart files of multi configurations.
The changes will be made in the staging branch and tested there.
The bug appears when using IOSTYP=2 or 3; it works well with IOSTYP=1.
By tracking where it keeps waiting, it seems that some processors are stuck in w3wavemd, in the CALL MPI_WAITALL reached when NRQSG2 is positive, and the dedicated output processor is stuck in w3iorsmd.
I'll look for a regtest that highlights the bug.
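Purely as a debugging aid for that kind of tracking (not WW3 code, just a sketch), one can temporarily swap a blocking MPI_WAITALL for a polling loop that reports how many requests are still pending, which shows immediately whether it is the send side (w3wavemd) or the receive side (w3iorsmd) that never completes:

      SUBROUTINE WAITALL_DEBUG ( NRQ, REQS, TSEC )
!     Diagnostic stand-in for MPI_WAITALL: polls the requests and,
!     whenever they have not all completed after TSEC seconds, prints
!     how many are still pending, then keeps waiting.
      USE MPI
      IMPLICIT NONE
      INTEGER, INTENT(IN)          :: NRQ
      INTEGER, INTENT(INOUT)       :: REQS(NRQ)
      DOUBLE PRECISION, INTENT(IN) :: TSEC
      INTEGER          :: STATS(MPI_STATUS_SIZE,NRQ),                  &
                          STATUS(MPI_STATUS_SIZE), IERR, I, NPEND
      LOGICAL          :: DONE, FLAG
      DOUBLE PRECISION :: T0
!
      T0 = MPI_WTIME()
      DO
        CALL MPI_TESTALL ( NRQ, REQS, DONE, STATS, IERR )
        IF ( DONE ) RETURN
        IF ( MPI_WTIME() - T0 .GT. TSEC ) THEN
            NPEND = 0
            DO I = 1, NRQ
              CALL MPI_REQUEST_GET_STATUS ( REQS(I), FLAG, STATUS, IERR )
              IF ( .NOT. FLAG ) NPEND = NPEND + 1
            END DO
            WRITE (*,*) 'WAITALL_DEBUG: still waiting on ', NPEND,     &
                        ' of ', NRQ, ' requests'
            T0 = MPI_WTIME()
        END IF
      END DO
      END SUBROUTINE WAITALL_DEBUG

Replacing, for example, CALL MPI_WAITALL ( NRQ, IRQRSS(IH), STAT1, IERR_MPI ) with CALL WAITALL_DEBUG ( NRQ, IRQRSS(IH), 30.D0 ) makes every stuck rank print a line every 30 seconds instead of hanging silently.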