NOAA-EMC / WW3

WAVEWATCH III
Other
254 stars 501 forks source link

tp2.10 MPI_OMPH test not reproducing with different number of MPI/OMP threads #321

Open JessicaMeixner-NOAA opened 3 years ago

JessicaMeixner-NOAA commented 3 years ago

When running:

./bin/run_test -b slurm -c hera -S -T -s MPI_OMPH -w work_MPI_OMPH -f -p mpirun -n 2 -t 6 ../model -o both ww3_tp2.10 with either different number of MPI tasks or OMP threads there are differences when comparing with matrix.comp.

JessicaMeixner-NOAA commented 3 years ago

@ukmo-ccbunney I remember you mentioning that you thought you had looked into this before. I just found a switch: B4B that's a double switch. I wonder if these are the updates that you made for this issue before?

ukmo-ccbunney commented 3 years ago

@JessicaMeixner-NOAA - yes - the B4B is a "bit 4 bit" switch. I added it to get around the b4b issues we originally had with that regtests (to do with a reduction in the SMC module). When enabled, the B4B switch scales and converts some real values to integer values to avoid rounding errors during the reduction.

It fixes the issue when running with the same number of PRs/threads (at least for us), but there maybe other parts of the code that cause issues when running with a different combination of PEs/threads...

aliabdolali commented 2 years ago

After updating the mpirun to srun on NOAA HPC, ww3_tp2.16/./work_MPI_OMPH is showing similar behavior. It is not b4b anymore. ./bin/run_test -b slurm -c hera.intel -S -T -s MPI_OMPH -w work_MPI_OMPH -f -p srun -n 4 -t 6 ../model -o both ww3_tp2.16

ukmo-ccbunney commented 2 years ago

Thanks for the update @aliabdolali I think these OMP B4B issues are due to reduction operations (which are not guaranteed to be B4B due to the order of the reduction not being guaranteed). The way I got around it un ww3_tp2.10 (for our system at least) was to perform the reduction on arrays that had been scaled and converted to integer values. It might be that something similar needs to be done for ww3_tp2.16.

I will investigate when I get more time.

JessicaMeixner-NOAA commented 2 years ago

@ukmo-ccbunney I noticed that we are do have the "B4B" flag in both the tp2.10 and the 2.16 OMPH cases, I had thought that the tp2.16 was an issue where something was not getting recompiled correctly, but I used the same switches for both tests and it was okay. (I thought this because if I ran the tp2.16 test from clean clones I could get b4b, but not if they were run after the tp2.10 test and using the tp2.10 switch did not help).

ukmo-ccbunney commented 2 years ago

@JessicaMeixner-NOAA I think the B4B implementation must no longer be working (or it is and there is some other issue). It did originally fix the b4b issue with the SMC propagation a few years ago, but it seems to have come back again and I am not sure why! Certainly, it seems to be implementation specific, as you have seen with your change of scheduler.