anderkve / gambit_np


Freeze at MPI finalize #11

Open fzeiser opened 3 years ago

fzeiser commented 3 years ago

The calculation runs fine now, but seems to freeze at MPI_Finalize:

out-2691589-0

[...]
Loading Diver differential evolution plugin for ScannerBit.
Starting Diver run...
Diver run finished!
ScannerBit is waiting for all MPI processes to report their shutdown condition...
Final dataset size is 1050765

GAMBIT has finished successfully!

Calling MPI_Finalize...

From default.log_0 it seems like everything ran successfully, too (and I haven't received any errors):

[fabiobz@login-1.FRAM /cluster/projects/nn9464k/progs/gambit_np/submit]$ tail -20 runs/NuclearBit_162Dy_43_1/logs/default.log_0 
HDF5Printer2 output finalisation complete.
--<>--<>--<>--<>--<>--<>--<>--
(Wed May 19 20:39:34 2021)(5225.13 [s])(Rank 0)[Default]:
GAMBIT run completed successfully.
--<>--<>--<>--<>--<>--<>--<>--
(Wed May 19 20:39:34 2021)(5225.14 [s])(Rank 0)[Default,Core][Info]:
NO_MORE_MESSAGES code broadcast to all processes
--<>--<>--<>--<>--<>--<>--<>--
(Wed May 19 20:39:34 2021)(5225.14 [s])(Rank 0)[Default,Core][Info]:
Cleaning up shutdown message send buffers
--<>--<>--<>--<>--<>--<>--<>--
(Wed May 19 20:39:34 2021)(5225.17 [s])(Rank 0)[Default]:
All shutdown messages successfully Recv'd on this process!
--<>--<>--<>--<>--<>--<>--<>--
(Wed May 19 20:39:34 2021)(5225.17 [s])(Rank 0)[Default]:
Calling MPI_Finalize...
--<>--<>--<>--<>--<>--<>--<>--
(Wed May 19 20:39:37 2021)(5227.61 [s])(Rank 0)[Default]:
MPI successfully finalized!
--<>--<>--<>--<>--<>--<>--<>--

The job took about 1.5 h to process, but as I did not know how long it would take, I set the max runtime to 2 days. The job then ran for the full 2 days before it was cancelled. That's a waste of resources :disappointed:.

I assume this is connected to the combination of MPI version, compiler, and so on. If you have a good idea, @anderkve, please let me know. Otherwise I might just live with this and set tighter runtime limits. (I don't want to go through the compiler / module dependency hell again.)

Another workaround could be to have a job that monitors the log file and sends a signal to GAMBIT once (some minutes after?) the logger prints "MPI successfully finalized!".
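
A minimal sketch of such a watchdog, in Python. Everything here is an assumption for illustration rather than part of GAMBIT: the function names `log_reports_finalized` and `watch_and_terminate`, the polling and grace intervals, and the choice of SIGTERM. The marker string is copied from the rank 0 log above. Whether SIGTERM actually ends a process stuck in MPI_Finalize is untested, and in a multi-rank job the PID would probably need to be that of the launcher rather than a single rank.

```python
import os
import signal
import time

# Marker copied from the rank 0 log excerpt in this issue.
FINALIZED_MARKER = "MPI successfully finalized!"


def log_reports_finalized(log_path):
    """Return True if the log file exists and contains the marker line."""
    try:
        with open(log_path, "r", errors="replace") as handle:
            return any(FINALIZED_MARKER in line for line in handle)
    except FileNotFoundError:
        return False


def watch_and_terminate(log_path, pid, grace_seconds=300, poll_seconds=30,
                        max_polls=None, kill=os.kill, sleep=time.sleep):
    """Poll the log; once the marker appears, wait, then send SIGTERM to pid.

    Hypothetical helper: `kill` and `sleep` are injectable so the logic can
    be exercised without touching a real process.
    Returns True if a signal was sent, False otherwise.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if log_reports_finalized(log_path):
            # Give a healthy run the chance to exit by itself first.
            sleep(grace_seconds)
            try:
                kill(pid, signal.SIGTERM)
            except ProcessLookupError:
                return False  # the process already exited on its own
            return True
        polls += 1
        sleep(poll_seconds)
    return False
```

It could be started alongside the run as, for example, `watch_and_terminate("runs/NuclearBit_162Dy_43_1/logs/default.log_0", pid)`, where `pid` is whatever process should receive the signal. Note that it only looks at one log file, so it says nothing about whether the other ranks have finalized.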