WarwickMicroscopy / Felix

Felix: Bloch wave method diffraction pattern simulation software

Current error handling fails if only one core has the error #143

Closed JacobJoseph-gu closed 7 years ago

JacobJoseph-gu commented 7 years ago

I have replicated a few cases of this. When an error occurs in a subroutine on one core only, that core falls back to felixrefine and tries to call MPI_Finalize to abort, but felix hangs. The other core keeps using CPU without any output to the terminal.

Below is an example terminal output of this hanging. Rank 1 has the error while rank 0, the default terminal output core, continues as usual. Lines starting '@' are rank 0; lines starting '1 = rank' are the error messages from rank 1. Rank 1 is calling MPI_Finalize and it is hanging without closing.

@ --------------------------------------------------------------
@ felixrefine: Version: multipole / BUILD / Alpha
@ Date: 27-06-2017
@ Status: multipole atom test & debug
@ on rank = 000 out of the total = 002
@ --------------------------------------------------------------
@ Mode D selected: Refining Isotropic Debye Waller Factors
@ Mode selected: Refining by pairwise maximum gradient
@ Setting teminal output mode
1 = rank, error in SpecificReflectionDetermination(): No requested HKLs are allowed using the purposed geometry
1 = rank, error in felixrefine after attempting to SpecificReflectionDetermination()
1 = rank, calling MPI_Finalize
----@ Number of experimental images successfully loaded = 069
----@ MeanInnerPotential(Volts) = +5.777E+01
----@ number of unique structure factors = 085

There is MPI_Abort() if we need it, as well as the MPI_Finalize() settings. I shall look into this, but I thought it was worth mentioning here nonetheless.
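
For reference, a minimal standalone sketch (not Felix code; all names are illustrative) of why the two calls behave differently: MPI_Finalize is collective, so a lone rank calling it while the others carry on will simply block, whereas MPI_Abort on MPI_COMM_WORLD brings down every rank.

! Illustrative sketch only, not taken from Felix.
! MPI_Finalize is collective: if only the failing rank calls it,
! that rank blocks while the other ranks keep computing.
! MPI_Abort(MPI_COMM_WORLD, ...) instead terminates all ranks.
PROGRAM abort_demo
  USE MPI
  IMPLICIT NONE
  INTEGER :: ierr, my_rank, IErrorFlag

  CALL MPI_Init(ierr)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)

  ! Pretend only rank 1 hits an error (hypothetical flag)
  IErrorFlag = 0
  IF (my_rank == 1) IErrorFlag = 1

  IF (IErrorFlag /= 0) THEN
    ! Kills every process in the communicator, so nothing hangs
    CALL MPI_Abort(MPI_COMM_WORLD, IErrorFlag, ierr)
  END IF

  ! ... normal work would continue here on all ranks ...

  ! Collective: only reached when every rank gets here error-free
  CALL MPI_Finalize(ierr)
END PROGRAM abort_demo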

rbeanland commented 7 years ago

This is probably caused by too few strong beams. If so, it may be worth putting in a lower limit of 2x the number of experimental images, along the lines of the sketch below.
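
A possible form of that check, as a sketch only; nBeams and INoImages are stand-ins for whatever Felix actually calls the strong-beam count and the number of loaded experimental images, and IErr for its error flag:

! Hypothetical names: nBeams = strong beams kept, INoImages = experimental images
IF (nBeams < 2*INoImages) THEN
  IErr = 1  ! flag the error; the top-level handling then aborts all ranks
  WRITE(*,'(A,I0,A,I0)') 'Too few strong beams: ', nBeams, ' < ', 2*INoImages
END IF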

JacobJoseph-gu commented 7 years ago

I manually created this error and tested similar errors occurring on a single rank with other subroutines.

After researching, I've added MPI_Abort() error handling to the top level of felixrefine, which shuts down all cores in the case of an error on any core. If all cores finish without error, as normal, then they call MPI_Finalize() gracefully as before.
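
A rough sketch of that pattern, with placeholder names (IErr, my_rank, ierr) rather than the actual felixrefine variables:

! Sketch of the described top-level handling; variable names are placeholders.
! Any nonzero IErr returned from the refinement triggers MPI_Abort on that rank,
! which terminates all ranks; otherwise every rank reaches MPI_Finalize together.
IF (IErr /= 0) THEN
  WRITE(*,'(I0,A)') my_rank, ' = rank, error detected, calling MPI_Abort'
  CALL MPI_Abort(MPI_COMM_WORLD, IErr, ierr)   ! brings down every core
END IF

CALL MPI_Finalize(ierr)                        ! normal, error-free shutdown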

I shall also add in the lower limit for strong beams.