Closed: danieljprice closed this issue 3 years ago.
@jameswurster the sums over 1,npart are correct. With MPI, npart is the number of particles LOCAL to each MPI process; nptot is the total number of particles across all threads.
Each thread should sum from 1 to npart. Then, for things like total energy, we have a reduceall_mpi('+',etot) call which sums the result over all threads. So I think the energies behaviour is correct.
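To illustrate the intended pattern, here is a minimal standalone sketch, using plain MPI calls rather than Phantom's reduceall_mpi wrapper and a made-up per-particle energy array: each process sums over its local particles only, and the partial sums are then combined across all processes.

```fortran
! Minimal sketch of the partial-sum + reduce pattern (simplified, not
! Phantom's actual energies routine; eparticle is a made-up array).
program partial_sum_demo
 use mpi
 implicit none
 integer :: ierr, id, npart, i
 real(8) :: etot_local, etot
 real(8), allocatable :: eparticle(:)

 call MPI_Init(ierr)
 call MPI_Comm_rank(MPI_COMM_WORLD, id, ierr)

 npart = 500                         ! particles LOCAL to this process
 allocate(eparticle(npart))
 eparticle = 1.0d0                   ! pretend each particle contributes 1

 etot_local = 0.0d0
 do i = 1,npart                      ! partial sum over local particles only
    etot_local = etot_local + eparticle(i)
 enddo

 ! analogue of reduceall_mpi('+',etot): add partial sums over all processes
 call MPI_Allreduce(etot_local, etot, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                    MPI_COMM_WORLD, ierr)
 if (id == 0) print*, 'etot =', etot ! 1000 when run on 2 processes

 call MPI_Finalize(ierr)
end program partial_sum_demo
```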
the 0.5*dtclean thing might be hiding another timestep criterion that is too loose, possibly in the radiation timestep.
Just to clarify, if my simulation has 1000 particles and I'm running two MPI threads, does that mean that in energies.F, npart = 500? If npart = 1000 and all threads are performing the calculations, then the results should be fine for averaged values, but wouldn't they be doubled in this example for summed values? I'm just trying to get my head around how the MPI is working for energies.
I agree that the 0.5*dtclean may be hiding other issues, but we were having intermittent MPI failures before the radiation was included, if memory serves. I thought we had previously concluded that there was an issue with summing between / communicating over threads...
Correct, npart should be (approximately) 500 if there are 1000 particles in total and 2 MPI threads. Each thread does a partial sum (over npart), and then the totals are added across all threads using the reduceall operation.
I have tested this a little more, and I do get the same energies, etc. from the Sedov test with and without MPI. I was getting confused since there was a bug in write_evlog where the values of npart & ndead were from a single thread but nalive was summed across all threads. I have corrected this, but not yet issued a pull request.
While looking at this, I also see that the number of particles per ibin is wrong in the MPI version: the printed values are actually from a single thread. In utils_indtimesteps.F90, the reduction statements that would correct this have been commented out. Is there a reason they are not active? I don't mind fixing this, but please let me know if they were commented out deliberately.
Cheers
I think there might have been a hang, because the routine was only called with (id==master), meaning it was only run by the master thread. So feel free to put the reduce statements back in; we just need to ensure the routine is called by all MPI threads.
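For reference, a standalone sketch of that point about collectives (the variable and bin names here are made up, not the actual ones in utils_indtimesteps.F90): every MPI process must reach the reduction, and only the master prints the combined counts. Calling the routine only inside an if (id==master) block would leave the other processes out of the collective and hang.

```fortran
! Sketch: reduce per-bin particle counts across MPI processes.
! The collective must be executed by ALL processes; only the master prints.
program bin_count_demo
 use mpi
 implicit none
 integer, parameter :: nbinmax = 30
 integer :: ierr, id, ibin
 integer :: ninbin_local(0:nbinmax), ninbin(0:nbinmax)

 call MPI_Init(ierr)
 call MPI_Comm_rank(MPI_COMM_WORLD, id, ierr)

 ninbin_local = 0
 ninbin_local(3) = 100 + id          ! pretend count of local particles in bin 3

 ! collective: every process must reach this call, not just the master
 call MPI_Allreduce(ninbin_local, ninbin, nbinmax+1, MPI_INTEGER, MPI_SUM, &
                    MPI_COMM_WORLD, ierr)

 if (id == 0) then                   ! only the master prints the totals
    do ibin = 0,nbinmax
       if (ninbin(ibin) > 0) print*, 'bin', ibin, ':', ninbin(ibin), 'particles'
    enddo
 endif

 call MPI_Finalize(ierr)
end program bin_count_demo
```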
I have made the changes and issued a pull request.
The routine was already called by all threads, so I only needed to re-enable the commented-out lines.
My OpenMP and MPI runs now produce the same results (i.e. energies and particle counts).
The initial MPI bug remains, since 0.5*dtclean is still required to pass the MPI testsuite.
I think this is now closed/fixed by recent pushes. Please reopen if not
Raised by @jameswurster:
The good news is that all my commits now pass the MPI testsuite.
The bad news is that I think we are hiding a bug rather than actually fixing it. Specifically, I had to undo the change where dtclean -> 0.5*dtclean only if we are using overcleaning. It seems that this extra factor of two is inexplicably required in MPI (suggesting a bug somewhere).
While looking, I found another issue. I'm not sure how frequently it appears, since it was previously only displayed in a conditional print statement (for now, I have forced the print statement to always be on). Specifically, after the Sedov test, we have npart = 2048 and nptot = 4096, so there are -2048 dead particles. Since I was running with 2 MPI processes, it appears that both processes are doing all of energies and then summing the results. This doubling of values may account for the lack of momentum conservation when using a slightly larger timestep.
energies.F90 seems to be called by each process, but the loop limits (for non-sink particles) are do i = 1,npart. Is this correct, or should this range vary with the number of threads, or should energies be called only by id==master? On a similar note, I see there is a do i = 1,npart in force.F90 when we are searching for gas particles to become sink candidates. Since there is no summing in that case, there should be no problem, even if we are duplicating the search.
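To make the doubling concern concrete, here is a small standalone sketch (made-up numbers, plain MPI rather than Phantom's wrappers) contrasting a loop over each process's local share of particles with a loop over the global nptot on every process: after the reduce, the former gives the correct total while the latter multiplies summed quantities by the number of processes. Averages are unaffected either way, because numerator and denominator scale together.

```fortran
! Sketch of the doubling effect (assumed, simplified): if every process
! loops over ALL particles and the results are then summed across
! processes, totals come out multiplied by the number of processes.
program doubling_demo
 use mpi
 implicit none
 integer, parameter :: nptot = 1000        ! total particles (made-up)
 integer :: ierr, id, nprocs, i
 real(8) :: elocal, eglobal, etot_right, etot_wrong

 call MPI_Init(ierr)
 call MPI_Comm_rank(MPI_COMM_WORLD, id, ierr)
 call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

 ! correct: each process sums only its own share (assumes nprocs divides nptot)
 elocal = 0.0d0
 do i = 1 + id*(nptot/nprocs), (id+1)*(nptot/nprocs)
    elocal = elocal + 1.0d0                ! pretend each particle has energy 1
 enddo
 call MPI_Allreduce(elocal, etot_right, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                    MPI_COMM_WORLD, ierr)

 ! wrong: each process sums over ALL particles before the reduce
 eglobal = 0.0d0
 do i = 1,nptot
    eglobal = eglobal + 1.0d0
 enddo
 call MPI_Allreduce(eglobal, etot_wrong, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                    MPI_COMM_WORLD, ierr)

 ! with 2 processes this prints 1000 and 2000
 if (id == 0) print*, 'correct total:', etot_right, ' doubled total:', etot_wrong

 call MPI_Finalize(ierr)
end program doubling_demo
```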