GANDALF (Graphical Astrophysics code for N-body Dynamics And Lagrangian Fluids)

Mpi meshlessclean #92

Closed · giovanni-rosotti closed this 7 years ago

giovanni-rosotti commented 7 years ago

MPI is working now as far as I can tell (although I haven't tested all the edge cases yet). Gravity needs to zero a and gpot, which was not done previously. I have temporarily done it just before the force calculation, but clearly there should be a better place. I guess EndTimestep would be the right place; can you confirm?
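(For concreteness, a minimal sketch of what the zeroing in EndTimestep could look like; the field names mirror GANDALF-style conventions, but the struct and loop here are illustrative stand-ins, not the actual code.)

```cpp
#include <vector>

constexpr int ndim = 3;

struct Particle {
  double a[ndim];   // acceleration accumulator
  double gpot;      // gravitational potential accumulator
  bool active;      // is this particle finishing its step?
};

// Reset the force accumulators of every particle that has just finished
// its step, so the next force calculation starts from a clean slate.
void EndTimestep(std::vector<Particle>& partdata) {
  for (Particle& p : partdata) {
    if (!p.active) continue;
    for (int k = 0; k < ndim; ++k) p.a[k] = 0.0;
    p.gpot = 0.0;
  }
}
```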

I'll also go through the main loop one last time and check that the ghosts are always up to date (and remove unnecessary updates). We are still doing too many updates (and possibly failing to do some necessary ones, although I should have fixed that).

giovanni-rosotti commented 7 years ago

So I just got rid of some useless calls to the ghost update. There is still a call just before the h calculation which I also think is useless: computing h essentially needs only the positions of the other particles, and these have not changed since the last update. Other things that might change are gpot, if we have gravity, in which case we should move the check on potmin into the gradient calculation, and the mass, but only if we have sink accretion. In the general case there is no need to do any update, and we could move the ghost update to just after sink accretion (see next question). Do you agree?

The other thing I am wondering about is the rebuilding of the tree after sink accretion. Surely we should force a restock rather than a rebuild if we are on a tree-rebuild step: rebuilding the tree again makes no sense when we have just done it! In any case, I think the reason we restock the tree is that the particle masses have changed (note this is relevant only if we have self-gravity). In that case we also need to update the ghosts and the MPI ghosts (otherwise what's the point in rebuilding the ghost tree?), don't we? We need it also because the h calculation depends on the particle masses (see previous point). Finally, we need to restock the MPI ghost tree as well, which we are not doing at the moment.
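(A hedged sketch of what the restock recomputes, assuming a simple array-based tree where children are stored after their parents; the cell layout and field names are illustrative, not GANDALF's actual ones. The same restock would then be applied to the local, ghost, and MPI-ghost trees.)

```cpp
#include <vector>

constexpr int ndim = 3;

struct Particle { double r[ndim]; double m; };

struct TreeCell {
  int c1 = -1, c2 = -1;         // child cell indices (-1 for a leaf)
  int ifirst = -1, ilast = -1;  // particle range covered by a leaf
  double m = 0.0;               // total cell mass
  double rcom[ndim] = {0.0, 0.0, 0.0};  // centre of mass
};

// Recompute cell masses and centres of mass after sink accretion has
// changed the particle masses, without touching the tree topology.
// Walking the cells in reverse order processes children before parents
// (given the storage assumption above).
void RestockTree(std::vector<TreeCell>& cells,
                 const std::vector<Particle>& part) {
  for (int c = static_cast<int>(cells.size()) - 1; c >= 0; --c) {
    TreeCell& cell = cells[c];
    cell.m = 0.0;
    for (int k = 0; k < ndim; ++k) cell.rcom[k] = 0.0;
    if (cell.c1 == -1) {                       // leaf: sum its particles
      for (int i = cell.ifirst; i <= cell.ilast; ++i) {
        cell.m += part[i].m;
        for (int k = 0; k < ndim; ++k) cell.rcom[k] += part[i].m*part[i].r[k];
      }
    } else {                                   // node: combine children
      const int children[2] = {cell.c1, cell.c2};
      for (int cc : children) {
        cell.m += cells[cc].m;
        for (int k = 0; k < ndim; ++k) cell.rcom[k] += cells[cc].m*cells[cc].rcom[k];
      }
    }
    if (cell.m > 0.0)
      for (int k = 0; k < ndim; ++k) cell.rcom[k] /= cell.m;
  }
}
```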

rbooth200 commented 7 years ago

Despite my earlier comments, I actually agree that by changing when we compute potmin we can get rid of a copy to the ghosts. Note this will also change potmin to be a scatter-gather minimum, but I don't see a problem with that; actually I think it is more consistent.
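(For illustration, a sketch of what a scatter-gather potmin could look like inside the neighbour loop; the pair-list formulation and field names are hypothetical, and the comparison assumes gpot stores the magnitude of the potential, so the deepest potential has the largest gpot; flip it if not.)

```cpp
#include <utility>
#include <vector>

struct Particle { double gpot; bool potmin; };

// Flag local potential minima during the neighbour loop instead of
// after a ghost update: each interacting pair (i, j) updates *both*
// particles, which is what makes this a scatter-gather minimum.
void FlagPotentialMinima(std::vector<Particle>& part,
                         const std::vector<std::pair<int,int>>& pairs) {
  for (Particle& p : part) p.potmin = true;
  for (const auto& [i, j] : pairs) {
    if (part[i].gpot > part[j].gpot) part[j].potmin = false;  // scatter to j
    else                             part[i].potmin = false;  // gather to i
  }
}
```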

Also, we need to restock the tree after the sinks because the mass has changed. I agree that a rebuild is unnecessary. But is it possible to avoid it altogether? For example, if we call the sink routines before AdvanceParticles, then we have a valid tree for the sink tree walk. Afterwards we move the particles, which doesn't care about the tree, and then rebuild.
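(A sketch of the reordering being proposed, with stub method names standing in for the real routines.)

```cpp
// Stub interface standing in for the simulation class; the method
// names are illustrative, not GANDALF's actual ones.
struct Simulation {
  void ComputeForces() {}
  void SinkAccretion() {}
  void AdvanceParticles() {}
  void RebuildOrRestockTree() {}
  void EndTimestep() {}
};

// Proposed ordering: accrete while the tree built for this step is
// still valid, then move the particles (which never touches the tree),
// then rebuild/restock once, picking up both the new positions and
// the post-accretion masses.
void MainLoopStep(Simulation& sim) {
  sim.ComputeForces();
  sim.SinkAccretion();        // tree walk uses the still-valid tree
  sim.AdvanceParticles();     // no tree access here
  sim.RebuildOrRestockTree(); // single update covers both changes
  sim.EndTimestep();
}
```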

giovanni-rosotti commented 7 years ago

So I got rid of the other unnecessary calls for updating the ghosts, as discussed. We decided to leave the sinks as they are for now (but I did add other updates in order to be correct); I'll create an issue on sink optimisation (if I remember correctly it's not done as efficiently as it could be), as probably the best thing to do is to move sink accretion to another part of the loop, as suggested by Richard.

I also realised that it's not necessary to update the periodic ghosts after the h calculation (so I got rid of that call as well), as in the gradients we create the ghosts on the fly. With MPI we do need to do it, because MPI ghosts might be created from periodic ghosts. I agree this is quite horrible (and it makes me wonder why we have the periodic ghosts at all), but it works. Are there strong arguments for using the periodic ghosts instead?
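(For reference, on-the-fly periodic ghosts essentially amount to the minimum-image convention; a minimal sketch, assuming a box that is periodic in every dimension, with illustrative names.)

```cpp
constexpr int ndim = 3;

// Map the displacement from particle i to particle j into the nearest
// periodic image, so the gradient loop can use neighbour j directly
// instead of a precomputed periodic ghost copy. Assumes a box that is
// periodic in every dimension with side lengths boxsize[k].
void NearestImage(const double ri[ndim], const double rj[ndim],
                  const double boxsize[ndim], double dr[ndim]) {
  for (int k = 0; k < ndim; ++k) {
    dr[k] = rj[k] - ri[k];
    if      (dr[k] >  0.5*boxsize[k]) dr[k] -= boxsize[k];
    else if (dr[k] < -0.5*boxsize[k]) dr[k] += boxsize[k];
  }
}
```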

giovanni-rosotti commented 7 years ago

Note also that gpot in the interface of ComputeH for the meshless is now useless... I kept it just in case we decide in the future to merge UpdateAllProperties of SPH with the one for the meshless.

rbooth200 commented 7 years ago

> Note also that gpot in the interface of ComputeH for the meshless is now useless... I kept it just in case we decide in the future to merge UpdateAllProperties of SPH with the one for the meshless.

OK. This is a problem we can worry about later, for sure.

> Are there strong arguments for using the periodic ghosts instead?

We could get rid of them, but I suspect it would make the density calculation more expensive. We'd save the time spent creating the ghosts and building the ghost tree, so it's not clear whether it would make the whole code faster or slower overall. I think the on-the-fly ghost generation could be made more efficient, though.

For now, I'd only change ComputeH to use on-the-fly ghosts if it would improve correctness.

rbooth200 commented 7 years ago

I've pushed a couple of very minor changes that are not worthy of note.

Otherwise, the pure hydro tests seem to be working with/without ghosts. However, there is a problem with the sinks, and also with gravhydro tests such as the Evrard collapse.

giovanni-rosotti commented 7 years ago

I've fixed the problems with gravity on more than 2 processors (I was sending the already-computed acceleration to the other processors, possibly resulting in overcounting), and I've moved the zeroing of the acceleration and potential to EndTimestep. The Evrard test now works without problems with individual timesteps and 4 MPI processes. I will now take a look at the sinks.

giovanni-rosotti commented 7 years ago

A more fundamental problem: doesn't the h recalculation invalidate the pruned tree? We use it for things like computing the forces in SPH, gravity in the meshless, etc. Are we just relying on the safety factors to guarantee that all the right neighbours will be found?

rbooth200 commented 7 years ago

I think the h-update must invalidate the pruned tree, as does moving particles around, etc. But in SPH we just rely on the safety factors. I don't much like this; we should have some checks in place, but it's what we do.
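(One possible form such a check could take, sketched with hypothetical names; kernrange and ghost_range echo GANDALF-style safety parameters, but this helper does not exist in the code.)

```cpp
#include <vector>

// After the h-update, verify that no particle's search radius exceeds
// the range that was assumed when the ghosts and pruned trees were
// exchanged.
bool SafetyFactorStillValid(const std::vector<double>& h,
                            double hmax_at_ghost_update,
                            double kernrange, double ghost_range) {
  for (double hi : h)
    if (kernrange*hi > ghost_range*hmax_at_ghost_update) return false;
  return true;
}
```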

giovanni-rosotti commented 7 years ago

Ok, but hopefully this is less of a problem than not restocking: that is a serious issue, whereas this will go wrong only in pathological cases. Is there a way to check after recomputing h if the pruned tree we have sent has been invalidated?

On a different note, I fixed some more bugs in the MPI, including a very nasty memory bug that was very difficult to track down. I really wish we had used vectors from the start...

rbooth200 commented 7 years ago

> Is there a way to check after recomputing h if the pruned tree we have sent has been invalidated?

The best way to do this would be to use the bounding boxes of the ghost trees. As long as a particle's smoothing length does not overlap a region that we don't have particles for, we are OK. Perhaps the most efficient way to do this is to update the hboxes of the local tree and use them for the check. The advantage of this method is that we would know automatically when we need to create more ghosts: we could use smaller safety factors and potentially update the ghosts less often.
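(A sketch of that check, assuming we keep the h-extended bounding box, the hbox, that was valid when the pruned trees were exported; the helper and field names are hypothetical.)

```cpp
constexpr int ndim = 3;

struct Box { double min[ndim]; double max[ndim]; };

// Compare the local tree's current h-extended bounding box against the
// one that was valid when the pruned trees / ghosts were exported. If
// the new hbox pokes outside the old one, a neighbour could now be
// missed and the ghosts need refreshing.
bool HboxStillCovered(const Box& hbox_now, const Box& hbox_at_export) {
  for (int k = 0; k < ndim; ++k) {
    if (hbox_now.min[k] < hbox_at_export.min[k]) return false;
    if (hbox_now.max[k] > hbox_at_export.max[k]) return false;
  }
  return true;
}
```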

rbooth200 commented 7 years ago

Pure hydro and gravhydro tests appear to be working fine for me. I haven't tested the sinks.

dhubber commented 7 years ago

> A more fundamental problem: doesn't the h recalculation invalidate the pruned tree? We use it for things like computing the forces in SPH, gravity in the meshless, etc. Are we just relying on the safety factors to guarantee that all the right neighbours will be found?

For computing h and for the hydro forces, yes, we're in effect relying on the safety factors to 'guarantee' that the pruned tree is valid, both to compute the correct h and so that we know which MPI processes we should be communicating with for the hydro forces. It's a similar hack to what we do with the periodic ghosts. Obviously it's not completely robust, but in practice it seems to work. However, I'm sure there's a better way out there somewhere (one that doesn't involve excessive MPI communication, that is).

> Is there a way to check after recomputing h if the pruned tree we have sent has been invalidated?

Hmmm, again I'm not sure without more communication. I was thinking we could at least check that the pruned tree is valid when we send the particles to other nodes for the hydro forces. However, this does not cover the case where we should have sent a particle to a particular node but did not, because the (incorrect) pruned tree did not flag that we should send it.

I'm not sure if Richard knows about this, but the original plan was to evolve/extrapolate the cell properties in order to avoid communicating the pruned tree every timestep. As a simple example, if we know the COM velocity of a cell, then we can extrapolate the COM position. We could also compute other things like dh/dt (or dhmax/dt) and even dQ/dt (the time derivative of the quadrupole-moment tensor; there's something about this in Sverre's N-body book). I was thinking we should at least do the first two, but since we never got to nrebuildstep > 1 for MPI, this was never implemented. It's one (approximate) solution that reduces the amount of MPI communication. However, as with the safety factors, I am sure there's a better way...
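(As a sketch of the first two extrapolations, the COM drift and the hmax evolution, with illustrative field names rather than the actual pruned-tree layout.)

```cpp
constexpr int ndim = 3;

// One pruned-tree cell, reduced to the quantities being extrapolated.
struct PrunedCell {
  double rcom[ndim];  // centre of mass
  double vcom[ndim];  // centre-of-mass velocity
  double hmax;        // maximum smoothing length in the cell
  double dhmaxdt;     // its time derivative
};

// Drift the COM with its velocity and evolve hmax with dhmax/dt, so
// the pruned tree stays approximately valid between communications.
void ExtrapolateCell(PrunedCell& cell, double dt) {
  for (int k = 0; k < ndim; ++k) cell.rcom[k] += cell.vcom[k]*dt;
  cell.hmax += cell.dhmaxdt*dt;
  if (cell.hmax < 0.0) cell.hmax = 0.0;  // guard against overshoot
}
```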

giovanni-rosotti commented 7 years ago

I referenced this discussion about how to update h in the tree in the issue tracker; it's not something we are going to solve in this pull request, which is about MPI with the meshless. David, can you please let me know if you have found problems in this pull request? Otherwise, I think we should accept it and move on.

dhubber commented 7 years ago

So, I had a look through the code and the changes seem fine. I also ran a selection of our tests and they mainly all seemed to work correctly. The only one that failed was the Boss-Bodenheimer test with the meshless scheme, but I think this was due to a bug that's been addressed in the gravity_debug2 branch. So maybe we should (hopefully soon) merge the gravity_debug2 branch first, then rebase this one onto master, check it all fits together fine, and if that's okay, merge this branch as well? I'm getting to the last gravity_debug2 issues right now, so hopefully this can all be done today.

giovanni-rosotti commented 7 years ago

The Boss-Bodenheimer test was running fine on my machine, but maybe you integrated for longer than I did. Did you get to the point of sink formation?

In any case, last week we actually said that this branch was the highest priority of all the others, and in particular it's holding up the timing branch. But if we are sure the situation is as you describe, then I agree it's a good reason to switch priorities in order to avoid having buggy code in master.

dhubber commented 7 years ago

OK, I was in the same mode as when I was testing the tests branch, i.e. hybrid parallelisation, MPI with OpenMP, with 4 processes and 4 threads each, so 16 threads in total. When I switched to 1 thread per MPI task, it ran fine. So it's a hybrid issue rather than an MPI issue, then.

rbooth200 commented 7 years ago

This might be the same bug as in the tests branch?