rbooth200 opened this pull request 7 years ago
Ok, I've rebased and removed the hotfix. Now we only need @giovanni-rosotti's fix for FractionBoxOverlap
I've added these fixes, it should now be ready to go...
Well, I guess this is basically close to ready. If it passes Travis, then I guess it's time to merge into master. But who's going to do the honours since we've all contributed heavily to this pull request? :-)
Giovanni plans to go through this carefully first I think...
Okay, looking at it, Giovanni made the fewest commits so perhaps it's better he does it to give most of the changes better scrutiny!
Yes, I am looking at it. I'm almost finished actually...
Btw, why isn't Travis running tests anymore?
I think there was a problem with Travis earlier. It's meant to be fixed now, so let's see what happens when I push the changes.
Ok, I've made the changes I'm responsible for.
Travis seems to be working again, too.
OK, I've made the change to Sinks.cpp now. Basically, I've moved the first two loops into a master block, leaving only the larger section (which actually does most of the work) in the parallel dynamic block. There are perhaps ways to improve this even more (e.g. moving the master region out of the omp region entirely), but hopefully this addresses the immediate issue.
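For context, a minimal sketch of the OpenMP pattern being described; the function name, loop bodies and variables such as Nsink and Npart are illustrative placeholders, not the actual Sinks.cpp code:

```cpp
// Minimal sketch only: illustrative names and loop bodies, not the actual Sinks.cpp code.
void AccretionSketch(int Nsink, int Npart)
{
#pragma omp parallel
  {
    // The two cheap setup loops run on a single thread.
#pragma omp master
    {
      for (int s = 0; s < Nsink; s++) { /* per-sink setup */ }
      for (int i = 0; i < Npart; i++) { /* e.g. build the list of particles flagged for accretion */ }
    }
    // 'master' has no implied barrier, so wait before the shared loop.
#pragma omp barrier

    // Only the expensive section is distributed, with a dynamic schedule
    // since the work per sink can vary a lot.
#pragma omp for schedule(dynamic)
    for (int s = 0; s < Nsink; s++) { /* main accretion work for sink s */ }
  }
}
```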
Why don't we parallelise over the number of sinks using OpenMP, in the same way that we do over cells for the normal tree walks?
Are you talking about the accrete loop? Because it does loop over sinks. Or are you talking about other places like when computing gravity for sinks?
The loop over sinks in the master region.
Because there was this line, 'plist[Ninsink++] = i;', which I wasn't sure I could do with atomics or without criticals. But I'll probably rewrite it another way in the future anyway, so it's not too important.
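For context, a minimal sketch of how an append like plist[Ninsink++] = i could be parallelised; the surrounding function, the in_sink flag and all other names are illustrative assumptions rather than GANDALF's actual code:

```cpp
#include <vector>

// Minimal sketch only: the function, the in_sink flag and all names here are
// illustrative assumptions, not GANDALF's actual code. It shows one way the
// 'plist[Ninsink++] = i' append could be parallelised.
void BuildSinkParticleList(int Npart, const bool* in_sink, std::vector<int>& plist)
{
  int Ninsink = 0;
  plist.resize(Npart);                  // worst case: every particle is flagged

#pragma omp parallel for
  for (int i = 0; i < Npart; i++) {
    if (!in_sink[i]) continue;

    int slot;
#pragma omp atomic capture
    slot = Ninsink++;                   // reserve a unique slot atomically
    plist[slot] = i;

    // Alternative: put the increment and the write in a single
    // '#pragma omp critical' section, which is simpler but serialises the append.
  }
  plist.resize(Ninsink);

  // Note: either way the ordering of plist now depends on thread scheduling,
  // which matters if reproducibility is a goal.
}
```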
Is there any reason we've not merged this yet? I know there was this strange issue with Travis marginally failing a test randomly, but do we have any reason to suspect the cause of this originated in this branch? If not, then we should merge asap. But Giovanni should still be the final judge of this of course! :-)
Sorry, I had taken some vacation from GANDALF! :) I think there are two issues left: a) the right value to use for ghost_range; b) the parallelization for the sinks, that I need to catch up with. I am happy to delay the latter to a separate issue, but I feel that we can decide the former immediately, it's not such a big decision I would say... is there any reason why we should NOT go back to 2.5?
I'm happy with 2.5
This has failed because there is a ~1 per cent random fluctuation in the error norm when running with MPI. The SPH tolerance on the error norm is very close to the actual error in the serial case, so it's causing random failures.
Thoughts?
Do your recent changes to MPI change anything with respect to the small difference in the error norm?
If you mean my changes, they were for a separate issue, but they push the error norm just above the threshold as before, so it's probably that ghost range issue again. Was this ghost range issue fixed or discussed on another branch/pull request by any chance? If we don't know what the cause is, or how to fix it, the simplest 'fix' is to increase the maximum error norm value so Travis passes these tests. I don't particularly like that solution, but I'm not sure what else to suggest.
No, I meant Richard's recent changes (didn't know you were doing any modification to MPI, David!)
No, I wasn't doing any direct MPI modifications, but the timesteps bug I fixed was a potential MPI bug (in the way the global min/max timesteps were reduced over MPI), so I wasn't sure if you were talking about that.
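For context, a minimal sketch of the kind of global timestep reduction being discussed, with illustrative variable names rather than GANDALF's: the point is simply that every rank needs the same global min/max, which is what MPI_Allreduce with MPI_MIN/MPI_MAX provides.

```cpp
#include <mpi.h>

// Minimal sketch only, with illustrative variable names rather than GANDALF's:
// every rank must end up with the same global minimum/maximum timestep, so an
// Allreduce (rather than a root-only Reduce) is the natural pattern.
void ReduceTimesteps(double dt_min_local, double dt_max_local,
                     double& dt_min_global, double& dt_max_global)
{
  MPI_Allreduce(&dt_min_local, &dt_min_global, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
  MPI_Allreduce(&dt_max_local, &dt_max_global, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
}
```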
What do you mean by recent? Changing the ghost range doesn't solve the issue. Otherwise my changes in the MPI/Sinks branch only affect sink particles.
Changing the ghost range doesn't solve the issue.
It doesn't solve the issue, but it does seem to affect the final error, pushing it just below the threshold so the Travis tests pass. So either the ghost range really isn't big enough, or there's something else at play (probably the latter). I'm still scratching my head for ideas to solve it. On the other hand, if this issue isn't specific to this branch (I remember we had similar issues with other branches), shouldn't we find a way to finally merge this branch and then deal with the problem on a separate one?
Actually, I don't think it's the ghost range; the result is quite random, after all. Sometimes it passes, sometimes it fails, with both values of the ghost range.
I've only ever seen it fail with the smaller value (2.5), though. I agree it's not as simple as that (as I wrote above); maybe it's a race condition, either in the particle ordering when computing forces or in an MPI summation? Maybe with MPI it sometimes leads to another h computation, giving slightly different h's and therefore slightly different errors? If it's simply that, the acceptable error tolerance is way too close to the computed error and should be increased. But if that were the case, it would also be nice to try to guarantee reproducible results (probably impossible with our MPI scheme, though).
I definitely remember seeing it fail with both. I don't have any evidence any more though due to a rebase...
The MPI is always going to be a bit non-deterministic due to the load balancing, so small fluctuations have to be expected. I don't know how small though, and ~1% seems on the high side to me so I suspect that there might be a bug somewhere.
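For context, a minimal, self-contained illustration of why a load-balance-dependent summation order alone gives run-to-run differences (the values are arbitrary and unrelated to GANDALF's actual numbers):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Self-contained illustration (arbitrary values, nothing to do with GANDALF's
// actual numbers): floating-point addition is not associative, so summing the
// same terms in a different, decomposition-dependent order changes the result.
int main()
{
  std::vector<double> terms = {1.0e16, 1.0, -1.0e16, 1.0, 1.0e-3, -1.0};

  double forward = 0.0;
  for (double x : terms) forward += x;

  std::reverse(terms.begin(), terms.end());
  double reversed = 0.0;
  for (double x : terms) reversed += x;

  // The two sums differ, even though mathematically they are identical.
  std::printf("forward = %.17g, reversed = %.17g\n", forward, reversed);
  return 0;
}
```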
Let's merge all of these optimizations from the investigations for the paper. I'm going to rebase this (if possible) and roll back the hotfix for MPI...