rbooth200 opened this pull request 7 years ago
Ok, I've rebased and removed the hotfix. Now we only need @giovanni-rosotti's fix for FractionBoxOverlap
I've added these fixes, it should now be ready to go...
Well, I guess this is basically close to ready. If it passes Travis, then I guess it's time to merge into master. But who's going to do the honours since we've all contributed heavily to this pull request? :-)
Giovanni plans to go through this carefully first I think...
Okay, looking at it, Giovanni made the fewest commits so perhaps it's better he does it to give most of the changes better scrutiny!
Yes, I am looking at it. I'm almost finished actually...
Btw, why isn't Travis running tests anymore?
I think there was a problem with Travis earlier. It's meant to be fixed now, so let's see what happens when I push the changes.
Ok, I've made the changes I'm responsible for.
Travis seems to be working again, too.
OK, I've made the change to Sinks.cpp now. Basically, I've moved the first two loops into a master block, leaving only the larger section (which actually does most of the work) in the parallel dynamic block. There are perhaps ways to improve this even more (e.g. moving the master region out of the omp region entirely), but hopefully this addresses the immediate issue.
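For context, a minimal sketch of the OpenMP pattern being described; the function name, loop bodies and variables such as Nsink and Npart are illustrative placeholders, not the actual Sinks.cpp code:

```cpp
// Minimal sketch only: illustrative names and loop bodies, not the actual Sinks.cpp code.
void AccretionSketch(int Nsink, int Npart)
{
#pragma omp parallel
  {
    // The two cheap setup loops run on a single thread.
#pragma omp master
    {
      for (int s = 0; s < Nsink; s++) { /* per-sink setup */ }
      for (int i = 0; i < Npart; i++) { /* e.g. build the list of particles flagged for accretion */ }
    }
    // 'master' has no implied barrier, so wait before the shared loop.
#pragma omp barrier

    // Only the expensive section is distributed, with a dynamic schedule
    // since the work per sink can vary a lot.
#pragma omp for schedule(dynamic)
    for (int s = 0; s < Nsink; s++) { /* main accretion work for sink s */ }
  }
}
```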
Why don't we parallelise over the number of sinks using OpenMP, in the same way that we do over cells for the normal tree walks?
Are you talking about the accrete loop? Because it does loop over sinks. Or are you talking about other places like when computing gravity for sinks?
The loop over sinks in the master region.
Because there was this line, 'plist[Ninsink++] = i;', which I wasn't sure I could do with atomics or without criticals. But I'll probably rewrite it another way in the future anyway, so it's not too important.
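For context, a minimal sketch of how an append like plist[Ninsink++] = i could be parallelised; the surrounding function, the in_sink flag and all other names are illustrative assumptions rather than GANDALF's actual code:

```cpp
#include <vector>

// Minimal sketch only: the function, the in_sink flag and all names here are
// illustrative assumptions, not GANDALF's actual code. It shows one way the
// 'plist[Ninsink++] = i' append could be parallelised.
void BuildSinkParticleList(int Npart, const bool* in_sink, std::vector<int>& plist)
{
  int Ninsink = 0;
  plist.resize(Npart);                  // worst case: every particle is flagged

#pragma omp parallel for
  for (int i = 0; i < Npart; i++) {
    if (!in_sink[i]) continue;

    int slot;
#pragma omp atomic capture
    slot = Ninsink++;                   // reserve a unique slot atomically
    plist[slot] = i;

    // Alternative: put the increment and the write in a single
    // '#pragma omp critical' section, which is simpler but serialises the append.
  }
  plist.resize(Ninsink);

  // Note: either way the ordering of plist now depends on thread scheduling,
  // which matters if reproducibility is a goal.
}
```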
Is there any reason we've not merged this yet? I know there was this strange issue with Travis marginally failing a test randomly, but do we have any reason to suspect the cause of this originated in this branch? If not, then we should merge asap. But Giovanni should still be the final judge of this of course! :-)
Sorry, I had taken some vacation from GANDALF! :) I think there are two issues left: a) the right value to use for ghost_range; b) the parallelization for the sinks, that I need to catch up with. I am happy to delay the latter to a separate issue, but I feel that we can decide the former immediately, it's not such a big decision I would say... is there any reason why we should NOT go back to 2.5?
I'm happy with 2.5
This has failed because there is a ~1 per cent random fluctuation in the error norm when running with MPI. The SPH tolerance on the error norm is very close to the actual error in the serial case, so it's causing random failures.
Thoughts?
Do your recent changes to MPI change anything with respect to the small difference in the error norm?
If you mean my changes, they were for a separate issue, but they push the error norm just above the threshold as before, so it's probably that ghost range issue again. Was this ghost range issue fixed or discussed on another branch/pull request by any chance? If we don't know what the cause is, or how to fix it, the simplest 'fix' is to increase the maximum error norm value so Travis passes these tests. I don't particularly like that solution, but I'm not sure what else to suggest.
No, I meant Richard's recent changes (didn't know you were doing any modification to MPI, David!)
No, I wasn't doing any direct MPI modifications, but the timesteps bug I fixed was a potential MPI bug (in the way the global min/max timesteps were reduced over MPI), so I wasn't sure if you were talking about that.
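For context, a minimal sketch of the kind of global timestep reduction being discussed, with illustrative variable names rather than GANDALF's: the point is simply that every rank needs the same global min/max, which is what MPI_Allreduce with MPI_MIN/MPI_MAX provides.

```cpp
#include <mpi.h>

// Minimal sketch only, with illustrative variable names rather than GANDALF's:
// every rank must end up with the same global minimum/maximum timestep, so an
// Allreduce (rather than a root-only Reduce) is the natural pattern.
void ReduceTimesteps(double dt_min_local, double dt_max_local,
                     double& dt_min_global, double& dt_max_global)
{
  MPI_Allreduce(&dt_min_local, &dt_min_global, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
  MPI_Allreduce(&dt_max_local, &dt_max_global, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
}
```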
What do you mean by recent? Changing the ghost range doesn't solve the issue. Otherwise my changes in the MPI/Sinks branch only affect sink particles.
Changing the ghost range doesn't solve the issue.
It doesn't solve the issue, but it does seem to affect the final error, pushing it just below the threshold so the Travis tests pass. So either the ghost range really isn't big enough, or there's something else at play (probably the latter). I'm still scratching my head for ideas to solve it. On the other hand, if this issue isn't specific to this branch (I remember we had similar issues with other branches), shouldn't we find a way to finally merge this branch and then deal with the problem on a separate one?
Actually, I don't think it's the ghost range; the result is quite random, after all. Sometimes it passes, sometimes it fails, with both values of the ghost range.
I've only ever seen it fail with the smaller value (2.5), though. I agree it's not as simple as that (as I wrote above); maybe it's a race condition, either in the particle ordering when computing forces or in an MPI summation? Maybe with MPI it sometimes leads to another h computation, giving slightly different h's and therefore slightly different errors? If it's simply that, the acceptable error tolerance is way too close to the computed error and should be increased. But if that were the case, it would also be nice to try to guarantee reproducible results (probably impossible with our MPI scheme, though).
I definitely remember seeing it fail with both. I don't have any evidence any more though due to a rebase...
The MPI is always going to be a bit non-deterministic due to the load balancing, so small fluctuations have to be expected. I don't know how small though, and ~1% seems on the high side to me so I suspect that there might be a bug somewhere.
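For context, a minimal, self-contained illustration of why a load-balance-dependent summation order alone gives run-to-run differences (the values are arbitrary and unrelated to GANDALF's actual numbers):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Self-contained illustration (arbitrary values, nothing to do with GANDALF's
// actual numbers): floating-point addition is not associative, so summing the
// same terms in a different, decomposition-dependent order changes the result.
int main()
{
  std::vector<double> terms = {1.0e16, 1.0, -1.0e16, 1.0, 1.0e-3, -1.0};

  double forward = 0.0;
  for (double x : terms) forward += x;

  std::reverse(terms.begin(), terms.end());
  double reversed = 0.0;
  for (double x : terms) reversed += x;

  // The two sums differ, even though mathematically they are identical.
  std::printf("forward = %.17g, reversed = %.17g\n", forward, reversed);
  return 0;
}
```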
Let's merge all of these optimizations from the investigations for the paper. I'm going to rebase this (if possible) and roll back the hotfix for MPI...