gandalfcode / gandalf

GANDALF (Graphical Astrophysics code for N-body Dynamics And Lagrangian Fluids)
GNU General Public License v2.0

Major slowdown with MPI #179

Open RoguePotato opened 6 years ago

RoguePotato commented 6 years ago

There is roughly a factor of 4-5 slowdown when using 32 cores across 2 nodes compared to 16 cores on a single node. The slowdown persists even with 128 cores across 8 nodes.

This consistently occurs for the Boss Bodenheimer test, the disc test and the freefall test and is independent of particle number. The issue also occurs for custom initial conditions with Nhydro = 10^7.

All tests were run on the DiRAC Complexity cluster at Leicester.

distamio commented 6 years ago

Hey guys, any updates on this issue? Do you see the same problem, or is it perhaps something we are doing wrong? It's quite a critical issue, as we can't really do high-resolution simulations at the moment.

rbooth200 commented 6 years ago

Hi! So this doesn't agree with the tests we did for the paper, but the MPI scaling is known to be not great. We are going to look into this with the help of DiRAC in the coming months; I'll keep you posted. Any more specific information about the issue would be greatly appreciated - i.e. is this due to load balancing (i.e. are many processors sitting idle), or is it something else?
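For reference, a minimal sketch (not GANDALF's actual instrumentation; `do_timestep_work` is just a stand-in) of the kind of per-rank timing that can distinguish load imbalance from other overheads - a large max/mean ratio means many ranks are waiting for the slowest one:

```cpp
#include <mpi.h>
#include <cmath>
#include <cstdio>

// Stand-in for one step of real work (e.g. a hydro or gravity step);
// a dummy loop here so the example compiles and runs on its own.
static double do_timestep_work(int rank)
{
  double s = 0.0;
  for (long i = 0; i < 5000000L*(rank + 1); ++i) s += std::sqrt(static_cast<double>(i));
  return s;
}

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Time the local work section on this rank only.
  double t0 = MPI_Wtime();
  volatile double dummy = do_timestep_work(rank);  // volatile stops the loop being optimised away
  (void) dummy;
  double tlocal = MPI_Wtime() - t0;

  // Gather slowest, fastest and total work time across all ranks.
  double tmin, tmax, tsum;
  MPI_Reduce(&tlocal, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
  MPI_Reduce(&tlocal, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  MPI_Reduce(&tlocal, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    printf("work time per rank: min %.3fs  max %.3fs  mean %.3fs\n", tmin, tmax, tsum/nranks);
    printf("imbalance (max/mean): %.2f\n", tmax*nranks/tsum);
  }

  MPI_Finalize();
  return 0;
}
```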

RoguePotato commented 6 years ago

I'll have to go and investigate for more in-depth details. Interestingly, the reported CPU resource usage differs considerably for OpenMP + MPI runs on multiple nodes. For example, for the Boss Bodenheimer test with 500,000 particles:

| Configuration | resources_used.cput | resources_used.mem | resources_used.vmem |
| --- | --- | --- | --- |
| OpenMP (1 node, 16 cores) | 01:52:37 | 603000kb | 1818968kb |
| OpenMP + MPI (1 node, 16 cores) | 01:42:33 | 693372kb | 1173264kb |
| OpenMP + MPI (2 nodes, 32 cores) | 00:20:06 | 1733500kb | 3126176kb |
| OpenMP + MPI (4 nodes, 64 cores) | 00:41:36 | 3585552kb | 5985768kb |

CPU usage drops dramatically going from 1 node to 2 nodes. Not sure if this helps; I'll try to investigate more.
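One way to turn these accounting numbers into something comparable is the usual CPU efficiency, cput / (ncores × walltime). A minimal sketch of that arithmetic is below; the walltime is not quoted above, so the value in the example is purely hypothetical and would need to be taken from the job record (e.g. `resources_used.walltime`):

```cpp
#include <cstdio>

// Convert an "hh:mm:ss" style time to seconds.
static long hms_to_seconds(int h, int m, int s) { return 3600L*h + 60L*m + s; }

// CPU efficiency ~ cput / (ncores * walltime). Values near 1 mean the cores
// were kept busy; a sharp drop on going to 2 nodes would point at idle cores
// (load imbalance or communication waits).
static double cpu_efficiency(long cput_s, int ncores, long walltime_s)
{
  return static_cast<double>(cput_s) / (static_cast<double>(ncores)*walltime_s);
}

int main()
{
  // Example: the 16-core OpenMP run above, with a hypothetical 10-minute walltime.
  long cput = hms_to_seconds(1, 52, 37);
  printf("efficiency = %.2f\n", cpu_efficiency(cput, 16, hms_to_seconds(0, 10, 0)));
  return 0;
}
```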

giovanni-rosotti commented 6 years ago

Hi Anthony and Dimitris, I am sorry to hear about the problems you're having, and apologies for not following up after this thread was opened! Anthony, to help us diagnose the problem, it would be useful to know which parameters you are using. Did you just take the default Boss Bodenheimer test and change the number of particles, or did you also change other parameters? In any case, as Richard said, we have just been given two months of engineering time from DiRAC to improve the MPI scaling. I hope things will improve dramatically over the coming months!

Cheers, Giovanni

RoguePotato commented 6 years ago

The parameter file I used for the previous comment was this, with only the particle number modified. However, using these initial conditions, I get a consistent segfault when, and only when, two or more nodes are utilised. Better than spurious segfaults, I suppose. I haven't tried debugging this on my machine yet.

There are bugs on the MPI side of things, and they may be architecture-specific (or compiler-specific, though that seems less likely). This is with the current branch, and it is consistent with every initial condition I have thrown at it.
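In case it helps with localising crashes like this on remote nodes, here is a minimal sketch (an assumption on my part, not anything already in GANDALF) of a per-rank SIGSEGV handler that tags a backtrace with the MPI rank before aborting the job:

```cpp
#include <mpi.h>
#include <csignal>
#include <cstdio>
#include <execinfo.h>   // backtrace(), backtrace_symbols_fd() - Linux/glibc only
#include <unistd.h>

static int g_rank = -1;

// On SIGSEGV, print the rank and a raw backtrace to stderr so the crash
// leaves a trace in the job output, then abort the whole MPI job.
// MPI_Abort is not strictly async-signal-safe, but is good enough for debugging.
static void segv_handler(int sig)
{
  void* frames[64];
  int n = backtrace(frames, 64);
  fprintf(stderr, "Rank %d caught signal %d, backtrace:\n", g_rank, sig);
  backtrace_symbols_fd(frames, n, STDERR_FILENO);
  MPI_Abort(MPI_COMM_WORLD, 1);
}

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &g_rank);
  std::signal(SIGSEGV, segv_handler);

  // ... normal simulation code would run here ...

  MPI_Finalize();
  return 0;
}
```

Compiling with `-g` makes the symbol names in the backtrace far more useful.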