SWIFTSIM / SWIFT

Modern astrophysics and cosmology particle-based code. Mirror of gitlab developments at https://gitlab.cosma.dur.ac.uk/swift/swiftsim
http://www.swiftsim.com
GNU Lesser General Public License v3.0

Stuck in EAGLE 50 #12

Closed. JanKleine closed this issue 5 years ago

JanKleine commented 5 years ago

I was trying to run the EAGLE 50 example (swiftsim/examples/EAGLE_low_z/EAGLE_50) as part of preparing for a competition. All tests pass, but the simulation seems to be stuck before it even starts properly (CPU utilisation drops to 1-2 cores per rank, and nothing changes after running like this for 12 hours). I used --cosmology --hydro --self-gravity --stars --threads=16 -n 64 --pin.
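For reference, the launch looked roughly like this (the exact mpirun invocation, binary name, and parameter-file name are approximated from memory; the 4 ranks and 16 threads match the log below):

mpirun -np 4 ./swift_mpi \
    --cosmology --hydro --self-gravity --stars \
    --threads=16 -n 64 --pin \
    eagle_50.yml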

[...]
[0000] [00141.5] engine_config: Absolute minimal timestep size: 6.613473e-19
[0000] [00120.4] engine_config: Minimal timestep size (on time-line): 8.189315e-11
[0000] [00120.4] engine_config: Maximal timestep size (on time-line): 5.366949e-06
[0000] [00121.7] engine_config: Restarts will be dumped every 6.000000 hours
[0000] [00121.7] main: engine_init took 1279.084 ms.
[0000] [00121.7] main: Running on 404421250 gas particles, 20786477 stars particles and 425259008 DM particles (850466735 gravity particles)
[0000] [00121.7] main: from t=1.276e-02 until t=1.413e-02 with 4 ranks, 16 threads / rank and 16 task queues / rank (dt_min=1.000e-10, dt_max=1.000e-05)...
[0000] [00164.0] engine_init_particles: Setting particles to a valid state...
[0000] [00165.2] engine_init_particles: Computing initial gas densities.
[0000] [00270.9] engine_init_particles: Converting internal energy variable.
[0000] [00271.4] engine_init_particles: Running initial fake time-step.
#   Step           Time Scale-factor     Redshift      Time-step Time-bins      Updates    g-Updates    s-Updates  Wall-clock time [ms]  Props

full output

Sometimes it gets stuck earlier (at "main: from t=1.276e-02 until t=1.413e-02 with 4 ranks..."), but I've never gotten it further than what is shown above.

What am I doing wrong?

MatthieuSchaller commented 5 years ago

Hi, what MPI version and fabric are you using? We have seen some implementations not behaving correctly.

JanKleine commented 5 years ago

I'm using OpenMPI 3.1.3 with InfiniBand. Thanks for the quick response.

MatthieuSchaller commented 5 years ago

Ok. So that seems fine. And what version of the OFED driver are you using? The regular Linux-kernel one or the Mellanox-optimised one?
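One quick way to check, assuming the Mellanox stack is installed and provides the ofed_info utility (the in-kernel drivers normally do not ship it):

ofed_info -s

That prints the MLNX_OFED version string if the Mellanox-optimised stack is in use.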

MatthieuSchaller commented 5 years ago

Also, what transport library are you using in OpenMPI?

We recommend psm and not psm2. That is, running with --mca btl vader,self --mca mtl psm.
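For your run that would be something along the lines of the following (rank count, binary name, and parameter file are taken from your log and are approximate):

mpirun -np 4 --mca btl vader,self --mca mtl psm \
    ./swift_mpi --cosmology --hydro --self-gravity --stars \
    --threads=16 -n 64 --pin eagle_50.yml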

JanKleine commented 5 years ago

> Ok. So that seems fine. And what version of the OFED driver are you using? The regular Linux-kernel one or the Mellanox-optimised one?

I think the Mellanox-optimised version.

> We recommend psm and not psm2. That is, running with --mca btl vader,self --mca mtl psm.

I will try that.

MatthieuSchaller commented 5 years ago
> Ok. So that seems fine. And what version of the OFED driver are you using? The regular Linux-kernel one or the Mellanox-optimised one?
>
> I think the Mellanox-optimised version.

Right, then that is likely the issue. Their current driver hangs if too many asynchronous communications are in-flight at a given point in time. SWIFT makes extensive use of this mechanism, so you may be facing this issue here.

JanKleine commented 5 years ago

Would removing the Mellanox driver fix the issue?

MatthieuSchaller commented 5 years ago

I can only speculate, as I have never seen this issue on machines where we have control over the setup, but it may help.

Otherwise, trying a different mtl in OpenMPI might help (instead of changing the driver).
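You can see which mtl components your OpenMPI build actually provides with ompi_info, for instance (grep pattern approximate):

ompi_info | grep "MCA mtl"

and then pick one explicitly with --mca mtl <name> on the mpirun line.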

JanKleine commented 5 years ago

> Right, then that is likely the issue. Their current driver hangs if too many asynchronous communications are in-flight at a given point in time. SWIFT makes extensive use of this mechanism, so you may be facing this issue here.

I seem to get the same problem on a system with the regular driver.

MatthieuSchaller commented 5 years ago

Did you try changing the mtl to psm?

MatthieuSchaller commented 5 years ago

Hi Jan,

Have you had any luck with the code?

JanKleine commented 5 years ago

I'm having some trouble: the OpenMPI version I'm using apparently doesn't support psm, and installing a version with psm support is a little problematic at the moment, but I'm still working on it.
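The rebuild I'm attempting is roughly the following (the install prefix is a placeholder, and --with-psm of course needs the PSM headers and libraries to be present on the system):

./configure --prefix=$HOME/openmpi-3.1.3-psm --with-psm
make -j 16 && make install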

JanKleine commented 5 years ago

I'm also using Slurm as the job scheduler; I forgot to mention that earlier. I hope that isn't interfering with anything.
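For completeness, the job is submitted with a batch script along these lines (the directives are approximate and module setup is omitted):

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
mpirun -np 4 ./swift_mpi --cosmology --hydro --self-gravity --stars \
    --threads=16 -n 64 --pin eagle_50.yml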

JanKleine commented 5 years ago

> Did you try changing the mtl to psm?

~~Using psm did not seem to resolve the issue.~~

Edit: I made another mistake while running. Using psm does seem to resolve the problem.