SWIFTSIM / SWIFT

Modern astrophysics and cosmology particle-based code. Mirror of gitlab developments at https://gitlab.cosma.dur.ac.uk/swift/swiftsim
http://www.swiftsim.com
GNU Lesser General Public License v3.0

MPI issue at start #26

Closed FHusko closed 2 years ago

FHusko commented 2 years ago

Hi, I have been trying a custom hydrostatic halo run that seems to fail over MPI (but runs happily with a single node) at the beginning even with simple physics. The error I get is:

[0001] [00009.4] scheduler.c:scheduler_addunlock():106: Unlocking task is NULL.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
[0001] [00009.4] scheduler.c:scheduler_addunlock():106: Unlocking task is NULL.
[0001] [00009.4] scheduler.c:scheduler_addunlock():106: Unlocking task is NULL.
[0001] [00009.4] scheduler.c:scheduler_addunlock():106: Unlocking task is NULL.
[0001] [00009.4] scheduler.c:scheduler_addunlock():106: Unlocking task is NULL.
[0001] [00009.4] scheduler.c:scheduler_addunlock():106: Unlocking task is NULL.
[0001] [00009.4] scheduler.c:scheduler_addunlock():106: Unlocking task is NULL.
[0001] [00009.4] scheduler.c:scheduler_addunlock():106: Unlocking task is NULL.

The failure happens at

[0000] [00006.2] engine_init_particles: Setting particles to a valid state...
[0000] [00006.3] engine_init_particles: Computing initial gas densities and approximate gravity.
[0000] [00006.3] space_rebuild: (re)building space

The configure command is --with-subgrid=EAGLE-XL --with-hydro=sphenix --with-kernel=wendland-C2 --enable-debug --enable-debugging-checks --disable-optimization

and the submission script

#!/bin/bash
#SBATCH -J swift
#SBATCH -N 2
#SBATCH --tasks-per-node=2
#SBATCH -o outFile.out
#SBATCH -e errFile.err
#SBATCH -p cosma6
#SBATCH -A dp004
#SBATCH -t 72:00:00

module unload gnu_comp intel_comp intel_mpi ucx parmetis parallel_hdf5 fftw gsl llvm

module load intel_comp/2021.1.0 compiler
module load intel_mpi/2018
module load ucx/1.10.1
module load fftw/3.3.9cosma7 # or fftw/3.3.9 on cosma 5 & 6
module load parallel_hdf5/1.10.6 parmetis/4.0.3-64bit gsl/2.5

mpirun -np 4 /cosma/home/durham/dc-husk1/SWIFT_spin_bh_new/swiftsim/examples/swift_mpi --hydro --temperature --threads=8 --limiter --sync --pin  isolated_galaxy.yml

As you can see, this is even without gravity. I have tried the example hydrostatic halo setup supplied with the code, that one worked with 2 nodes. So I'm thinking that this has to be related to the initial conditions. But the confusing thing is that the same setup worked with an older version of SWIFT (10 months old). And the same setup works on 1 node with this version.

Would appreciate any help with this. Thanks!

bwvdnbro commented 2 years ago

Sounds like a badly handled corner case. We did introduce some extra dependencies recently, so that could explain why the older version works.

Any chance you could run this through a debugger and send me a stack trace for the point where it crashes (you can set a breakpoint on scheduler.c:106)?

Or could you make the code change in the patch below and let me know what the error message becomes? unlock_null.txt

FHusko commented 2 years ago

Hi Bert, this is the new error message:

[0003] [00010.3] scheduler.c:scheduler_addunlock():111: Unlocking task is NULL (task unlocks send/tend).

Thanks for being on the case!

bwvdnbro commented 2 years ago

@MatthieuSchaller looks like this is the only place where scheduler_addunlock() could be called with a first argument that is NULL and a second argument that is a send/tend: https://github.com/SWIFTSIM/swiftsim/blob/67e0fdc56c335fb75fddb33435f8630f5a5ea74b/src/engine_maketasks.c#L3961

Is this a realistic scenario, or does that mean something else went wrong? Is ci->timestep_collect guaranteed to exist?

MatthieuSchaller commented 2 years ago

I wonder whether there is something wrong in the case where some top-level cells (TLCs) are completely empty, in which case timestep_collect could have been missed. Filip's setup is a zoom(-ish), so maybe there is something I did not think of when changing the dt exchange.

FHusko commented 2 years ago

I don't know how it compares to a typical zoom, but the particle masses start growing as r^2 beyond 500 kpc, and the halo extends out to 6000 kpc. On top of that, the mass density in the setup falls as r^-1.5. So the number density of particles falls as r^-3.5 in total throughout most of the box.
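
As a quick check of that scaling, assuming pure power laws for both the mass density and the particle mass, the number density is the mass density divided by the particle mass:

```latex
n(r) \propto \frac{\rho(r)}{m(r)} \propto \frac{r^{-1.5}}{r^{2}} = r^{-3.5}
```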

MatthieuSchaller commented 2 years ago

You can try the branch Filip_fix in the gitlab. It should now work. If you confirm it does, then I'll clean it up and make it a permanent change.

FHusko commented 2 years ago

It does indeed! Have tested it out on the smaller test problem, and have also begun the actual larger run (36 nodes) which prompted this in the first place. That one works too.

Thanks again!

MatthieuSchaller commented 2 years ago

Great, thanks for checking. I'll write this up as a proper clean fix and we'll merge it into the main code.

MatthieuSchaller commented 2 years ago

I have now pushed a cleaner version. Just to be extra safe, could you pull the latest version of this branch and test it once more? If it starts smoothly that will be enough.

Thanks!

FHusko commented 2 years ago

I applied the most recent changes. I now get the following error with the test case:

[0000] [00167.9] stars_spart_has_no_neighbours: WARNING: Star particle with ID 1000008101 treated as having no neighbours (h: 225, wcount: 0).
[0000] [00167.9] ./feedback/EAGLE_thermal/feedback.h:feedback_prepare_feedback():211: Evolving a star particle that should not!

This happens in the initial fake time step. There are some stellar particles in the initial conditions which are probably prompting this error. I got the error on Friday with the earlier set of changes which you had made. I don't know how I managed to get it to run earlier, since I remember the run did happily go for around 20 minutes. Possibly it was because I didn't turn on the --stars and --feedback options with the earlier test run. Also this may not have anything to do with the changes you had made; it could be from some change along the way (I was using a year-old version of SWIFT).

If you don't have a clear idea where this is coming from, I could try my setup with the current master as well, just to see if this is related to the latest changes.

MatthieuSchaller commented 2 years ago

mmmh.... Both should be unrelated to the changes here.

The first message is probably because there is a star somewhere far from everything and limited by h_max. That's a relatively new warning. But the code should survive.

I guess the problem here is that the setup is problematic in terms of some of the stars. Can you print out more information about that star and see whether it is indeed at a strange place?
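
A minimal sketch of such a check, assuming the star IDs and coordinates have already been read from the IC file into plain Python sequences. The function name `far_particles`, the centre, and the sample positions are made up for illustration; only the limit of 225 echoes the `h` value in the warning.

```python
import math


def far_particles(ids, positions, centre, r_limit):
    """Return (id, distance) pairs for particles farther than r_limit from centre."""
    flagged = []
    for pid, pos in zip(ids, positions):
        d = math.dist(pos, centre)  # Euclidean distance (Python 3.8+)
        if d > r_limit:
            flagged.append((pid, d))
    return flagged


# Illustrative values only, not taken from the actual ICs.
centre = (0.0, 0.0, 0.0)
ids = [1, 2]
positions = [(10.0, 0.0, 0.0), (12000.0, 0.0, 0.0)]

outliers = far_particles(ids, positions, centre, r_limit=225.0)
for pid, d in outliers:
    print(f"particle {pid} is {d:.1f} length units from the centre")
```

Any particle listed this way lies farther from the centre than the chosen limit and is a candidate for the "no neighbours" warning.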

FHusko commented 2 years ago

Ah, yes, there are some stars that are placed very far away from the centre. This was one of them (at a dozen Mpc away from the centre of mass, according to a check I did now). This happens because the script which creates the stellar bulge places stars by drawing random numbers from a distribution, without any maximum radius.

I'll try cutting off the stars at around 100 kpc or something similar, see if I get the same thing.
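
A minimal sketch of such a cut-off using rejection sampling. The exponential profile, the `scale_radius` value, and the function name are placeholders, since the actual bulge script is not shown here; only the 100 kpc limit comes from the comment above.

```python
import random


def sample_truncated_radii(n, scale_radius, r_max, seed=0):
    """Draw n radii from an exponential profile, rejecting draws beyond r_max."""
    rng = random.Random(seed)
    radii = []
    while len(radii) < n:
        # Unbounded draw: placeholder for whatever profile the real script uses.
        r = rng.expovariate(1.0 / scale_radius)
        if r <= r_max:  # keep only stars inside the cut-off radius
            radii.append(r)
    return radii


# Hypothetical numbers in kpc; only r_max = 100 is taken from the thread.
radii = sample_truncated_radii(n=1000, scale_radius=50.0, r_max=100.0)
print(len(radii), max(radii) <= 100.0)
```

Rejecting and redrawing keeps the shape of the profile inside the cut-off, whereas clipping large draws to `r_max` would pile stars up at the boundary.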

MatthieuSchaller commented 2 years ago

Can you give the far-away stars an age of -1?

FHusko commented 2 years ago

That's the odd thing: the stars in the ICs should have a birth time of -1. Could it be that overwrite_birth_time: 1 and birth_time: -1 in the stars section of the parameter file are not enough with the newer version of SWIFT? I don't see any other parameters that could affect this.

MatthieuSchaller commented 2 years ago

No, that hasn't changed. Maybe there is something fishy in the logic.

That should be the same without MPI however.

FHusko commented 2 years ago

Yes, the error happens without MPI too with the newer version of SWIFT.

MatthieuSchaller commented 2 years ago

Can you show me the full log to that point?

FHusko commented 2 years ago

Here it is:

====
Starting job 4900398 at Mon 21 Feb 16:42:42 GMT 2022 for user dc-husk1.
Running on nodes: m6320
====
 Welcome to the cosmological hydrodynamical code
    ______       _________________
   / ___/ |     / /  _/ ___/_  __/
   \__ \| | /| / // // /_   / /
  ___/ /| |/ |/ // // __/  / /
 /____/ |__/|__/___/_/    /_/
 SPH With Inter-dependent Fine-grained Tasking

 Version : 0.9.0
 Revision: v0.9.0-844-g2b88a89d-dirty, Branch: master, Date: 2022-02-04 10:50:57 +0000
 Webpage : www.swiftsim.com

 Config. options: '--with-subgrid=EAGLE-XL --with-hydro=sphenix --with-kernel=wendland-C2 --with-ext-potential=nfw --enable-fixed-boundary-particles=2 --with-parmetis --with-tbbmalloc --disable-optimization --enable-debug --enable-debugging-checks'

 Compiler: ICC, Version: 20.21.20201112
 CFLAGS  : '-g -O0  -debug inline-debug-info -pthread -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -w2 -Wunused-variable -Wshadow -Werror -Wstrict-prototypes'

 HDF5 library version     : 1.10.6
 FFTW library version     : 3.x (details not available)
 GSL library version      : 2.5

[00000.0] main: CPU frequency used for tick conversion: 2599962789 Hz
[00000.0] main: Running on: m6320.pri.cosma7.alces.network
[00000.0] main: WARNING: Debugging checks activated. Code will be slower !
[00000.0] main: sizeof(part)        is  320 bytes.
[00000.0] main: sizeof(xpart)       is  160 bytes.
[00000.0] main: sizeof(sink)        is   96 bytes.
[00000.0] main: sizeof(spart)       is  448 bytes.
[00000.0] main: sizeof(bpart)       is 4128 bytes.
[00000.0] main: sizeof(gpart)       is  152 bytes.
[00000.0] main: sizeof(multipole)   is  200 bytes.
[00000.0] main: sizeof(grav_tensor) is  168 bytes.
[00000.0] main: sizeof(task)        is   96 bytes.
[00000.0] main: sizeof(cell)        is 1376 bytes.
[00000.0] main: Reading runtime parameters from file 'isolated_galaxy.yml'
[00000.1] output_options_init: Reading select output parameters from file 'param_list.yml'
[00000.1] io_prepare_output_fields: WARNING: Trying to change behaviour of field 'Default:FOFGroupIDs_Gas' (read from 'param_list.yml') that does not exist. This may be because you are not running with all of the physics that you compiled the code with.
[00000.1] io_prepare_output_fields: WARNING: Trying to change behaviour of field 'Default:VELOCIraptorGroupIDs_Gas' (read from 'param_list.yml') that does not exist. This may be because you are not running with all of the physics that you compiled the code with.
[00000.1] main: Internal unit system: U_M = 1.988480e+43 g.
[00000.1] main: Internal unit system: U_L = 3.085660e+21 cm.
[00000.1] main: Internal unit system: U_t = 3.085660e+16 s.
[00000.1] main: Internal unit system: U_I = 1.000000e+00 A.
[00000.1] main: Internal unit system: U_T = 1.000000e+00 K.
[00000.1] phys_const_print:    Gravitational constant = 4.301093e+04
[00000.1] phys_const_print:            Speed of light = 2.997925e+05
[00000.1] phys_const_print:           Planck constant = 1.079908e-96
[00000.1] phys_const_print:        Boltzmann constant = 6.943238e-70
[00000.1] phys_const_print:     Thomson cross-section = 6.986924e-68
[00000.1] phys_const_print:             Electron-Volt = 8.057293e-66
[00000.1] phys_const_print:               Proton mass = 8.411560e-68
[00000.1] phys_const_print:                      Year = 1.022696e-09
[00000.1] phys_const_print:         Astronomical Unit = 4.848164e-09
[00000.1] phys_const_print:                    Parsec = 1.000006e-03
[00000.1] phys_const_print:                Solar mass = 9.999648e-11
[00000.1] phys_const_print:    H_0 / h = 100 km/s/Mpc = 9.999943e-02
[00000.1] phys_const_print:                    T_CMB0 = 2.725500e+00
[00000.4] feedback_props_init: Feedback model is EAGLE (EAGLE)
[00000.4] feedback_props_init: Feedback energy fraction min=1.000000, max=1.000000
[00000.4] feedback_props_init: Feedback energy fraction powers: n_n=0.868600, n_Z=0.868600
[00000.4] feedback_props_init: Feedback energy fraction widths: s_n=0.499994, s_Z=0.499994
[00000.4] feedback_props_init: Feedback energy fraction pivots: Z_0=0.001266, n_0_cgs=1.458800
[00012.5] read_cooling_tables: Done reading in general cooling table
[00012.5] cooling_print_backend: Cooling function is 'COLIBRE'.
[00012.5] starformation_print_backend: Star formation model is EAGLE
[00012.5] starformation_print_backend: Density threshold uses subgrid quantities
[00012.5] starformation_print_backend: Particles are star-forming if their properties obey (T_sub < 1.000000e+03 K OR (T_sub < 3.162200e+04 K AND n_H,sub > 1.000000e+01 cm^-3))
[00012.5] starformation_print_backend: Star formation law is a pressure law (Schaye & Dalla Vecchia 2008):
[00012.5] starformation_print_backend: With properties: normalization = 1.515000e-04 Msun/kpc^2/yr, slope of theKennicutt-Schmidt law = 1.400000e+00 and gas fraction = 1.000000e+00
[00012.5] starformation_print_backend: At densities of 1.000000e+03 H/cm^3 the slope changes to 2.000000e+00.
[00012.5] starformation_print_backend: Running with a direct conversion density of: 3.402823e+38 #/cm^3
[00012.5] chemistry_print_backend: Chemistry model is 'EAGLE' tracking 9 elements.
[00012.5] main: Reading ICs from file 'ICs.hdf5'
[00012.5] io_read_unit_system: Reading IC units from ICs.
[00012.5] read_ic_single: Conversion needed from:
[00012.5] read_ic_single: (ICs) Unit system: U_M =      1.988480e+43 g.
[00012.5] read_ic_single: (ICs) Unit system: U_L =      3.085678e+21 cm.
[00012.5] read_ic_single: (ICs) Unit system: U_t =      3.085678e+16 s.
[00012.5] read_ic_single: (ICs) Unit system: U_I =      1.000000e+00 A.
[00012.5] read_ic_single: (ICs) Unit system: U_T =      1.000000e+00 K.
[00012.5] read_ic_single: to:
[00012.5] read_ic_single: (internal) Unit system: U_M = 1.988480e+43 g.
[00012.5] read_ic_single: (internal) Unit system: U_L = 3.085660e+21 cm.
[00012.5] read_ic_single: (internal) Unit system: U_t = 3.085660e+16 s.
[00012.5] read_ic_single: (internal) Unit system: U_I = 1.000000e+00 A.
[00012.5] read_ic_single: (internal) Unit system: U_T = 1.000000e+00 K.
[00012.5] ic_info_read_hdf5: Metadata group ICs_parameters not found in ICs file
[00029.4] main: Reading initial conditions took 16905.954 ms.
[00030.6] part_verify_links: All links OK
[00030.6] part_verify_links: took 962.443 ms.
[00030.6] main: Read 13035246 gas particles, 0 sink particles, 93750 star particles, 1 black hole particles, 0 DM particles, 0 DM background particles, and 0 neutrino DM particles from the ICs.
[00030.6] space_init: Imposing a star smoothing length of 1.050000e+00
[00032.0] space_regrid: (re)griding space cdim=(8 8 8)
[00032.4] main: space_init took 1772.770 ms.
[00032.5] potential_print_backend: External potential is 'NFW' with properties are (x,y,z) = (2.054982e+04, 2.054982e+04, 2.054982e+04), scale radius = 5.157065e+02 timestep multiplier = 1.500000e-02, mintime = 6.672009e-04
[00032.5] potential_print_backend: Properties of the halo M200 = 1.000000e+05, R200 = 2.062826e+03, c = 4.000000e+00
[00032.5] main: space dimensions are [ 41099.648 41099.648 41099.648 ].
[00032.5] main: space isn't periodic.
[00032.5] main: highest-level cell dimensions are [ 8 8 8 ].
[00032.5] main: 13035246 parts in 512 cells.
[00032.5] main: 13128997 gparts in 512 cells.
[00032.5] main: 0 sinks in 512 cells.
[00032.5] main: 93750 sparts in 512 cells.
[00032.5] main: 1 bparts in 512 cells.
[00032.5] main: maximum depth is 0.
[00032.5] engine_init: took 0.324 ms.
[00032.5] engine_config: Running simulation 'IsolatedGalaxy-EAGLE-Ref'.
[00032.5] engine_config: prefer NUMA-distant CPUs
[00032.5] engine_init: cpu map is [ 0 8 1 9 2 10 3 11 4 12 5 13 6 14 7 15 16 24 17 25 18 26 19 27 20 28 21 29 22 30 23 31 ].
[00034.0] engine_policy: engine policies are [  'steal'  'keep'  'numa affinity'  'hydro'  'self gravity'  'external gravity'  'cooling'  'stars'  'star formation'  'feedback'  'black holes'  'time-step limiter'  'time-step sync'  ]
[00034.0] eos_print: Equation of state: Ideal gas.
[00034.0] eos_print: Adiabatic index gamma: 1.666667.
[00034.0] pressure_floor_print: Pressure floor is 'none'
[00034.0] hydro_props_print: Hydrodynamic scheme: SPHENIX (Borrow+ 2020) in 3D.
[00034.0] hydro_props_print: Hydrodynamic kernel: Wendland C2 with eta=1.234800 (57.27 neighbours).
[00034.0] hydro_props_print: Hydrodynamic relative tolerance in h: 0.00010 (+/- 0.0172 neighbours).
[00034.0] hydro_props_print: Hydrodynamic integration: CFL parameter: 0.2000.
[00034.0] hydro_props_print: Hydrodynamic integration: Max change of volume: 1.40 (max|dlog(h)/dt|=0.112157).
[00034.0] hydro_props_print: Neighbour number definition: Unweighted.
[00034.0] hydro_props_print: Maximal smoothing length allowed: 225.0000
[00034.0] hydro_props_print: Maximal time-bin difference between neighbours: 2
[00034.0] hydro_props_print: Minimal gas temperature set to 100.000000
[00034.0] hydro_props_print: No particle splitting
[00034.0] viscosity_print: Artificial viscosity parameters set to alpha: 0.100, max: 2.000, min: 0.000, length: 0.050.
[00034.0] diffusion_print: Artificial diffusion parameters set to alpha: 0.000, max: 1.000, min: 0.000, beta: 1.000.
[00034.0] entropy_floor_print: Entropy floor is 'EAGLE' with:
[00034.0] entropy_floor_print: Jeans limiter with slope n=1.333 at rho=3.286268e-07 (1.000000e-04 H/cm^3) and T=800.0 K
[00034.0] entropy_floor_print:  Cool limiter with slope n=1.000 at rho=3.286268e-08 (1.000000e-05 H/cm^3) and T=10.0 K
[00034.0] gravity_props_print: Self-gravity scheme: With per-particle softening
[00034.0] gravity_props_print: Self-gravity scheme: FMM-MM with m-poles of order 4
[00034.0] gravity_props_print: Self-gravity time integration: eta=0.0250
[00034.0] gravity_props_print: Self-gravity opening angle scheme:  fixed
[00034.0] gravity_props_print: Self-gravity opening angle:  theta_cr=0.7000
[00034.0] gravity_props_print: Self-gravity softening functional form: Wendland-C2
[00034.0] gravity_props_print: Self-gravity DM comoving softening: epsilon=3.150000 (Plummer equivalent: 1.050000)
[00034.0] gravity_props_print: Self-gravity DM maximal physical softening:    epsilon=3.150000 (Plummer equivalent: 1.050000)
[00034.0] gravity_props_print: Self-gravity baryon comoving softening: epsilon=3.150000 (Plummer equivalent: 1.050000)
[00034.0] gravity_props_print: Self-gravity baryon maximal physical softening:    epsilon=3.150000 (Plummer equivalent: 1.050000)
[00034.0] gravity_props_print: Self-gravity neutrino DM comoving softening: epsilon=0.000000 (Plummer equivalent: 0.000000)
[00034.0] gravity_props_print: Self-gravity neutrino DM maximal physical softening:    epsilon=0.000000 (Plummer equivalent: 0.000000)
[00034.0] gravity_props_print: Self-gravity mesh side-length: N=0
[00034.0] gravity_props_print: Self-gravity mesh smoothing-scale: a_smooth=0.000000
[00034.0] gravity_props_print: Self-gravity distributed mesh enabled: 0
[00034.0] gravity_props_print: Self-gravity tree cut-off ratio: r_cut_max=0.000000
[00034.0] gravity_props_print: Self-gravity truncation cut-off ratio: r_cut_min=0.000000
[00034.0] gravity_props_print: Self-gravity mesh truncation function: Gadget-like (using erfc())
[00034.0] gravity_props_print: Self-gravity tree update frequency: f=0.010000
[00034.0] stars_props_print: Stars kernel: Wendland C2 with eta=1.164200 (48.00 neighbours).
[00034.0] stars_props_print: Stars relative tolerance in h: 0.00700 (+/- 1.0150 neighbours).
[00034.0] stars_props_print: Stars integration: Max change of volume: 1.40 (max|dlog(h)/dt|=0.112157).
[00034.0] stars_props_print: Maximal iterations in ghost task set to 30
[00034.0] stars_props_print: Stars' birth time read from the ICs will be overwritten to -1.000000
[00034.0] stars_props_print: Stars' age threshold for unlimited dt: 0.000000e+00 [U_t]
[00034.0] stars_props_print: Stars' young/old age threshold: 1.022718e-02 [U_t]
[00034.0] stars_props_print: Max time-step size of young stars: 1.022718e-04 [U_t]
[00034.0] stars_props_print: Max time-step size of old stars: 1.022718e-03 [U_t]
[00034.0] engine_config: Absolute minimal timestep size: 1.110223e-16
[00034.0] engine_config: Minimal timestep size (on time-line): 7.105427e-15
[00034.0] engine_config: Maximal timestep size (on time-line): 7.812500e-03
[00034.3] engine_config: Restarts will be dumped every 4.000000 hours
[00034.3] engine_config: Using 8 threads in the thread-pool
[00034.3] engine_config: took 1832.496 ms.
[00034.3] main: Running on 13035246 gas particles, 0 sink particles, 93750 stars particles 1 black hole particles, 0 neutrino particles, and 0 DM particles (13128997 gravity particles)
[00034.3] main: from t=0.000e+00 until t=1.600e+01 with 1 ranks, 8 threads / rank and 8 task queues / rank (dt_min=1.000e-14, dt_max=1.000e-02)...
[00034.3] engine_init_particles: Setting particles to a valid state...
[00035.2] engine_init_particles: Computing initial gas densities and approximate gravity.
[00035.2] space_rebuild: (re)building space
[00279.8] engine_init_particles: Converting internal energy variable.
[00280.1] engine_init_particles: Running initial fake time-step.
[00280.1] space_rebuild: (re)building space
[00466.6] ./feedback/EAGLE_thermal/feedback.h:feedback_prepare_feedback():211: Evolving a star particle that should not!
/var/slurm/slurmd/job4900398/slurm_script: line 37: 19234 Aborted                 /cosma/home/durham/dc-husk1/SWIFT_spin_bh_new/swiftsim/examples/swift --stars --star-formation --feedback --external-gravity --self-gravity --hydro --cooling --black-holes --threads=8 --limiter --sync --pin isolated_galaxy.yml

MatthieuSchaller commented 2 years ago

Can you point me to the example on cosma? Or, better, to a smaller one that has the same issue?

FHusko commented 2 years ago

It turns out my changes to the black hole physics were incorrectly implemented in the newer version. Now that I have corrected them, the error no longer appears. The weird thing was that the error was related to thermal feedback from stars, and none of my changes to the black hole physics touch stars at all, so it hadn't occurred to me that they could be the cause.

But sorry for wasting your time with that! The big run is now going happily. With intel_mpi/2020 I was getting an inexplicable MPI error after 2-3 days. I am now using intel_mpi/2018 and the newer version of SWIFT. Hopefully that will avoid the problem. But if I get it again, should I report it here or open a new issue?

MatthieuSchaller commented 2 years ago

Ok, so everything solved?

intel_mpi/2020 is buggy, so that's not a SWIFT problem. Not much we can do about it, unfortunately. On cosma, use either intel_mpi/2018 or Open MPI 4.x.y.

FHusko commented 2 years ago

Yes, everything solved in terms of this problem. Thanks!