Hi
As stated in the error message, this should never happen... Did the simulation that worked correctly with the same input file have a different parallel setup? A different number of MPI ranks or OpenMP threads?
Could you please share your input file so that I can run a few tests on my side? I assume you are using the latest version of the master branch of this GitHub repo?
A list of things you can try; let me know if any of them improves things:

1. Use smaller patches, so that each MPI rank holds many of them.
2. Use fewer MPI ranks and more OpenMP threads.
3. Disable the dynamic load balancing.
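For example, option 2 on a SLURM machine could look like the sketch below (the node geometry and executable name are placeholders, not taken from your setup):

```
# 2 MPI ranks per node, 14 OpenMP threads per rank (assumes 28-core nodes)
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=14

export OMP_NUM_THREADS=14     # threads per MPI rank
export OMP_SCHEDULE=dynamic   # dynamic scheduling generally helps PIC load balance

srun ./smilei my_input.py
```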
Dear Arnaud,
Thank you very much for the quick answer. I will follow your suggestions.
To answer your first questions: I have used the same parallel setup in both cases, and the master branch. This is my input:
```python
import math
import numpy as np

l0 = 2.0 * math.pi        # wavelength in normalized units
t0 = l0                   # optical cycle in normalized units
rest = 1608.0             # nb of timesteps in 1 optical cycle
resx = 1600.0             # nb of cells in 1 wavelength
resy = 100.0              # nb of cells in 1 wavelength
tc = 2. * math.pi         # time center for the laser at the boundary
L0 = 3.0 * l0             # box length
Ly0 = 8. * np.pi          # box width
N_patch = 64

xfwhm_e = 0.326943431266 / 4.   # FWHM length of the electron beam
vm = 0.95                       # electron beam velocity
xc = 2. * xfwhm_e + 2.          # distance to hold from beam
xi = 2. / 3. * L0               # interaction point between mirror and electron beam
xmp = 1.4
xmr = 0.1 * xmp
n_e = 14.15 * 4.                # peak electron density
gamma_e = 20.                   # electron gamma factor

vm = np.sqrt(gamma_e**2 - 1) / gamma_e
r_e = np.pi                     # el. beam FWHM width
time_plasma_frozen = tc + L0 - xi - xc / vm + (0.6 + 0.5) / vm

E0 = 1.
w0 = np.pi
zr = w0**2 / 2.
fm = 0.  # np.pi  # distance between mirror and focal plane

Main(
    geometry = "2Dcartesian",
    interpolation_order = 2,  # only 2 available
    cell_length = [l0/resx, l0/resy],
    grid_length = [L0, Ly0],
    number_of_patches = [N_patch, 2],
    timestep = t0/rest,
    simulation_time = t0/rest * 7001,
    EM_boundary_conditions = [
        ['silver-muller'],
        ['silver-muller'],
    ],
    random_seed = smilei_mpi_rank,
    print_every = int(rest/2.0),
    solve_poisson = False
)

LoadBalancing(
    initial_balance = True,
    every = 20,
    cell_load = 1.,
    frozen_particle_load = 0.1
)

Species(
    name = "electrons",
    time_frozen = time_plasma_frozen,
    position_initialization = "random",
    momentum_initialization = "cold",
    particles_per_cell = 100,
    mass = 1.,
    atomic_number = None,
    number_density = gaussian(max=n_e, xfwhm=xfwhm_e, xcenter=xi-xc, xorder=2,
                              yfwhm=r_e, ycenter=Ly0/2., yorder=2),
    #charge_density = 3.5/np.sqrt(1-0.95**2),
    charge = -1.,
    mean_velocity = [vm, 0., 0.],
    boundary_conditions = [
        ["remove", "remove"],
        ["remove", "remove"],
        # ["periodic", "periodic"],
    ],
    # thermal_boundary_temperature = None,
    # thermal_boundary_velocity = None,
    # ionization_model = "none",
    # ionization_electrons = None,
    is_test = False,
    #c_part_max = 1.0,
    pusher = "boris",
)

Species(
    name = 'mirror_eon',
    position_initialization = 'random',
    momentum_initialization = 'cold',
    ionization_model = 'none',
    particles_per_cell = 100,
    mass = 1.0,
    charge = -1.0,
    number_density = trapezoidal(10000., xvacuum=xi - xmp - 2.*xmr, xslope1=xmr,
                                 xplateau=xmp, xslope2=xmr, yvacuum=Ly0/10.,
                                 yslope1=Ly0/10., yplateau=0.6*Ly0, yslope2=Ly0/10.),
    time_frozen = time_plasma_frozen,
    boundary_conditions = [
        ["remove", "remove"],
        ["remove", "remove"],
        # ["periodic", "periodic"],
        # ["periodic", "periodic"],
    ],
)

LaserGaussian2D(
    box_side = "xmax",
    a0 = E0,
    omega = 1.,
    focus = [L0-xi+fm, Ly0/2.],
    waist = w0,
    incidence_angle = 0.,
    polarization_phi = np.pi/2.,
    ellipticity = 0.,
    time_envelope = tgaussian(start=0., fwhm=np.pi/np.sqrt(2.), center=tc)
)

MovingWindow(
    time_start = tc + L0 - xi + tc / 2.,
    velocity_x = 1.,
)

DiagScalar( every = 10 )

DiagFields(  #0
    every = 100,
    flush_every = 1000,
    # 'Bz_m' appeared twice in the original list; 'Bx_m' is presumably intended
    fields = ['Ex','Ey','Ez','Bx_m','By_m','Bz_m','Jx','Jy','Jz','Rho_electrons','Rho_mirror_eon']
)
```
Moreover, I have used the following setup: a SLURM batch script with #SBATCH --threads-per-core=2 (i.e., with hyperthreading enabled).
You're using hyperthreading. We strongly advise against this, although it won't explain the crash.
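On SLURM clusters, hyperthreading can usually be avoided with something like the following (a sketch; --hint=nomultithread is a generic SLURM option, not specific to this thread):

```
#SBATCH --threads-per-core=1   # one hardware thread per physical core
#SBATCH --hint=nomultithread   # ask SLURM not to place tasks on hyperthreads
```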
Thanks for sharing, I'll try to get back to you.
Something is not right in your input file, in the number density of the mirror electrons: xvacuum=xi - xmp - 2.xmr. Do you mean 2.*xmr?
Both #SBATCH --threads-per-core=2 and #SBATCH --threads-per-core=1 give the error sometimes.
Yes, I mean 2.*xmr. The editor has removed all the * signs, as if spaces were missing around them.
OK, according to my few basic tests, my initial feeling seems to be confirmed. You are running the code with very few patches per MPI rank and a very unbalanced setup. You end up with a single patch per MPI rank, or the code tries to exchange patches in such a way that this happens; the code is supposed to be able to handle that, but obviously does not in your case.
I'll try to understand and fix the code to avoid the crash. Nevertheless, understand that you will be in much better conditions using many more patches per MPI rank. That can easily be done by using smaller patches (option 1), by using fewer MPI ranks and more OpenMP threads (option 2), or both.
Also note that having a large discrepancy between the number of patches in each dimension is not recommended either.
The crash can be triggered either by the moving window or by the dynamic load balancing; that is why I suggested option 3.
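Concretely, options 1 and 3 only require small changes to the namelist. Here is a sketch based on your input file (the patch counts are an illustration; each component of number_of_patches should stay a power of 2 and divide the number of cells in that direction):

```python
# Option 1: smaller patches. The grid is 4800 x 400 cells, so
# number_of_patches = [64, 16] gives 75 x 25 cells per patch instead of
# 75 x 200, and also reduces the x/y discrepancy in patch counts.
Main(
    # ... all other parameters unchanged ...
    number_of_patches = [N_patch, 16],
)

# Option 3: disable dynamic load balancing by simply removing (or
# commenting out) the LoadBalancing(...) block from the namelist.
```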
I hope this helps!
Thank you very much! I will follow your suggestions. A few new simulations have already finished properly.
No news, so I assume good news and close the issue.
An additional safety measure has been implemented to mitigate this error and make the code more robust. It will be available in the next release. Nevertheless, this situation is sub-optimal, and users should try to avoid it by favoring OpenMP over MPI decomposition and by avoiding overly large patches.
Dear SMILEI community,
I have an error that appears on the Occigen cluster. During the run, the following message is printed in the output file:

4020/6501 1.5710e+01 1.1425e+03 ( 8.0868e+02 )
ERROR src/Patch/VectorPatch.cpp:1323 (createPatches) No patch to clone. This should never happen!

I ran two simulations with exactly the same input file, but one had the error and the other finished properly. Have you ever seen anything similar? Do you have an idea about what might be wrong?
This is the whole output file:
```
Reading the simulation parameters
HDF5 version 1.8.18
Python version 2.7.12
Parsing pyinit.py
Parsing v3.4-71-g2866a41-master
Parsing pyprofiles.py
Parsing THzEbeamAmp2dMirror_E0z1d0_tp2d35_wp3d14_nm10000_xmp1d4_gamma20_n28d3_tn0d1_tdm0d6_qneutr_100ppc_4.py
Parsing pycontrol.py
Calling python _smilei_check
Calling python _prepare_checkpoint_dir
[WARNING] Change patches distribution to hilbert
[WARNING] Particles cluster width set to : 5
Geometry: 2Dcartesian
Load Balancing:
Initializing MPI
OpenMP
Initializing the restart environment
Initializing moving window
Initializing particles & fields
Initializing Patches
Creating Diagnostics, antennas, and external fields
Applying external fields at time t = 0
Solving Poisson at time t = 0
Time in Poisson : 0.00
Initializing diagnostics
Running diags at time t = 0
Species creation summary
Memory consumption
Expected disk usage (approximate)
Cleaning up python runtime environement
Time-Loop started: number of time-steps n_time = 6501
1608/6501 6.2851e+00 1.1969e+02 ( 5.1892e+01 )
2412/6501 9.4267e+00 1.6992e+02 ( 5.0226e+01 )
3216/6501 1.2568e+01 3.3386e+02 ( 1.6394e+02 )
```