Your error is probably related to the first warning: the CFL condition is not satisfied.
Concerning dumps, I will write something better soon. You must use the same input, except for the Checkpoints block.
I'm not sure it's due to the CFL condition, since I get the same warning for 512 MPI processes and that simulation has been running fine so far. The CFL warning only appeared today, after I enlarged the box in the x-direction while keeping the same time step. Last night, when I ran on 1024 MPI processes with the smaller x-size, there was no CFL warning but the stderr file had the same entries. I hope it's not connected with the HDF5 issue that we discussed in thread #144.
About restarting a simulation: one should have a Checkpoints block with the restart_dir option specified in order to load the simulation state from the previous run's dump? And if one doesn't use the Checkpoints block, how would the dump from the previous run be used?
The CFL warning means the timestep is too large, which typically crashes the code. Having this warning is a very bad sign unless you have very special conditions.
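For reference, in normalized units (c = 1) the 2D Yee-scheme limit is dt < 1/sqrt(1/dx^2 + 1/dy^2). A minimal sketch of keeping the timestep below it in the namelist; the cell sizes and the 5% safety margin are placeholder choices, not values from this thread, and only the Main parameters relevant here are shown:
import math

# Placeholder 2D cell sizes in normalized units -- not values from this thread.
dx, dy = 0.125, 0.125

# Yee-scheme CFL limit in normalized units (c = 1).
cfl_limit = 1.0 / math.sqrt(1.0 / dx**2 + 1.0 / dy**2)

Main(
    geometry = "2Dcartesian",
    cell_length = [dx, dy],
    timestep = 0.95 * cfl_limit,   # arbitrary 5% margin below the CFL limit
    # ... grid size, simulation time, boundary conditions, etc.
)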
For checkpoints, you need a first simulation with a Checkpoints block already defined, including some dump_step or dump_minutes; this ensures that the whole simulation state is stored at some point. The second simulation must use the same input, except that it requires restart_dir to point to the previous simulation's directory.
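For concreteness, a minimal sketch of this two-run workflow; the directory name is a placeholder and the dump settings are the ones that appear later in this thread:
# Run 1: write checkpoints periodically.
Checkpoints(
    dump_step = 5700,          # and/or dump_minutes = 240.
    exit_after_dump = False,
    keep_n_dumps = 2,
)

# Run 2: identical namelist, plus restart_dir pointing at run 1's directory.
Checkpoints(
    restart_dir = "../run1_directory",   # the simulation directory, not its checkpoints/ subfolder
    dump_step = 5700,                    # keep dumping so a further restart remains possible
    exit_after_dump = False,
    keep_n_dumps = 2,
)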
Thanks for clarifying the restart dump! About the CFL: as mentioned before, I attach the std files for the same simulation with an adjusted timestep. Now I get no CFL warning, but the simulation cannot run on 1024 MPI processes, while it runs fine on 512. The problem is that on 512 MPI processes it has been running for more than a week and still hasn't finished, so I wanted to launch it on 1024 processes to finish faster, but this error prevents me from doing so.
I'm trying to restart a simulation that was aborted. I ran it on 512 MPI processes and kept 2 dumps, and I want to restart from dump 000001. I can see all 512 dump files in the checkpoints directory. However, when I run with the same namelist from another directory, it complains that it can't find all the files. If I instead try to loop over the files in the Checkpoints block, it complains that it cannot parse the text. I paste the Checkpoints block below. Could you suggest how I should proceed?
Checkpoints(
    restart_dir = "../LS_CP_150_RR_D/checkpoints/",
    for i in range(511): restart_number = "00001-000000000"+str(i),
    #dump_minutes = 240.,
    #dump_deflate = 0,
    exit_after_dump = False,
    #keep_n_dumps = 2,
)
Hi,
The restart_dir should be your simulation directory, not the checkpoints directory. Also, do not forget to keep your dump_step or dump_minutes argument, otherwise the restarted simulation won't drop any checkpoints any more and you won't be able to run a third restart.
Checkpoints(
    restart_dir = "../LS_CP_150_RR_D",
    dump_step = 5700,
    dump_minutes = 240.,
    exit_after_dump = False,
    keep_n_dumps = 2,
)
Thanks for the quick reply! The simulation now starts, but I have an additional question. This simulation was aborted before the last time step, so the restart should begin from the last time step at which a dump was written? Looking at the stdout file of the previous run (see the attached picture), I see that at t=51300 dump 0 was written, and I expect it to start from there?
The restart happens at the most recent time when a dump occurred, unless restart_number is specified. You can easily see this when the new simulation starts.
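If you do want to force a particular dump instead of the most recent one, here is a hedged sketch, assuming restart_number takes the dump number as an integer (so with keep_n_dumps = 2 the possible values are 0 and 1):
Checkpoints(
    restart_dir = "../LS_CP_150_RR_D",  # previous simulation directory
    restart_number = 1,                 # pick dump 000001 instead of the most recent one
    dump_step = 5700,
    exit_after_dump = False,
    keep_n_dumps = 2,
)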
OK, then it should happen at t=51300. I'm a bit nervous since I don't see any entries in the output yet (see attached figure). I have used your Checkpoints block.
UPDATE: job has now been completed.
Could you please also have a look at the issue regarding the 1024 MPI processes that I posted three days ago in this thread, with the files attached in archive.zip?
Concerning the restarts, note that you can change the simulation duration if you need it to run longer.
Concerning your previous crash, the error is very obscure and there is nothing we can really conclude from it. It mentions a network error, which would be unrelated to Smilei, but I can't really tell from the log. Since it happens at the point where MPI is initialized, I would say it is a node failure. Have you tried again?
It's good to know that one can change the duration on restart; it's a nice feature and I'll be using it often.
I tried launching on 1024 MPI processes three days ago and I can try again today; there might have been a network issue on our cluster. What I also find surprising is that this restart took a few minutes to finish, while last week the same simulation was stuck on the last step for more than 30 hours... Based on the issue discussed in thread #144, it seems we have some strange problems with HDF5 on our cluster, especially when it comes to finalising the writing and reading of the HDF5 files. Could you tell me which OS and which version of HDF5 you use for SMILEI?
We run Smilei on many machines, often under RedHat Enterprise Linux, but this does not matter.
HDF5 1.8.16 or 1.10 works fine.
The filesystem is usually Lustre.
OK, thanks. Which version of OpenMPI do you use?
On most supercomputers we have used Intel MPI 2017 to 2019.
We have used OpenMPI 2, I believe, but not recently.
@Tissot11 Did you manage to confirm that your previous crash was related to your machine?
No. I need to run new simulations since I continue to have the same issues. I have now compiled the current version of SMILEI against the older HDF5 1.8.21. But I desperately need to run some other simulations, and strangely I can't finish them even in a week. Our sysadmin suspects some problem with SMILEI, and I feel all these issues are connected. If you have time, could you please try the namelist I attach here? If I run this simulation with a plasma density of 400 n_c it finishes in 6 hours, but if I change the density to 50 n_c I can't finish it even after a week on 512 MPI processes. It writes the first few outputs within a few hours, but the subsequent output takes either a week or sometimes never appears, and then I have to abort the simulation. I need to find the answer to this first, and then I'll run the simulations about particle trajectory sorting again. CP.txt
Do you use OpenMP on these runs ?
I use OpenMPI 2.0 and also set export OMP_NUM_THREADS=1, since on our regular cluster I can't use threads effectively; on a smaller machine (224 cores) based on a shared-memory model, I use fewer MPI processes (4) and more threads (32). That system might be using OpenMP since it's shared memory, but somehow I don't need to specify it. On both machines I see problems. Additionally, I started passing the --mca io romio314 option to mpirun to avoid any possible issues with parallel I/O on the Lustre filesystem.
Did you have a chance to try the namelist file I attached? I have tried running this job with SMILEI 4.2 and 4.1 compiled with HDF5 1.8.16 and 1.8.21. In all cases the simulation seems to get stuck somewhere in the middle, and even after a week it doesn't write the next output.
I did it 2 days ago. I can see from the Scalars diagnostic that the number of generated photons increases exponentially from the moment the first positron is created. This slows the simulation down prohibitively.
@xxirii Could you suggest one way to approach this issue from your experience with the QED modules ?
If you have an exponential growth of the photons, you can saturate your simulation and slow it down so much that it looks stuck.
A few things you can try (a sketch of where these parameters go in the namelist follows the list):
- minimum_chi_discontinuous: this value is set to 1e-3 in your file; you can try 1e-2. This avoids creating low-energy photons that will not contribute to pair generation; the low-energy part of the spectrum is handled by the continuous radiation models.
- radiation_photon_gamma_threshold: this determines the minimum energy at which a photon is emitted as a macro-particle. Photons around gamma = 2 have a small probability of decaying into pairs, so you can increase this parameter to 10 without significantly affecting pair production.
- The particle merging, which requires cell_sorting or the vectorization.
Let me know if you have questions on how to use these modules.
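A hedged sketch of where these parameters typically sit in a namelist; the block and parameter placement follows my reading of the Smilei documentation rather than the namelist discussed here, and the values are the ones suggested above:
# Global radiation-reaction settings.
RadiationReaction(
    minimum_chi_discontinuous = 1e-2,   # raised from 1e-3 as suggested above
)

# Radiating electron species: only emit macro-photons above gamma = 10.
Species(
    name = "electron",
    radiation_model = "Monte-Carlo",
    radiation_photon_species = "photon",
    radiation_photon_gamma_threshold = 10.,
    # ... other species parameters (position_init, momentum_init, etc.)
)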
Thanks to both of you for the quick answers. I'll change the chi and the minimum threshold for photon production. I have enabled the vectorization and also added a line for particle merging with the method vranic_cartesian in the photon block. I didn't find any info regarding cell_sorting on the SMILEI webpage. You said either vectorization or cell_sorting, so just enabling vectorization is fine? Also, enabling merging_method in the Species block gave the following error:
[Python] Exception: ERROR in the namelist: cannot define merging_method
in block Species()
You will need the very latest version from GitHub to have particle merging. The same applies to cell_sorting. Set it to True in the Main block if you don't want vectorization.
OK, I'll fetch it now. Just one clarification: should I enable vectorization or not? From your last message it seems that if I set cell_sorting to True then the vectorization block is overridden, which contradicts what your colleague xxirii suggested.
Actually, vectorization forces cell_sorting, not the other way around. If you don't want vectorization, then cell_sorting should be enough.
If you have never used vectorization, consider that it is useless when a large portion of your simulation has a low number of particles per cell. If you have more than 20 particles per cell over a large portion of the box, then it could help.
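In other words, a minimal sketch of the two options; the Vectorization block name and its mode argument are taken from the Smilei documentation as I recall it, so treat them as assumptions:
# Option 1: no vectorization, but enable sorting so that particle merging can work.
Main(
    # ... usual Main parameters ...
    cell_sorting = True,
)

# Option 2: turn on vectorization, which implies cell sorting.
Vectorization(
    mode = "on",
)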
Ok. Thanks for the clarifications. I'm going to compile and launch the simulation.
Yes, you can choose to keep the vectorization off, and in this case you have to specify cell_sorting = True in the Main block. Else you can turn on vectorization and this will induce cell_sorting = True. Nonetheless, I detected a bug this afternoon while trying to run your namelist, such that if you switch on vectorization you also have to put cell_sorting = True explicitly. This is corrected in our development branch. Your simulation is now running on Irene Joliot-Curie with particle merging. I will let you know if I reach the end.
Thanks for the answer. I had noticed it too and decided to turn on cell_sorting in the Main block. Thanks for confirming that it is indeed a bug. I also started a simulation last night and right now it's at 28500 steps.
After 10000 s of simulation, I reached 23000 iterations. I have a few remarks:
- I can see that you use a Gaussian temporal profile, but you don't specify its parameters. I don't know what the defaults are, but in your case it is better to start with a short pulse, such as 15 or 30 fs (for example with an explicit time envelope, as sketched after these remarks). Then, if it works well with a short pulse, you can increase the FWHM.
- Secondly, in my simulation the default merging parameters were not efficient enough to trigger particle merging. Therefore, I recommend being more aggressive for a first try:
# Merging
merging_method = "vranic_spherical",
merge_every = 5,
merge_min_particles_per_cell = 8,
merge_max_packet_size = 4,
merge_min_packet_size = 2,
merge_momentum_cell_size = [8,8,8],
merge_discretization_scale = "log",
# Extra parameters for experts:
merge_min_momentum_cell_length = [1e-10, 1e-10, 1e-10],
#merge_accumulation_correction = True,
Here I am using spherical with a log scale, but you can keep your geometry.
- Last point: what I usually do is run a first simulation with a larger space and time step than what I target, in order to rapidly get a first view of what I want to simulate. This lets me use fewer processors for a first try and check that everything works, even if the physics is a bit off. You can try this here by using a short pulse and twice the current discretization, for instance.
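Regarding the first remark, a hedged sketch of an explicit short temporal envelope; the tgaussian profile and the LaserPlanar1D arguments follow the Smilei documentation as I recall it, and the amplitude and FWHM values are placeholders (the FWHM must be converted to your normalized time units):
# Placeholder FWHM in normalized units, roughly corresponding to a ~30 fs pulse -- adjust to your case.
t_fwhm = 70.

LaserPlanar1D(            # or LaserGaussian2D / LaserGaussian3D, depending on your geometry
    box_side = "xmin",
    a0 = 10.,             # placeholder amplitude, not a value from this thread
    time_envelope = tgaussian(fwhm = t_fwhm, center = 2.*t_fwhm),
)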
I had chosen the default parameters first to simulate the process quickly and then fine-tune the parameters later. I'll run another simulation with your suggestions today.
Just a quick update: the lower-resolution simulation with your merging parameters finished in 22 hours :) I'm now running the high-resolution version with a longer Gaussian pulse.
Good. Do you have the scalar of the total number of particles, to see whether the merging is having an effect? Do you also have the time spent in the merging process? It is given at the end of the simulation.
I'm not sure if I understood you correctly. So I attach the stdout, profile and scalars.txt files. Please have a look. profil.txt stdout.7103430.txt scalars.txt
Thank you, I think the merging is not activated. Can you upload your input file?
Ok. I'm surprised that I managed to finish the simulation just by reducing the step size. I attach the Namelist here.
I did not look well at the stdout, everything is fine in fact, sorry :)
OK, thanks. Just for my knowledge, could you point to the places in stdout where the effect of merging is shown? The other, high-resolution simulation is running but is, as expected, a bit slower. Can you also recommend going even more aggressive with the particle-merging parameters?
Smilei gives you a summary of what is done at initialization; for the photon species you can read some information related to the merging:
Creating Species : photon
> photon is a photon species (mass==0).
> Pusher set to norm.
> Decay into pair via the multiphoton Breit-Wheeler activated
| Generated electrons and positrons go to species: electron & positron
| Number of emitted macro-particles per MC event: 1 & 1
> Particle merging with the method: vranic_spherical
| Merging time selection: every 5 iterations
| Discretization scale: log
| Minimum momentum: 1.00000e-05
| Momentum cell discretization: 8 8 8
| Minimum momentum cell length: 1.00000e-10 1.00000e-10 1.00000e-10
| Minimum particle number per cell: 8
| Minimum particle packet size: 2
| Maximum particle packet size: 4
This confirms that it has correctly read the input file.
To be more aggressive, you can use these parameters:
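As a purely hypothetical illustration (these values are a guess built from the parameter names above, not the values actually suggested), "more aggressive" could for instance mean merging more often, requiring fewer particles per cell before merging, and coarsening the momentum discretization:
# Hypothetical "more aggressive" variant of the block above -- values are guesses.
merging_method = "vranic_spherical",
merge_every = 1,                          # merge at every iteration instead of every 5
merge_min_particles_per_cell = 4,         # start merging with fewer particles in a cell
merge_max_packet_size = 4,
merge_min_packet_size = 2,
merge_momentum_cell_size = [4,4,4],       # coarser momentum-space binning merges more particles
merge_discretization_scale = "log",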
Thanks for the info. I'll try these and launch a new simulation tomorrow.
This discussion has diverged a lot from the original issue. Please open another one if necessary.
I have two questions.
When I try to run a job on 512 MPI processes, it runs without any problem. However, when I launch the same job on 1024 processes, I get an error. I attach all the files here. The number of patches is larger than the number of MPI processes, so I'm a bit confused about what the problem is.
About the restart dump, the documentation is a bit incomplete; e.g. I don't understand what dump_deflate does, and if one defines dump_step, does one still need to define dump_minutes? Also, I'm trying to use the dump feature of SMILEI to finish a simulation from the last stage at which it crashed or was aborted. So when I restart a new simulation (in a new directory), in the namelist file I can use the Checkpoints block to load the simulation state from the previous dump directory. But do I then need to define the species etc. again, since I'm starting the same simulation at a later time? Some benchmark files for the dump feature would have helped.
stdout.6569405.txt stdout.6568661.txt CP.txt
stderr.6568661.txt.zip