Your error is probably related to the first warning: the CFL condition is not satisfied.
Concerning dumps, I will write something better soon. You must use the same input, except for the Checkpoints block.
I'm not sure it's due to the CFL condition, since I get the same warning for 512 MPI processes and that simulation has been running fine so far. The CFL warning only appeared today, after I enlarged the box in the x-direction while keeping the same time step. Last night, when I ran on 1024 MPI processes with the smaller x-size, there was no CFL warning but the stderr file had the same entries. I hope it's not connected with the HDF5 issue that we discussed in thread #144.
About restarting a simulation: one should have a Checkpoints block with the restart_dir option specified in order to load the simulation state from the previous run's dump? And if one doesn't use the Checkpoints block, how would the dump from the previous run be used?
The CFL warning means the timestep is too large, which typically crashes the code. Having this warning is a very bad sign unless you have very special conditions.
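For reference, in normalized units (c = 1) the 2D Yee-scheme limit is dt < 1/sqrt(1/dx^2 + 1/dy^2). A minimal sketch of keeping the timestep below it in the namelist; the cell sizes and the 5% safety margin are placeholder choices, not values from this thread, and only the Main parameters relevant here are shown:
import math

# Placeholder 2D cell sizes in normalized units -- not values from this thread.
dx, dy = 0.125, 0.125

# Yee-scheme CFL limit in normalized units (c = 1).
cfl_limit = 1.0 / math.sqrt(1.0 / dx**2 + 1.0 / dy**2)

Main(
    geometry = "2Dcartesian",
    cell_length = [dx, dy],
    timestep = 0.95 * cfl_limit,   # arbitrary 5% margin below the CFL limit
    # ... grid size, simulation time, boundary conditions, etc.
)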
For checkpoints, you need a first simulation with a Checkpoints block already defined, including some dump_step or dump_minutes; this ensures that the whole simulation state is stored at some point. The second simulation must use the same input, except that it requires restart_dir to point to the previous simulation's directory.
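For concreteness, a minimal sketch of this two-run workflow; the directory name is a placeholder and the dump settings are the ones that appear later in this thread:
# Run 1: write checkpoints periodically.
Checkpoints(
    dump_step = 5700,          # and/or dump_minutes = 240.
    exit_after_dump = False,
    keep_n_dumps = 2,
)

# Run 2: identical namelist, plus restart_dir pointing at run 1's directory.
Checkpoints(
    restart_dir = "../run1_directory",   # the simulation directory, not its checkpoints/ subfolder
    dump_step = 5700,                    # keep dumping so a further restart remains possible
    exit_after_dump = False,
    keep_n_dumps = 2,
)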
Thanks for clarifying the restart dump! About the CFL: as mentioned before, I attach the std files for the same simulation with an adjusted timestep. Now I get no CFL warning, but the simulation cannot run on 1024 MPI processes, while it runs fine on 512. The problem is that on 512 MPI processes it has been running for more than a week and still hasn't finished, so I wanted to launch it on 1024 processes to finish faster, but this error prevents me from doing so.
I'm trying to restart a simulation that was aborted. I ran it on 512 MPI processes and kept 2 dumps, and I want to restart from dump 000001. I can see all 512 dump files in the checkpoints directory. However, when I run with the same namelist from another directory, it complains that it can't find all the files. If I instead try to loop over the files in the Checkpoints block, it complains that it cannot parse the text. I paste the Checkpoints block below. Could you suggest how I should proceed?
Checkpoints(
    restart_dir = "../LS_CP_150_RR_D/checkpoints/",
    for i in range(511): restart_number = "00001-000000000"+str(i),
    #dump_minutes = 240.,
    #dump_deflate = 0,
    exit_after_dump = False,
    #keep_n_dumps = 2,
)
Hi,
The restart_dir should be your simulation directory, not the checkpoints directory. Also, do not forget to keep your dump_step or dump_minutes argument, otherwise the restarted simulation won't drop any checkpoints any more and you won't be able to run a third restart.
Checkpoints(
    restart_dir = "../LS_CP_150_RR_D",
    dump_step = 5700,
    dump_minutes = 240.,
    exit_after_dump = False,
    keep_n_dumps = 2,
)
Thanks for the quick reply! The simulation now starts, but I have an additional question. This simulation was aborted before the last time step, so the restart should begin from the last time step at which a dump was written? Looking at the stdout file of the previous run (see the attached picture), I see that at t=51300 dump 0 was written, and I expect it to start from there?
The restart happens at the most recent time when a dump occurred, unless restart_number is specified. You can easily see this when the new simulation starts.
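If you do want to force a particular dump instead of the most recent one, here is a hedged sketch, assuming restart_number takes the dump number as an integer (so with keep_n_dumps = 2 the possible values are 0 and 1):
Checkpoints(
    restart_dir = "../LS_CP_150_RR_D",  # previous simulation directory
    restart_number = 1,                 # pick dump 000001 instead of the most recent one
    dump_step = 5700,
    exit_after_dump = False,
    keep_n_dumps = 2,
)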
OK, then it should happen at t=51300. I'm a bit nervous since I don't see any entries in the output yet (see attached figure). I have used your Checkpoints block.
UPDATE: job has now been completed.
Could you please also have a look at the issue regarding the 1024 MPI processes that I posted three days ago in this thread, with the files attached in archive.zip?
Concerning the restarts, note that you can change the simulation duration if you need it to run longer.
Concerning your previous crash, the error is very obscure and there is nothing we can really conclude from it. It mentions a network error, which would be unrelated to Smilei, but I can't really tell from the log. Since it happens at the point where MPI is initialized, I would say it is a node failure. Have you tried again?
It's good to know that one can change the duration on restart; it's a nice feature and I'll be using it often.
I tried launching on 1024 MPI processes three days ago and I can try again today; there might have been a network issue on our cluster. What I also find surprising is that this restart took a few minutes to finish, while last week the same simulation was stuck on the last step for more than 30 hours... Based on the issue discussed in thread #144, it seems we have some strange problems with HDF5 on our cluster, especially when it comes to finalising the writing and reading of the HDF5 files. Could you tell me which OS and which version of HDF5 you use for SMILEI?
We run Smilei on many machines, often under RedHat Enterprise Linux, but this does not matter.
HDF5 1.8.16 or 1.10 works fine.
The filesystem is usually Lustre.
OK, thanks. Which version of OpenMPI do you use?
On most supercomputers we have used Intel MPI 2017 to 2019.
We have used OpenMPI 2, I believe, but not recently.
@Tissot11 Did you manage to confirm that your previous crash was related to your machine?
No. I need to run new simulations since I continue to have the same issues. I have now compiled the current version of SMILEI against the older HDF5 1.8.21. But I desperately need to run some other simulations, and strangely I can't finish them even in a week. Our sysadmin suspects some problem with SMILEI, and I feel all these issues are connected. If you have time, could you please try the namelist I attach here? If I run this simulation with a plasma density of 400 n_c it finishes in 6 hours, but if I change the density to 50 n_c I can't finish it even after a week on 512 MPI processes. It writes the first few outputs within a few hours, but the subsequent output takes either a week or sometimes never appears, and then I have to abort the simulation. I need to find the answer to this first, and then I'll run the simulations about particle trajectory sorting again. CP.txt
Do you use OpenMP on these runs ?
I use OpenMPI 2.0 and also set export OMP_NUM_THREADS=1, since on our regular cluster I can't use threads effectively; on a smaller machine (224 cores) based on a shared-memory model, I use fewer MPI processes (4) and more threads (32). That system might be using OpenMP since it's shared memory, but somehow I don't need to specify it. On both machines I see problems. Additionally, I started passing the --mca io romio314 option to mpirun to avoid any possible issues with parallel I/O on the Lustre filesystem.
Did you have a chance to try the namelist file I attached? I have tried running this job with SMILEI 4.2 and 4.1 compiled with HDF5 1.8.16 and 1.8.21. In all cases the simulation seems to get stuck somewhere in the middle, and even after a week it doesn't write the next output.
I did it 2 days ago. I can see from the Scalars diagnostic that the number of generated photons increases exponentially from the moment the first positron is created. This slows the simulation down prohibitively.
@xxirii Could you suggest one way to approach this issue from your experience with the QED modules ?
If you have an exponential growth of the photons, you can saturate your simulation and slow it down so much that it looks stuck.
A few things you can try (a sketch of where these parameters go in the namelist follows the list):
- minimum_chi_discontinuous: this value is set to 1e-3 in your file; you can try 1e-2. This avoids creating low-energy photons that will not contribute to pair generation; the low-energy part of the spectrum is handled by the continuous radiation models.
- radiation_photon_gamma_threshold: this determines the minimum energy at which a photon is emitted as a macro-particle. Photons around gamma = 2 have a small probability of decaying into pairs, so you can increase this parameter to 10 without significantly affecting pair production.
- The particle merging, which requires cell_sorting or the vectorization.
Let me know if you have questions on how to use these modules.
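A hedged sketch of where these parameters typically sit in a namelist; the block and parameter placement follows my reading of the Smilei documentation rather than the namelist discussed here, and the values are the ones suggested above:
# Global radiation-reaction settings.
RadiationReaction(
    minimum_chi_discontinuous = 1e-2,   # raised from 1e-3 as suggested above
)

# Radiating electron species: only emit macro-photons above gamma = 10.
Species(
    name = "electron",
    radiation_model = "Monte-Carlo",
    radiation_photon_species = "photon",
    radiation_photon_gamma_threshold = 10.,
    # ... other species parameters (position_init, momentum_init, etc.)
)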
Thanks to both of you for the quick answers. I'll change the chi and the minimum threshold for photon production. I have enabled the vectorization and also added a line for particle merging with the method vranic_cartesian in the photon block. I didn't find any info regarding cell_sorting on the SMILEI webpage. You said either vectorization or cell_sorting, so just enabling vectorization is fine? Also, enabling merging_method in the Species block gave the following error:
[Python] Exception: ERROR in the namelist: cannot define merging_method
in block Species()
You will need the very latest version from GitHub to have particle merging. The same applies to cell_sorting. Set it to True in the Main block if you don't want vectorization.
OK, I'll fetch it now. Just one clarification: should I enable vectorization or not? From your last message it seems that if I set cell_sorting to True then the vectorization block is overridden, which contradicts what your colleague xxirii suggested.
Actually, vectorization forces cell_sorting, not the other way around. If you don't want vectorization, then cell_sorting should be enough.
If you have never used vectorization, consider that it is useless when a large portion of your simulation has a low number of particles per cell. If you have more than 20 particles per cell over a large portion of the box, then it could help.
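In other words, a minimal sketch of the two options; the Vectorization block name and its mode argument are taken from the Smilei documentation as I recall it, so treat them as assumptions:
# Option 1: no vectorization, but enable sorting so that particle merging can work.
Main(
    # ... usual Main parameters ...
    cell_sorting = True,
)

# Option 2: turn on vectorization, which implies cell sorting.
Vectorization(
    mode = "on",
)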
Ok. Thanks for the clarifications. I'm going to compile and launch the simulation.
Yes, you can choose to keep the vectorization off, and in this case you have to specify cell_sorting = True in the Main block. Else you can turn on vectorization and this will induce cell_sorting = True. Nonetheless, I detected a bug this afternoon while trying to run your namelist, such that if you switch on vectorization you also have to put cell_sorting = True explicitly. This is corrected in our development branch. Your simulation is now running on Irene Joliot-Curie with particle merging. I will let you know if I reach the end.
Thanks for the answer. I had noticed it too and decided to turn on cell_sorting in the Main block. Thanks for confirming that it is indeed a bug. I also started a simulation last night and right now it's at 28500 steps.
After 10000 s of simulation, I reached 23000 iterations. I have a few remarks:
- I can see that you use a Gaussian temporal profile, but you don't specify its parameters. I don't know what the defaults are, but in your case it is better to start with a short pulse, such as 15 or 30 fs (for example with an explicit time envelope, as sketched after these remarks). Then, if it works well with a short pulse, you can increase the FWHM.
- Secondly, in my simulation the default merging parameters were not efficient enough to trigger particle merging. Therefore, I recommend being more aggressive for a first try:
# Merging
merging_method = "vranic_spherical",
merge_every = 5,
merge_min_particles_per_cell = 8,
merge_max_packet_size = 4,
merge_min_packet_size = 2,
merge_momentum_cell_size = [8,8,8],
merge_discretization_scale = "log",
# Extra parameters for experts:
merge_min_momentum_cell_length = [1e-10, 1e-10, 1e-10],
#merge_accumulation_correction = True,
Here I am using spherical with a log scale, but you can keep your geometry.
- Last point: what I usually do is run a first simulation with a larger space and time step than what I target, in order to rapidly get a first view of what I want to simulate. This lets me use fewer processors for a first try and check that everything works, even if the physics is a bit off. You can try this here by using a short pulse and twice the current discretization, for instance.
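Regarding the first remark, a hedged sketch of an explicit short temporal envelope; the tgaussian profile and the LaserPlanar1D arguments follow the Smilei documentation as I recall it, and the amplitude and FWHM values are placeholders (the FWHM must be converted to your normalized time units):
# Placeholder FWHM in normalized units, roughly corresponding to a ~30 fs pulse -- adjust to your case.
t_fwhm = 70.

LaserPlanar1D(            # or LaserGaussian2D / LaserGaussian3D, depending on your geometry
    box_side = "xmin",
    a0 = 10.,             # placeholder amplitude, not a value from this thread
    time_envelope = tgaussian(fwhm = t_fwhm, center = 2.*t_fwhm),
)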
I had chosen the default parameters first to simulate the process quickly and then fine-tune the parameters later. I'll run another simulation with your suggestions today.
Just a quick update: the lower-resolution simulation with your merging parameters finished in 22 hours :) I'm now running the high-resolution version with a longer Gaussian pulse.
Good. Do you have the scalar of the total number of particles, to see whether the merging is having an effect? Do you also have the time spent in the merging process? It is given at the end of the simulation.
I'm not sure if I understood you correctly. So I attach the stdout, profile and scalars.txt files. Please have a look. profil.txt stdout.7103430.txt scalars.txt
Thank you, I think the merging is not activated. Can you upload your input file?
Ok. I'm surprised that I managed to finish the simulation just by reducing the step size. I attach the Namelist here.
I did not look well at the stdout, everything is fine in fact, sorry :)
OK, thanks. Just for my knowledge, could you point to the places in stdout where the effect of merging is shown? The other, high-resolution simulation is running but is, as expected, a bit slower. Can you also recommend going even more aggressive with the particle-merging parameters?
Smilei gives you a summary of what is done at initialization; for the photon species you can read some information related to the merging:
Creating Species : photon
> photon is a photon species (mass==0).
> Pusher set to norm.
> Decay into pair via the multiphoton Breit-Wheeler activated
| Generated electrons and positrons go to species: electron & positron
| Number of emitted macro-particles per MC event: 1 & 1
> Particle merging with the method: vranic_spherical
| Merging time selection: every 5 iterations
| Discretization scale: log
| Minimum momentum: 1.00000e-05
| Momentum cell discretization: 8 8 8
| Minimum momentum cell length: 1.00000e-10 1.00000e-10 1.00000e-10
| Minimum particle number per cell: 8
| Minimum particle packet size: 2
| Maximum particle packet size: 4
This confirms that it has correctly read the input file.
To be more aggressive, you can use these parameters:
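As a purely hypothetical illustration (these values are a guess built from the parameter names above, not the values actually suggested), "more aggressive" could for instance mean merging more often, requiring fewer particles per cell before merging, and coarsening the momentum discretization:
# Hypothetical "more aggressive" variant of the block above -- values are guesses.
merging_method = "vranic_spherical",
merge_every = 1,                          # merge at every iteration instead of every 5
merge_min_particles_per_cell = 4,         # start merging with fewer particles in a cell
merge_max_packet_size = 4,
merge_min_packet_size = 2,
merge_momentum_cell_size = [4,4,4],       # coarser momentum-space binning merges more particles
merge_discretization_scale = "log",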
Thanks for the info. I'll try these and launch a new simulation tomorrow.
This discussion has diverged a lot from the original issue. Please open another one if necessary.
I have two questions.
When I try to run a job on 512 MPI processes, it runs without any problem. However, when I launch the same job on 1024 processes, I get an error. I attach all the files here. The number of patches is larger than the number of MPI processes, so I'm a bit confused about what the problem is.
About the restart dump, the documentation is a bit incomplete; e.g. I don't understand what dump_deflate does, and if one defines dump_step, does one still need to define dump_minutes? Also, I'm trying to use the dump feature of SMILEI to finish a simulation from the last stage at which it crashed or was aborted. So when I restart a new simulation (in a new directory), in the namelist file I can use the Checkpoints block to load the simulation state from the previous dump directory. But do I then need to define the species etc. again, since I'm starting the same simulation at a later time? Some benchmark files for the dump feature would have helped.
stdout.6569405.txt stdout.6568661.txt CP.txt
stderr.6568661.txt.zip