SmileiPIC / Smilei

Particle-in-cell code for plasma simulation
https://smileipic.github.io/Smilei

Segmentation faults #704

Closed. Tissot11 closed this issue 2 months ago

Tissot11 commented 8 months ago

Hi,

I could run one simulation with three restarts successfully, but on the fourth restart I see segmentation faults after 6 hours of runtime. I attach the err and out files. Please let me know what the reason for this could be, since the physical results look fine until the crash.

On another machine, running Smilei sometimes triggers a kernel-panic bug in the InfiniBand drivers, leading to node failures, as the support team told me. Is this a common occurrence, and are there remedies for avoiding this sort of crash?

tjob_hybrid.err.9777446.txt tjob_hybrid.out.9777446.txt

mccoys commented 8 months ago

I see "Address not mapped to object [0xfffffffffffffffd]" and "failed: Cannot allocate memory".

You probably ran out of memory.

Tissot11 commented 8 months ago

The Smilei output file shows very little memory usage, e.g. 60 GB, while the nodes have 256 GB of memory each. In the past I did encounter memory issues, but then the Smilei output file would also show it.

beck-llr commented 8 months ago

I agree with @mccoys, it looks like a memory problem. Where did you see a memory occupation of 60 GB?

In any case, the memory occupation is always underestimated because of many temporary buffers. A more accurate (but still underestimated) way to measure memory occupation is to use the Performance diagnostic. A possible scenario is that a strong load imbalance drives a peak of memory occupation on a single node and crashes it.

I notice that you are using very small patches with respect to your number of threads (more than 100 patches per OpenMP thread). You can try using larger patches. This should reduce the memory overhead induced by patch communication.

If you detect a peak of memory occupation somewhere that crashes a node you can also consider using the particle merging feature to mitigate that effect.
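In case it helps, here is a sketch of enabling the performance diagnostic in the namelist (the exact parameter names and defaults should be checked against the documentation; the values here are placeholders):

    # Placeholder values; enables the Performance diagnostic mentioned above.
    DiagPerformances(
        every = 100,               # output period in iterations
        patch_information = True,  # also record per-patch information, if needed
    )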

Tissot11 commented 8 months ago

It's in the stdout files I attached earlier (see the first message). It says 60 GB. I can use the performance diagnostics to see if memory is indeed the issue.

Last year I asked about a memory issue and followed up on your suggestion to use larger patches. However, the runtimes got really slow and I couldn't finish the simulations even after restarting them a few times. I then tried a large number of processors, e.g. 35000, for this problem. I could finish the simulations in a shorter time, albeit with somewhat low CPU usage. Last year I also tried the particle merging feature, but I couldn't optimize the merging parameters very well for my simulations.

Tissot11 commented 8 months ago

Looking at the memory bandwidth per socket, I see very little memory usage (see the attached file).

9777446.pdf

beck-llr commented 8 months ago

If you need small patches for performance, that confirms that your case is strongly imbalanced. It also explains why you get poor CPU usage when scaling. It should show up in the performance diag. Any chance you could use more OpenMP threads and fewer MPI processes? Or are you already bound by the number of cores per socket of your system?

mccoys commented 8 months ago

At the end of the stdout, it says:

Maximum memory per node: 57.321124 GB (defined as MaxRSS*Ntasks/NNodes)

Is that used memory or available memory? I ask because in your document, the maximum memory per node appears to be about 50 GB, which is dangerously close to that limit above.

Tissot11 commented 8 months ago

@mccoys The maximum memory per node is 256 GB. @beck-llr It's a collisionless shock simulation, so of course it can be imbalanced. I have tried vectorization, SDMD, particle merging and OpenMP tasks to speed things up, but with limited success so far. I'm only using either 4 or 6 MPI processes per node with 12 or 19 OpenMP threads on two different machines, because this gives the best performance.

Tissot11 commented 8 months ago

Just to add that vectorization does help: the compute time improves by 2x.

mccoys commented 8 months ago

Note that load balancing produces a memory spike that can be very substantial. The crash appears at that moment, and it seems related to MPI not being able to send all the data between ranks. Have you tried doing load balancing more or less often?

Tissot11 commented 8 months ago

I do load balancing rather often, every 150 iterations. Should I increase it even more? I can try it tonight.

mccoys commented 8 months ago

No, I bet you should reduce it. If you do it rarely, it has to do a lot of patch swaps, meaning a lot of memory allocation.

The default is 20, but that may not be optimal for your case.

Tissot11 commented 8 months ago

Ok. I have launched a job and I'll let you know if it works with aggressive load balancing. I have set every=40. Just to be sure, the default load balancing is every=150 as written on the documentation page? I'm using vectorization with every=20.
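For reference, this is roughly what those two settings look like in my namelist (a sketch; block and parameter names as I understand them from the Smilei documentation, please correct me if they differ):

    # Sketch only: the values are the ones described above.
    LoadBalancing(
        every = 40,                # was 150; now rebalancing more aggressively
    )

    Vectorization(
        mode = "adaptive",
        reconfigure_every = 20,    # what I referred to as "vectorization every=20"
    )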

beck-llr commented 8 months ago

Yes, the default is 150 according to pyinit.py. Another metric that you can monitor is the number of patches per MPI process. You can check it directly in the patch_load.txt file, which displays the number of patches per MPI process after each load-balance operation. You have a problem if an MPI process ends up with only a couple of patches.

Tissot11 commented 8 months ago

Unfortunately this simulation failed even earlier than before. I attach the err, out and patch_load files. From the patch_load file, I see from almost 200 up to 1000 patches per thread. So I guess this is fine?

Although the simulation is imbalanced, when I plot the results up to the crash I don't see any unexpected behaviour. Everything looks physical and expected, which is why I'm worried. I asked the technical support team and they also seem to suggest that debugging this would be very hard.

tjob_hybrid.err.9787451.txt tjob_hybrid.out.9787451.txt

patch_load.txt

mccoys commented 6 months ago

I had another quick look at this issue. UCX errors are usually related to MPI or network settings that allow for different memory or cache amounts for MPI transfers. It is not directly a Smilei issue, so I am closing this.

mccoys commented 5 months ago

Reopening based on an indication from @Tissot11 elsewhere that this is a regression, as it used to work before v5.0. Can you confirm this? Do you have a case we could test?

Tissot11 commented 5 months ago

Yeah, I do have a case... After I switched to Smilei v5.0 last year, I have seen numerous segmentation faults (with 2D simulations) on different machines with different compilers and libraries. Last month I managed to run the same simulation I complained about at the beginning of this thread with Smilei v4.7, without a segmentation fault or memory-related crash.

Because of these widespread segmentation faults, I started using other codes for my simulations. If you investigate this issue and we can hope to resolve it quickly, then I can prepare a case and give it to you...

mccoys commented 5 months ago

It depends on whether we are able to reproduce the error. If this error requires a large allocation to reproduce, it will of course take longer.

beck-llr commented 5 months ago

Hi. It is indeed a large simulation and it will be difficult to provide a fix if one is really required.

@Tissot11 are you positive that there is a regression and that you observe the crash in an exactly identical configuration as before (same simulation size, number of patches, physical configuration, compiler, MPI module, etc.)?

I had a look at the logs you provided and it is indeed an extremely imbalanced simulation. After the last load balancing, the number of patches per MPI rank spans from 176 to 4680! I assume this puts a lot of pressure on the dynamic load-balancing process and the MPI exchanges. Moreover, you are using a very high number of patches, which also increases memory and communication overheads. Even 176 is a lot of patches when you have only 12 OpenMP threads.

I would strongly advise dividing your total number of patches by at least a factor of 4 (see the sketch below). You previously answered that this would slow down your simulation too much. By how much did you decrease your number of patches? Did you check the minimum number of patches per MPI? As long as you have at least 24 patches per MPI (with 12 OpenMP threads) it should not slow down dramatically. Going down to less than one patch per thread is when you are going too far.

P.S.: You may observe a serious slowdown because of cache effects beyond a certain patch size. In that case you could try reducing your number of patches by only a factor of 2. I'd be really surprised if it didn't help, but you can never know for sure :-)
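To make that concrete, a sketch with made-up numbers (your actual grid and patch counts will differ):

    # Illustrative numbers only: halving the patch count in each direction
    # divides the total number of patches by 4 while keeping the same grid.
    Main(
        # ...
        # before: number_of_patches = [256, 256]   # 65536 patches in total
        number_of_patches = [128, 128],             # 16384 patches in total
    )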

beck-llr commented 5 months ago

Also, for the particle merging to be efficient, you need to know what the distribution of your macro-particles in your most populated patches/cells looks like. I'm still convinced it could be very helpful in your case, but it does require a bit of tuning. Note that the default merge_momentum_cell_size is VERY conservative. Do not hesitate to reduce it significantly. Conversely, make sure that merge_min_particles_per_cell is not too low. You are only interested in merging particles in cells with many more particles than average.
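As an illustration only (placeholder values, not tuned for your case), these parameters are set per Species; check the documentation for the defaults and the full list of merging options:

    # Placeholder values; the merging parameters belong to the Species block.
    Species(
        name = "ion",                          # hypothetical species name
        # ... usual species definition ...
        merging_method = "vranic_spherical",
        merge_every = 20,                      # placeholder merging period
        merge_min_particles_per_cell = 64,     # merge only in heavily loaded cells
        merge_momentum_cell_size = [8, 8, 8],  # coarser than the (conservative) default
    )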

Tissot11 commented 5 months ago

Indeed, the problems I first reported were with large 2D simulations. One of these simulations, with a larger domain and >25K CPUs, I managed to run with Smilei v4.7 (without any filtering; not very efficient, as you explained, due to the patches) using an older Intel compiler and libraries (compiler/intel/2022.0.2, numlib/mkl/2022.0.2, mpi/impi/2021.5.1, lib/hdf5/1.12) on Horeka. I should emphasise that I mostly use interpolation order 4, but sometimes I also use order 2.

However, I have now prepared a simple case (2D) that I ran on 8 nodes of Hawk at HLRS and on 4 nodes of another HPC machine. To summarize:

  1. This simulation runs fine with the custom MPI library MPT at HLRS and with OpenMPI 5.0 and gcc 10.2. However, it starts showing segmentation faults with OpenMPI if I just change the mass ratio and nothing else in the namelist.
  2. Even with the MPT library, it shows segmentation faults (also with Smilei 4.7) if I enable the Friedman filter. The same segmentation faults also occur with the Intel MPI library on another machine. Please see the attached namelist.

I fear that newer compilers and the changes made in Smilei 5.0 have some subtle issues, at least for 2D simulations, since I do not see any issues with 1D simulations. I have spent a lot of time trying to run the same and similar 2D simulations with several combinations of libraries and compilers, and I have spent the last few months talking with technical support, and nothing came out of it. This is why I have started using other codes.

I would be very happy if we could figure this out so that I can use Smilei for 2D simulations.

namelist.py.txt Shock_test.e2581658.txt Shock_test.e2581722.txt Shock_test.e2581744.txt Shock_test.e2583577.txt Shock_test.e2583698.txt tjob_hybrid.err.12833017.txt

Tissot11 commented 5 months ago

> Also for the particle merging to be efficient, you need to know what the distribution of your macro particles in your most populated patches/cells look like. I'm still convinced it could be very helpful in your case but it does require a bit of tuning. Note that the default merge_momentum_cell_size is VERY conservative. Do not hesitate to reduce it significantly. On the opposite, make sure that the merge_min_particles_per_cell is not too low. You're only interested in merging particles in cells with many more particles than average.

I had this problem with memory last year. I started using interpolation order 4 and fewer particles per cell, and launching 4-6 MPI processes with 12 OpenMP threads each on a single node. With this approach I no longer have memory issues: the memory usage reported by every tool remains below 256 GB per node. However, I sometimes saw memory-related segmentation faults, as reported before, which you and @mccoys attributed to intermittent memory spikes that I could not catch with any performance-monitoring tool. I suspect the problem is with the MPI communication, and that is why segmentation faults have become a very frequent occurrence with these 2D simulations.
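Roughly, the relevant part of that setup looks like this (a sketch with placeholder values, not my exact production namelist):

    # Placeholder values illustrating the setup described above.
    Main(
        # ...
        interpolation_order = 4,     # higher order, so fewer particles per cell are needed
    )

    Species(
        name = "electron",           # hypothetical species name
        particles_per_cell = 16,     # reduced to lower the memory footprint
        # ...
    )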

beck-llr commented 5 months ago

mi and vUPC are undefined in the namelist you provided.

Tissot11 commented 5 months ago

Sorry! This is a redacted version and I forgot that I still use these parameters later in the diagnostics.

namelist.py.txt

beck-llr commented 5 months ago

Thanks. I have tried with the dummy values mi=50 and vUPC=0.01 and was able to reproduce a problem. I will look into it.

mccoys commented 5 months ago

@beck-llr would it be possible to have a maximum_npatch_per_MPI option in the load balancing? It would prevent overloading ranks when there is a strong load imbalance. Maybe this is not the issue here, but the older logs really look like MPI is overloaded.

Now the new logs are different, so we have to see (errors in the projectors usually mean that particles are not where they are supposed to be).

beck-llr commented 5 months ago

@mccoys There are already options to tune the dynamic load balancing, such as cell_load, which influences the min and max number of patches per MPI. In the present case I am more concerned with enforcing a minimum number of patches (which can be achieved by increasing the cell load). In fact the min and max are linked: if you increase the min, you mechanically decrease the max.
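For illustration only (an arbitrary value; check the documentation for the default and the exact meaning of this parameter):

    # A higher cell_load gives the grid more weight relative to the particles,
    # which evens out the per-patch load, raising the minimum number of patches
    # per rank and correspondingly lowering the maximum.
    LoadBalancing(
        every = 150,
        cell_load = 4.,    # arbitrary illustrative value; the default is lower (1., I believe)
    )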

From my first tests the problem here now lies within the Friedman filter. I think it has been problematic for a while. This is a good opportunity to have a close look at it.

Tissot11 commented 5 months ago

@beck-llr So should I change the cell_load for my simulations? I have never set it in my simulations. As @mccoys says, something automatic to reduce load imbalance would be useful, since most plasma physics simulations end up load-imbalanced after a short interaction time. With laser-solid interactions, this could be even more demanding than shock simulations...

Besides the Friedman filter, I have also seen segmentation faults with different MPI libraries. In general, it would be nice to have Smilei always working with OpenMPI and showing no segmentation faults, except for obvious, understandable reasons...

Tissot11 commented 5 months ago

I was wondering if there is any relevant info you would want to share at this stage?

Tissot11 commented 4 months ago

I would appreciate it if you could let me know the possible causes of these segmentation faults and whether you intend to address them soon. This would help me decide whether I should wait to use Smilei for simulating this problem or not...

beck-llr commented 4 months ago

The bug in the Friedman filter is reproducible and will be fixed in the relatively short term.

For the rest, the issue is unclear and not reproducible for the moment. That does not mean there is no problem, but it does not affect many people and I don't know exactly what we can do about it. Do you think you could provide a case that reproduces the problem without using the Friedman filter?

Tissot11 commented 4 months ago

The same namelist also crashed for me without the Friedman filter when I chose a larger simulation domain and a longer duration. I could, however, run it successfully, albeit inefficiently, with the older Smilei version (4.7). Even this reduced version suddenly shows higher push times after a sufficiently long runtime. I guess this sudden increase in push times (by more than a factor of 5 or 6) could be linked to memory load, leading to the segmentation faults mentioned earlier in this thread. However, I could never catch any unreasonable memory usage with any of the tools at my disposal.

beck-llr commented 2 months ago

Hi. The problem with the Friedman filter was actually not a bug but a consequence of the filter, which tightens the CFL condition; that condition was no longer satisfied with the time step you use, timestep_over_CFL = 0.95. Using a smaller timestep, for instance timestep = 0.5 * dx as suggested in the documentation of the Friedman filter, fixed the problem.
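In namelist terms, a minimal sketch (dx is a placeholder for your cell length; check the documentation for the exact FieldFilter parameters):

    # Placeholder resolution; the point is only the stricter timestep.
    dx = 0.125
    Main(
        # ...
        cell_length = [dx, dx],
        timestep = 0.5 * dx,      # well below the CFL limit, as the Friedman
                                  # filter documentation suggests
        # instead of: timestep_over_CFL = 0.95
    )

    # Friedman temporal filter (parameter names as I recall them from the docs).
    FieldFilter(
        model = "Friedman",
        theta = 0.3,              # placeholder filtering strength
    )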

That made me wonder whether the other crashes you experienced (which I could not reproduce) are also consequences of a bad CFL. I suggest trying smaller timesteps and seeing if they keep happening.

I am going to close this issue now but do not hesitate to reopen it if you have additional information to share.

Tissot11 commented 2 months ago

Ok, thanks for this tip. That magic time-step condition is listed on a different page, so it escaped my attention, since I hadn't used the Friedman filter before and was experimenting with it for the first time.

For the other simulations, the segmentation faults might be connected with a memory issue, since I have plasma constantly coming in, though I have not been able to figure it out completely. I'll try to stop the particle injector at some point, since it appears that I can't cap the maximum number of particles per cell.

I just wanted to reply to this thread. Please close it afterwards.

beck-llr commented 2 months ago

Thanks for the feedback.