SmileiPIC / Smilei

Particle-in-cell code for plasma simulation
https://smileipic.github.io/Smilei

Simulation crashing after successful initialization on Slurm while running fine on TORQUE #521

Closed: humanoid-v1911 closed this issue 2 years ago

humanoid-v1911 commented 2 years ago

As stated in the title, I am running into issues running the attached namelist on a large computing facility. On our local cluster (TORQUE-based) it runs fine. However, because of the increase in size and duration (the 8* in the namelist) needed for the shock simulations I am doing, I rely on the greater computing power of an external Slurm-based cluster. The initialization works fine, as is visible in the output file, but after that the job crashes.

I tested the namelist on the local cluster successfully. I also tested the external cluster's SMILEI installation with a benchmark simulation and it worked fine, so the problem must lie in the combination of the namelist and the cluster.

I can't gain any further insight from the out/err files.

Attachments: namelist.txt, job submission.txt, out.txt, err.txt

mccoys commented 2 years ago

Somehow the Poisson solver did not converge. This might mean unphysical fields, which may create unphysical particle motion. Try starting your simulation with regular position initialization for all species. The Poisson equation should then already be satisfied, so the fields are zero initially.
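A minimal sketch of what switching to regular initialization could look like. Smilei namelists are Python, and `position_initialization` and `momentum_initialization` are keys of the `Species` block; the helper `species_kwargs`, the species names and all numerical values here are hypothetical and not taken from the attached namelist.

```python
# Sketch only: builds keyword arguments for a Smilei Species(...) block.
# The helper, the species names and the values are hypothetical.
def species_kwargs(name, charge, mass, regular=True):
    return dict(
        name=name,
        # "regular" places particles at fixed positions in each cell, so two
        # species with opposite charge and equal density overlap exactly.
        position_initialization="regular" if regular else "random",
        momentum_initialization="maxwell-juettner",
        particles_per_cell=16,
        charge=charge,
        mass=mass,
    )

electron = species_kwargs("electron", charge=-1.0, mass=1.0)
ion = species_kwargs("ion", charge=1.0, mass=1836.0)

# Inside a namelist these would be passed on as Species(**electron) and
# Species(**ion), together with the density, temperature and boundary settings.
```

Keeping the switch in one place makes it easy to flip every species between "regular" and "random" while hunting for the crash.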

weipengyao commented 2 years ago

This might not be related to the issue, but I can't help noticing that:

1) In your injection module, the mass of the piston is mi**10, while the electron mass is mi**2?? Are they intentionally that large, or are these typos?

2) Also, the output data will be 170 TB, which I guess you don't want, right? You might want to reduce the output frequency.
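A back-of-the-envelope estimate of field output size can be sketched as follows. The grid size, number of fields and number of dumps are hypothetical placeholders, not the values from the attached namelist, and 8 bytes per value (double precision) is an assumption.

```python
def field_output_bytes(cells, n_fields, n_dumps, bytes_per_value=8):
    """Approximate field output size: one value per cell, per field, per dump."""
    n_cells = 1
    for c in cells:
        n_cells *= c
    return n_cells * n_fields * n_dumps * bytes_per_value

# Hypothetical 2D grid with 6 field components written 1000 times
size = field_output_bytes(cells=(8192, 4096), n_fields=6, n_dumps=1000)
print(f"{size / 1e12:.2f} TB")

# Writing 10 times less often shrinks the estimate by the same factor
smaller = field_output_bytes(cells=(8192, 4096), n_fields=6, n_dumps=100)
print(f"{smaller / 1e12:.2f} TB")
```

Since the size scales linearly with the number of dumps, reducing the output frequency is usually the cheapest knob to turn.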

Since you have profiles for your plasma density, I think one way to pinpoint the issue is to add the species sections to your namelist one at a time.

Good luck!

humanoid-v1911 commented 2 years ago

@mccoys I initialized all particles except those of the injector with regular, which didn't help; the error output stays more or less the same.

@weipengyao 1) Yes, the piston particles are supposed to be this heavy, because otherwise they separated too much and invaded the plasma of interest. As they only serve the purpose of pushing, this should be fine. 2) Indeed. I just haven't bothered yet, as the sim doesn't run in the first place.

I'll try to exclude/include the species step by step and see if that helps.

mccoys commented 2 years ago

I managed to remove the segfault by setting the injectors to regular, but I have not really looked at the results and have not studied the reasons yet.

humanoid-v1911 commented 2 years ago

But this will lead to charge separation, right? Or is the Maxwell-Jüttner (MJ) momentum distribution not as affected by this?

mccoys commented 2 years ago

Right, I was not considering physical effects yet. Just trying to understand the cause. I wonder whether this would point to a numerical instability considering the grid and time resolution.
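One quick check on the grid and time resolution is the CFL condition for a standard Yee-type FDTD solver. The sketch below assumes normalized units with c = 1 and uses hypothetical cell sizes; other Maxwell solvers have different limits.

```python
import math

def cfl_limit(cell_length):
    """Largest stable timestep for a Yee-type FDTD solver, in units where c = 1."""
    return 1.0 / math.sqrt(sum(1.0 / d**2 for d in cell_length))

cell_length = [0.1, 0.1]                   # hypothetical 2D cell sizes
timestep = 0.95 * cfl_limit(cell_length)   # stay a little below the limit
print(f"CFL limit: {cfl_limit(cell_length):.5f}, chosen timestep: {timestep:.5f}")
```

A timestep at or above the limit is expected to blow up within a few steps regardless of the machine, so this check mainly rules out one class of instability.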

humanoid-v1911 commented 2 years ago

Ok, I've now tried a regularly initialized injector (other species as well), which resolved neither the segmentation fault nor any other issue for me. I assume it is probably a numerical instability then? I will play around with the grid and time resolution.

It's still weird that the exact same namelist works fine on a less powerful cluster.

mccoys commented 2 years ago

Ah, I did not realize the exact same namelist worked. I thought some scaling had been done. Now this may be a real bug.

humanoid-v1911 commented 2 years ago

Even the scaled-down version, which definitely worked, isn't running. I played around with different resolutions, but apart from running maybe a couple of time steps longer, there were no real improvements.

mccoys commented 2 years ago

Did you use the same version on both machines? It could be a bug related to specific compiler details.

humanoid-v1911 commented 2 years ago

No. In fact, this is what I have been trying for the past few days. On our local cluster I used gcc, and on the external one it's an Intel compiler. However, there are some issues with version availability and module compatibility that I need the admin to iron out first to get a successful compilation.

humanoid-v1911 commented 2 years ago

Ok, so I have tried compiling with gcc but am experiencing similar issues (segmentation fault). A colleague was able to run the job on the same external cluster, so my best guess is that there is something weird going on with my build/machine.

humanoid-v1911 commented 2 years ago

Was able to fix it. For some reason the number of patches needed to be increased very drastically, even though the exact same setup works on our local cluster. I don't know why this is an issue, but at least it seems to work now.
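When moving a namelist to a machine with a different number of MPI ranks and threads, a small sanity check of the patch decomposition can help. This is a sketch under the assumption that the constraints are the ones given in the Smilei documentation for the default patch arrangement (a power of 2 in each direction, at least one patch per MPI process); the function name and the example numbers are hypothetical.

```python
def check_patches(number_of_cells, number_of_patches, mpi_ranks, omp_threads):
    """Return a list of human-readable problems with a patch decomposition."""
    problems = []
    total = 1
    for cells, patches in zip(number_of_cells, number_of_patches):
        total *= patches
        if patches < 1 or patches & (patches - 1):
            problems.append(f"{patches} patches in one direction is not a power of 2")
        elif cells % patches:
            problems.append(f"{cells} cells do not divide evenly into {patches} patches")
    if total < mpi_ranks:
        problems.append(f"{total} patches cannot cover {mpi_ranks} MPI ranks")
    elif total < mpi_ranks * omp_threads:
        problems.append(
            f"{total} patches leave some of the {mpi_ranks * omp_threads} threads idle"
        )
    return problems

# Hypothetical job: 64 ranks with 4 threads each on a 1024 x 512 grid
print(check_patches((1024, 512), (16, 8), mpi_ranks=64, omp_threads=4))
print(check_patches((1024, 512), (32, 16), mpi_ranks=64, omp_threads=4))
```

Passing this check does not explain the crash in this thread; it only catches decompositions that are obviously too coarse for the requested resources.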

mccoys commented 2 years ago

It might point to a memory issue when many particles or fields are duplicated at once.