gdtk-uq / gdtk

The Gas Dynamics Toolkit (GDTk) is a set of software tools for simulating high-speed fluid flow, maintained at The University of Queensland and the University of Southern Queensland, Australia.
https://gdtk.uqcloud.net/

High Memory usage in simulation #46

Closed: jr144393 closed this issue 7 months ago

jr144393 commented 8 months ago

When running e4mpi, I am hitting very large memory usage that leads to segmentation-fault failures, and I am looking for reasons why this may be occurring. The case I am running is a simple 2D case with 315,000 cells, which is by no means a large grid, yet the RAM usage totals ~944 GB, which is significant even for large-memory compute nodes.

This behaviour seems very odd. Can someone assist with debugging?

Thank you.

uqngibbo commented 8 months ago

That is definitely more memory than I would expect. Can you please provide some details of the simulation you are trying to run, particularly the partitioning and the submission commands?

jr144393 commented 7 months ago

Certainly. I am running on a grid made in Eilmer; in this case it is structured. I am using the SA turbulence model with ideal air, and I am calculating nu_inf in the same way as the 2D/flat-plate-transitional-sabcm example. The flux calculator is the default. The domain consists of 5 rectangular patches, each with ~300 by 150 cells, and each is subdivided into nib=16 by njb=16 blocks, which results in 1280 cores being needed.

I then call identifyBlockConnections() followed by mpiTasks = mpiDistributeBlocks{ntasks=1280, dist="load-balance"}.
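In sketch form, the block setup in my prep script looks roughly like the following (the grid, flow-state and boundary-condition variable names here are placeholders, not the exact names from my script):

```lua
-- Five structured patches, each ~300 x 150 cells, each subdivided 16 x 16,
-- giving 5 * 16 * 16 = 1280 blocks in total.
blkArrays = {}
for i, g in ipairs(patchGrids) do      -- patchGrids: placeholder for the five StructuredGrid objects
   blkArrays[i] = FluidBlockArray{
      grid = g,
      initialState = initialFlow,      -- placeholder FlowState; nu_inf set as in flat-plate-transitional-sabcm
      bcList = patchBCs[i],            -- placeholder per-patch boundary conditions
      nib = 16, njb = 16
   }
end
identifyBlockConnections()
mpiTasks = mpiDistributeBlocks{ntasks=1280, dist="load-balance"}
```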

In my run script, I am on a PBS system and I use:

#PBS -l select=12:ncpus=120:mpiprocs=120

#PBS -l walltime=100:00:00

aprun -n 1280 e4mpi --run --verbosity=1

e4shared --post --job=dcbn --vtk-xml

This fails with a segmentation fault. Do you see anything here that stands out as being done incorrectly? Please let me know if I need to provide more information, and thank you for your continued assistance.

uqngibbo commented 7 months ago

Can you try the following changes, just temporarily, to see what happens:

0.) Recompile Eilmer using FLAVOUR=debug
1.) Change to nib=1 and njb=1 on all FluidBlockArrays
2.) Comment out the mpiDistributeBlocks line in your Lua script
3.) Change your PBS configuration to:

#PBS -l select=1:ncpus=5:mpiprocs=5

4.) Change your run command to: aprun -n 5 e4mpi --run --job=dcbn

Let me know if that works, and if not, whether you get any more informative output.
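For steps 1.) and 2.), the prep-script side would look roughly like this (a sketch only, using placeholder variable names; everything else stays the same):

```lua
blkArrays = {}
for i, g in ipairs(patchGrids) do      -- placeholder for the five structured grids
   blkArrays[i] = FluidBlockArray{
      grid = g,
      initialState = initialFlow,      -- placeholder FlowState
      bcList = patchBCs[i],            -- placeholder per-patch boundary conditions
      nib = 1, njb = 1                 -- step 1: no subdivision, so just 5 blocks in total
   }
end
identifyBlockConnections()
-- step 2: no explicit distribution; each of the 5 blocks goes to one of the 5 MPI ranks
-- mpiTasks = mpiDistributeBlocks{ntasks=1280, dist="load-balance"}
```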

jr144393 commented 7 months ago

I performed these steps, and running with 5 cores using the method described here, the code has been running without error for 90 hours. What does this tell us? Is there a scalability issue in this particular case? Are there other things to try in order to debug this?

Using qstat -f #JOBID gives me more information which may be useful (for this run with 5 cores):

resources_used.cpupercent = 500
resources_used.cput = 457:13:19
resources_used.mem = 3701336kb
resources_used.ncpus = 5
resources_used.vmem = 37031356kb
resources_used.walltime = 91:29:42

uqngibbo commented 7 months ago

Okay, that's good. It shows that there isn't some kind of fundamental configuration error in the simulation. I would switch back to FLAVOUR=fast now and then run the simulation again with more cores. Maybe start with a single node and then move to two if you feel that is taking too long. This should also give you a chance to assess the parallel performance at your given job size.

jr144393 commented 7 months ago

Great, thank you for the explanation. I will try slowly ramping up the node count to test parallel performance on this job. Do you have any suggestions for tracking those statistics, other than watching the usual items like the output simulation time versus wall-clock time?

jr144393 commented 7 months ago

For clarification: would you suggest using the mpiDistributeBlocks command as I scale up, or only once I scale beyond a single node?

uqngibbo commented 7 months ago

Good question: mpiDistributeBlocks is only needed when you have more blocks than processors. It tries to assign the blocks to processors in a way that keeps the overall load fairly balanced, but relying on it usually isn't a good idea for high-performance runs. The best thing to do is to make sure you have roughly evenly sized blocks, with a total count that matches your target core count.
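As a rough sketch of the two situations (the 640-rank figure below is purely illustrative):

```lua
-- Case A: number of blocks equals number of MPI ranks (e.g. 1280 blocks on 1280 ranks).
--         No mpiDistributeBlocks call is needed; identifyBlockConnections() is still required.

-- Case B: more blocks than ranks (e.g. 1280 blocks on 640 ranks).
--         mpiDistributeBlocks packs several blocks onto each rank so that the
--         per-rank cell counts come out roughly even.
mpiTasks = mpiDistributeBlocks{ntasks=640, dist="load-balance"}
```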

jr144393 commented 7 months ago

Got it, thank you for this information. I think this was affecting performance on other runs as well, so I am testing it out there too. I have already noted performance improvements: I reach the same simulation time in the same wall-clock time with fewer processors than in the higher-processor-count case that used mpiDistributeBlocks. This was likely because I had been using exactly as many blocks as processors, so the mpiDistributeBlocks step was just adding unnecessary work to the simulation. I am happy for this issue to be marked as completed if you think there is nothing else to test here. Again, thank you for the assistance.

uqngibbo commented 7 months ago

That's great! You're very welcome.