gdtk-uq / gdtk

The Gas Dynamics Toolkit (GDTk) is a set of software tools for simulating high speed fluid flow, maintained at The University of Queensland and the University of Southern Queensland, Australia.
https://gdtk.uqcloud.net/

Parallel Scaling test case for steady state NK-DIST exits with error #44

Closed. hkishnani closed this issue 8 months ago.

hkishnani commented 8 months ago

I was trying out the parallel scaling test case in gdtk/examples/eilmer/3D/parallel-scaling/nk/ by Nick Gibbons (@uqngibbo) to benchmark Eilmer on our local cluster. I read the README.rst file in the parallel-scaling folder, which says the following:

To run with a larger number of cores, perhaps for a supercomputer, you may want to try editing the gengrid.lua file to add more cells to the problem. You should be able to do this using the N_REFINE parameter, although you may also have to change a_factor on line 165 to keep the cluster functions happy.

So I changed N_REFINE from 0.5 to 2.0 to get a denser grid and set number_of_processors=32,64,128,256 in the parameters.txt file. For the -np 32 case, the run command is:

mpirun -np 32 e4-nk-dist --job=bc --verbosity=1 > LOGFILE 2> ERRFILE

and the case gives the following output in LOGFILE (see the attached LOGFILE.txt).

I don't understand why it is crashing for -np 32, because the same case runs fine for -np 64, 128, and 256.

I have also tried changing a_factor from 0.005 to 0.001, as mentioned in the README.rst, but that wasn't successful.

rjgollan-on-github commented 8 months ago

Hi Himanshu,

A quick Google of Unix error codes tells me that an exit code of 9 is an out-of-memory error. I suspect these system error codes have remained consistent over decades.
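If you want to confirm that on the machine itself, and assuming you have permission to read the kernel log on the compute node that hosted the failing ranks, the out-of-memory killer usually leaves a message you can grep for, for example:

dmesg -T | grep -iE 'out of memory|killed process'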

This would seem to gel with your experience that larger core counts do OK: as the core count increases, the memory required per core (or node) decreases. (There is some detail in how a cluster is configured to manage the shared memory within a node: does a job get all of the node's memory to use across its cores as it chooses, or total memory/ncores per core? Regardless, that detail shouldn't affect this simulation, because I suspect Nick has set it up to be optimally load balanced across cores.)

By stepping N_REFINE up by a factor of 4 (0.5 * 4 = 2.0), you have actually increased the memory requirement by a factor of 64 (= 4^3 in 3D). You have a few options. One is to ignore the n=32 case and just start from 64; that will still give you some indication of scaling. Another is to try a more modest N_REFINE value.
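To put rough numbers on that, here is a minimal sketch of a per-rank memory estimate; the baseline cell count and bytes-per-cell figures are assumptions for illustration only, not values taken from gengrid.lua:

# Rough per-rank memory estimate. base_cells (at N_REFINE = 0.5) and
# bytes_per_cell are assumed figures, not numbers from the actual case.
for np in 32 64 128 256; do
  awk -v np="$np" 'BEGIN {
    base_cells     = 1.0e6               # assumed cells at N_REFINE = 0.5
    bytes_per_cell = 2000                # assumed solver footprint per cell
    cells = base_cells * (2.0 / 0.5)^3   # N_REFINE = 2.0: 4x per direction, 64x in 3D
    printf "%4d ranks: ~%.1f GB per rank\n", np, cells * bytes_per_cell / np / 1e9
  }'
done

Halving the rank count doubles what each rank has to hold, which is why -np 32 can tip over the per-node memory limit while 64 and above stay under it.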

hkishnani commented 8 months ago

That makes sense. I don't know much about the memory configuration inside the cluster, but this is consistent with our observations: for N_REFINE up to 1.0 I was able to run with -np 32. Thanks a lot.