ExtremeFLOW / neko

/ᐠ. 。.ᐟ\ᵐᵉᵒʷˎˊ˗
https://neko.cfd/

gslib segfaults on LUMI/Dardel for large cases #1356

Open vbaconnet opened 4 months ago

vbaconnet commented 4 months ago

As the title says, I have encountered issues running with probes on Dardel (GPU and CPU) and LUMI-G.

Observed behaviour

The simulation freezes and then segfaults at fgslib_findpts_setup in global_interpolator.

The only error message printed is:

srun: error: nid001799: task 175: Segmentation fault (core dumped)

The case is a simple box with constant inflow/outflow. I attach a zip archive with a test case for checking reproducibility: case.zip. The case can be run with turboneko.

Config

On Dardel GPU

Modules:

Configuration: ./configure FC=ftn CC=cc MPIFC=ftn MPICC=cc HIPCC=hipcc --with-hip HIP_HIPCC_FLAGS="-O3 --offload-arch=gfx90a" --enable-device-mpi --with-gslib=$GSLIB --host=x86_64-pc-linux-gnu

LUMI-G

Edit: I can no longer reproduce the segfault on LUMI, but the run still freezes for a long time at fgslib_findpts_setup.

Configuration: ./configure --with-gslib=$GSLIB FC=ftn CC=cc HIPCC=hipcc MPIFC=ftn MPICC=cc --with-hip HIP_HIPCC_FLAGS="-O3 -x hip --offload-arch=gfx90a" --enable-device-mpi

adperezm commented 4 months ago

Hi! I have had segfaults in find_points when I was using a large number of ranks relative to the number of elements in the mesh. Could that be the issue for you? In other words, what happens when you reduce the number of ranks?
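To put a number on "many ranks for few elements", here is a rough Python sketch. It assumes the elements are split evenly in contiguous blocks, which may not match how neko actually partitions the mesh, and the function name and the element and rank counts are made up for illustration.

```python
def elements_per_rank(n_elements, n_ranks):
    """Block sizes when n_elements are split as evenly as possible over n_ranks.

    Assumption: a plain even block partition, used here only to estimate
    how thin the decomposition is. Not taken from neko or gslib.
    """
    base, extra = divmod(n_elements, n_ranks)
    # The first `extra` ranks get one element more than the rest.
    return [base + 1 if rank < extra else base for rank in range(n_ranks)]


# Hypothetical numbers: 1000 elements on 256 ranks, then on 64 ranks.
for n_ranks in (256, 64):
    sizes = elements_per_rank(1000, n_ranks)
    print(n_ranks, "ranks:", min(sizes), "to", max(sizes), "elements per rank")
```

If the minimum drops to a handful of elements per rank (or to zero), that is the regime where I saw problems, so it would be worth rerunning with fewer ranks.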

As I understand it, fgslib_findpts_setup builds a hash mesh of the domain, which is later used to determine candidate ranks for each point, among other things. If so, the problem would not depend on the number of probes but rather on how the elements are distributed in the domain. (I might be wrong.)
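To make the hash mesh idea concrete, here is a toy Python sketch of a spatial hash that maps grid cells to candidate ranks. It is only an illustration of the general technique under my own assumptions: it is not gslib's implementation or API, and all names (cell_of, build_hash, candidate_ranks) and the example boxes are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import floor


def cell_of(point, domain_min, domain_max, n_cells):
    """Return the uniform-grid cell containing `point`, clamped to the grid."""
    cell = []
    for x, lo, hi in zip(point, domain_min, domain_max):
        i = floor((x - lo) / (hi - lo) * n_cells)
        cell.append(min(max(i, 0), n_cells - 1))
    return tuple(cell)


def build_hash(boxes_by_rank, domain_min, domain_max, n_cells):
    """Map each grid cell to the ranks whose element bounding boxes overlap it.

    Only the element boxes enter here; no query points are needed.
    """
    table = defaultdict(set)
    for rank, boxes in boxes_by_rank.items():
        for box_min, box_max in boxes:
            first = cell_of(box_min, domain_min, domain_max, n_cells)
            last = cell_of(box_max, domain_min, domain_max, n_cells)
            ranges = [range(a, b + 1) for a, b in zip(first, last)]
            for cell in product(*ranges):
                table[cell].add(rank)
    return table


def candidate_ranks(point, table, domain_min, domain_max, n_cells):
    """Ranks that might own `point`; an exact search would still be needed."""
    return table.get(cell_of(point, domain_min, domain_max, n_cells), set())


# Hypothetical 2D unit square split between two ranks at x = 0.5.
domain_min, domain_max, n_cells = (0.0, 0.0), (1.0, 1.0), 4
boxes_by_rank = {
    0: [((0.0, 0.0), (0.5, 1.0))],
    1: [((0.5, 0.0), (1.0, 1.0))],
}
table = build_hash(boxes_by_rank, domain_min, domain_max, n_cells)

for probe in [(0.1, 0.5), (0.9, 0.5), (0.55, 0.5)]:
    ranks = candidate_ranks(probe, table, domain_min, domain_max, n_cells)
    print(probe, "->", sorted(ranks))
```

In this sketch the table is built from the element bounding boxes alone, which is why I would expect the setup cost and any failure there to depend on the mesh and its distribution over ranks rather than on the probes. The third probe lands in a cell shared by both ranks, so both come back as candidates.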

There are probably some knobs to turn inside gslib for such cases, but it would be good to first confirm whether we are hitting the same issue.