Nek5000 / nekRS

our next generation fast and scalable CFD code
https://nek5000.mcs.anl.gov/

AMGX spuriously hangs for large (but < 2B DOF) problem sizes #420

Closed KrausAdam closed 2 years ago

KrausAdam commented 2 years ago

I am running a case with roughly 3M elements on Summit, using AMGX for the pressure solve. At N=3 everything works fine, but if I push to N=5 (or even N=4), the code hangs without throwing any exception at the following point:
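For scale, a quick back-of-envelope check (assuming DOFs ≈ E·(N+1)³ for hexahedral spectral elements; the exact unique-DOF count after gather-scatter is somewhat lower) confirms these sizes stay well under the 2B (int32) limit mentioned in the title:

```shell
# Rough pressure-DOF estimate: E elements, (N+1)^3 nodes per element.
# E = 3e6 and N = 3..5 are taken from the report above.
E=3000000
for N in 3 4 5; do
  n1=$((N + 1))
  dofs=$((E * n1 * n1 * n1))
  echo "N=$N  ~DOFs=$dofs"
done
```

At N=5 this gives roughly 648M, comfortably below 2^31 ≈ 2.147B, so a plain 32-bit index overflow in the global problem size alone would not explain the hang.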

================ ELLIPTIC SETUP PRESSURE ================
allNeumann = 0
loading elliptic kernels ... done (0.000201542s)
timing oogs modes: 0.000454582s 0.00039569s 0.000387085s 0.00040328s 0.000290769s 0.000263196s used config: 3.0.1
timing oogs modes: 0.000820908s 0.000933383s 0.000930678s 0.000930687s 0.000584654s 0.000575225s used config: 3.0.1
setup SEMFEM preconditioner ...
building matrix ... done (17.3861s)
AMGX version 2.2.0.132-opensource
Built on Jan 6 2022, 12:20:46
Compiled with CUDA Runtime 11.0, using CUDA driver 11.0
Using CUDA-Aware MPI (GPU Direct) communicator...

To Reproduce

My .par settings are:

[PRESSURE]
residualTol = 1e-5
residualProj = yes
residualProjectionVectors  = 30
preconditioner = semfem+amgx

[AMGX]
configFile = "amgx.json"

The amgx.json is taken from the kershaw example.

I tried up to 80 nodes for the N=5 case to rule out memory issues, but that didn't resolve the hang.

stgeke commented 2 years ago

Can you please attach a debugger to some of the MPI ranks to see exactly where it deadlocks?

KrausAdam commented 2 years ago

How do I do that? I never have.
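For reference, one common way to do this is to attach gdb to a running rank and dump its backtraces (a generic sketch, not from this thread; the binary name "nekrs" is an assumption, and on Summit you would run this from a compute node of the hung job, repeating it on a few different ranks):

```shell
# Find the newest nekrs process owned by this user on the node.
pid=$(pgrep -n -u "$USER" nekrs)

# Attach non-interactively, print backtraces for all threads, then detach
# so the job is left running.
gdb -p "$pid" -batch \
    -ex "thread apply all bt" \
    -ex "detach"
```

Comparing the backtraces from several ranks usually shows which collective (e.g. an MPI call inside AMGX) some ranks have entered while others have not.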

stgeke commented 2 years ago

Fixed in https://github.com/Nek5000/nekRS/commit/9ae887200454a3409e8a0c324bde9989815716c3