STEllAR-GROUP / octotiger

Astrophysics program simulating the evolution of star systems based on the fast multipole method on adaptive Octrees
http://octotiger.stellar-group.org/
Boost Software License 1.0

Octo-Tiger thread pool exception when running on Perlmutter with 32 nodes. #485

JiakunYan closed this issue 5 months ago

JiakunYan commented 5 months ago

Expected Behavior

Octo-Tiger runs successfully.

Actual Behavior

HPX throws an exception. Below is the log of rank 32:

32: Starting main...
32: Registering functions ...
32: Starting hpx init ...
32: Check number of available GPUs...
32: Found 1 CUDA devices!
32: Initialize executors and masks...
32: Using Kokkos serial executors for multipole FMM kernels...
32: Using Kokkos serial executors for monopole FMM kernels...
32: Using Kokkos serial executors for hydro kernels...
32: Initializing cell_geometry 3 8 14
32: Registering HPX CUDA polling on polling pool...
32: ERROR: Caught HPX exception during initialization!
32: {what}: the resource partitioner does not own a thread pool named 'polling'.

It works for 8 nodes and 16 nodes. I only get this error with 32 nodes.

It started happening after I switched to the newest master branch of Octo-Tiger and cppuddle and after the recent Perlmutter system upgrades. Before that, the failure I saw was #473.

Steps to Reproduce the Problem

The spack spec I am using:

octotiger@git.dd5cb880289f7bfca0de9f4a644b2f7370e98a81=master%gcc@12.3.0+cuda+kokkos cuda_arch=80 cppflags="-L/opt/cray/pe/mpich/8.1.28/gtl/lib -lmpi_gtl_cuda" 
^cppuddle@git.e4b42ba5e550c125aadc586f964126564efb76e6 max_number_gpus=4 
^hpx networking=lci,mpi max_cpu_count=256 ^cray-mpich 
^lci+examples+tests+benchmarks fabric=ofi default-pm=cray 
^silo~mpi

JiakunYan commented 5 months ago

I am sure this is a problem with the polling threads. I modified the code to print opts().polling_threads at the beginning of the function init_executors. Here is what I found:

...
 27: init_executors: polling threads = 0
 29: init_executors: polling threads = 0
 30: init_executors: polling threads = 0
 31: init_executors: polling threads = 0
 32: init_executors: polling threads = 778658668
 33: init_executors: polling threads = 0
 34: init_executors: polling threads = 0
 35: init_executors: polling threads = 0
...

Occasionally, the value is uninitialized garbage, as on rank 32 above.

Later, I found that options::process_options runs only on rank 0, and I could not find any code that broadcasts the options to the other ranks. If that is right, the options on the other ranks are never initialized and can hold arbitrary values.
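To illustrate what I mean by "arbitrary": if a member has no default initializer and is never assigned on a rank, reading it yields whatever happens to be in memory. Below is a minimal sketch of a defensive pattern, not the actual Octo-Tiger code. The struct `options_sketch`, the helper `plausible_polling_threads`, and the choice of `std::size_t` for the field are all assumptions made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the real options class (illustrative only).
struct options_sketch {
    // A default member initializer guarantees a defined value on every rank,
    // even if option parsing never runs there.
    std::size_t polling_threads = 0;
};

// Hypothetical sanity check: a polling thread count larger than the number
// of available cores cannot be a legitimate setting.
inline bool plausible_polling_threads(const options_sketch& o,
                                      std::size_t available_cores) {
    return o.polling_threads <= available_cores;
}
```

With a check like this, the value printed on rank 32 (778658668) would be rejected early instead of surfacing later as a missing thread pool.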

tag @diehlpk @G-071

G-071 commented 5 months ago

The options get broadcast to other nodes here: https://github.com/STEllAR-GROUP/octotiger/blob/f95411da1d589c42198946dc93a24c15cbe21304/frontend/frontend-helper.cpp#L171

However, you are onto something here: I just checked the option code and the options are not getting serialized completely. In particular, the part with the polling_threads is missing there, so that is the likely culprit. Can you test if #488 resolves the issue?
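For anyone following along, here is a minimal sketch of how a field left out of a serialize function fails to reach the receiving side. It is not the Octo-Tiger or HPX code: `out_archive`, `in_archive`, `options_sketch`, its two fields, and `round_trip` are hypothetical names invented for this illustration, and the toy archives only handle trivially copyable types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy output archive: appends the raw bytes of each value to a buffer.
struct out_archive {
    std::vector<unsigned char> bytes;
    template <typename T>
    out_archive& operator&(const T& value) {
        const auto* p = reinterpret_cast<const unsigned char*>(&value);
        bytes.insert(bytes.end(), p, p + sizeof(T));
        return *this;
    }
};

// Toy input archive: reads values back in the order they were written.
struct in_archive {
    const std::vector<unsigned char>& bytes;
    std::size_t pos = 0;
    explicit in_archive(const std::vector<unsigned char>& b) : bytes(b) {}
    template <typename T>
    in_archive& operator&(T& value) {
        std::memcpy(&value, bytes.data() + pos, sizeof(T));
        pos += sizeof(T);
        return *this;
    }
};

// Hypothetical options struct with two illustrative fields.
struct options_sketch {
    int max_level = 0;
    std::size_t polling_threads = 0;

    // Incomplete: polling_threads is never written or read.
    template <typename Archive>
    void serialize_incomplete(Archive& ar) { ar & max_level; }

    // Complete: every field takes part in the round trip.
    template <typename Archive>
    void serialize_complete(Archive& ar) { ar & max_level & polling_threads; }
};

// Simulates sending options from one rank and receiving them on another.
inline options_sketch round_trip(options_sketch sent, bool complete) {
    out_archive out;
    if (complete) sent.serialize_complete(out);
    else sent.serialize_incomplete(out);

    options_sketch received;  // what the receiving rank starts with
    in_archive in(out.bytes);
    if (complete) received.serialize_complete(in);
    else received.serialize_incomplete(in);
    return received;
}
```

In this sketch the receiver falls back to the default of 0 for the omitted field because the struct has default initializers. Without them, the omitted field would keep whatever value the receiving object started with, which would match the garbage value seen on rank 32.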