Closed JiakunYan closed 5 months ago
I am sure this is a problem with the polling thread. I modified the code to print opts().polling_threads at the beginning of the function init_executors. Here is what I found:
...
27: init_executors: polling threads = 0
29: init_executors: polling threads = 0
30: init_executors: polling threads = 0
31: init_executors: polling threads = 0
32: init_executors: polling threads = 778658668
33: init_executors: polling threads = 0
34: init_executors: polling threads = 0
35: init_executors: polling threads = 0
...
Occasionally, the value is garbage, which suggests it is uninitialized.
Later, I found that options::process_options only runs on rank 0, and I did not find any code that broadcasts the options to the other ranks, so the options on the other ranks are never initialized and can hold arbitrary values.
tag @diehlpk @G-071
The options get broadcast to other nodes here: https://github.com/STEllAR-GROUP/octotiger/blob/f95411da1d589c42198946dc93a24c15cbe21304/frontend/frontend-helper.cpp#L171
However, you are onto something here: I just checked the option code, and the options are not being serialized completely. In particular, polling_threads is missing from the serialization, so that is the likely culprit. Can you test whether #488 resolves the issue?
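To illustrate the failure mode described in the reply above, here is a minimal sketch of how a field left out of a serialization routine keeps whatever value the receiving side already had. It is not the Octo-Tiger or HPX code: the struct options_sketch, the byte buffer, and all function names are hypothetical, and the "remote" struct is pre-filled with a placeholder value (the one from the log above) because reading truly uninitialized memory would be undefined behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the real options struct; only two fields shown.
struct options_sketch {
    int max_level;
    int polling_threads;
};

using buffer = std::vector<unsigned char>;

// Append the raw bytes of an int to the buffer.
void put_int(buffer& buf, int v) {
    unsigned char raw[sizeof(int)];
    std::memcpy(raw, &v, sizeof(int));
    buf.insert(buf.end(), raw, raw + sizeof(int));
}

// Read an int from the buffer at pos and advance pos.
int get_int(const buffer& buf, std::size_t& pos) {
    assert(pos + sizeof(int) <= buf.size());
    int v;
    std::memcpy(&v, buf.data() + pos, sizeof(int));
    pos += sizeof(int);
    return v;
}

// Incomplete serialization: polling_threads is never written or read.
buffer pack_incomplete(const options_sketch& o) {
    buffer buf;
    put_int(buf, o.max_level);
    return buf;
}

void unpack_incomplete(const buffer& buf, options_sketch& o) {
    std::size_t pos = 0;
    o.max_level = get_int(buf, pos);
}

// Complete serialization: every field is transferred.
buffer pack_complete(const options_sketch& o) {
    buffer buf;
    put_int(buf, o.max_level);
    put_int(buf, o.polling_threads);
    return buf;
}

void unpack_complete(const buffer& buf, options_sketch& o) {
    std::size_t pos = 0;
    o.max_level = get_int(buf, pos);
    o.polling_threads = get_int(buf, pos);
}

// Simulates a broadcast from rank 0 to one remote rank.
options_sketch broadcast_to_remote(const options_sketch& root, bool complete) {
    // Placeholder for stale memory on the remote rank (assumption for the demo).
    options_sketch remote{-1, 778658668};
    if (complete) {
        unpack_complete(pack_complete(root), remote);
    } else {
        unpack_incomplete(pack_incomplete(root), remote);
    }
    return remote;
}
```

With a root value of options_sketch{4, 0}, the incomplete path leaves polling_threads at the stale placeholder on the remote side while max_level arrives correctly. That matches the pattern in the log, where most ranks happen to see 0 and one sees an arbitrary number. The complete path transfers both fields.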
Expected Behavior
OctoTiger runs successfully.
Actual Behavior
HPX gives me an exception (below is the log of rank 32)
It works for 8 nodes and 16 nodes. I can only reproduce this error with 32 nodes.
It started happening after I switched to the newest master branch of Octo-Tiger and cppuddle and after the recent Perlmutter system upgrades. Before that, the failure I saw was #473.
Steps to Reproduce the Problem
The spack spec I am using