Closed ndkeen closed 2 months ago
I think the land is so inexpensive in your SCREAM sims that you would hardly notice any degradation in the performance of the overall simulation.
Can someone confirm the source of these broadcast calls?
If they are coming from pio_inq calls, perhaps revisit the logic there to reduce the number of calls?
If nobody knows, then using a tool like https://github.com/LLNL/mpiP could help.
https://github.com/E3SM-Project/E3SM/pull/5690: It looks like Gautam's new decomposition does not reduce the number of broadcast calls.
@PeterCaldwell, @elynnwu, I have #5690 that adds a new ELM domain decomposition algorithm and it can be activated by the following change:
cat >> user_nl_elm << EOF
domain_decomp_type = 'simple'
EOF
I have only tested the PR with one thread being used. Would you like to test the branch to see if it fixes the issue?
I ran this branch but the hang was still there.
I've also tried adding mpi_barrier inside restartvar, which we think was the cause of the bcast flooding, but no luck either.
Summarizing a few key points from a Slack conversation:
@elynnwu Could you try cherry-picking my change in https://github.com/E3SM-Project/E3SM/pull/5699 to see if that fixes the issue?
It appears that hist_empty_htapes is the culprit here. I had it set to true and it was causing the hang. @whannah1 will open a separate issue addressing this.
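For anyone hitting the same hang, here is a minimal workaround sketch. It assumes the hang really is tied to hist_empty_htapes being true, as reported above, and that the case reads ELM namelist overrides from user_nl_elm in the case directory. It only sets the flag explicitly to false; it does not address the underlying problem.

```shell
# Run from the case directory.
# Assumption: the restart-write hang is triggered by hist_empty_htapes = .true.
cat >> user_nl_elm << EOF
hist_empty_htapes = .false.
EOF
```

If user_nl_elm already contains a hist_empty_htapes line, edit that line instead of appending a second one.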
I occasionally see (various) jobs hang on pm (pm-cpu or pm-gpu) and I've been trying to debug in general, but this one seems different. If I run a basic ne30 case asking for a restart at the end of the case, it seems to work. But if I ask for 2 restarts in the same job submission, it often hangs on the second restart write. So far the hang has always occurred while trying to write a file such as:
f30cpu.F2010-SCREAMv1.ne30pg2_ne30pg2.bspa.gnu.n022a128x1cXi2.pk1.30d.wr.elm.r.0001-01-11-00000.nc
I also encountered a hang on the first restart attempt, but I think that was after reading in a restart (i.e., using CONTINUE_RUN=TRUE).
I've tried a few different scenarios, but here is a case where I asked to run for 30 days, with restarts every 5th day.
/global/cfs/cdirs/e3sm/ndk/e3sm_scratch/pm-cpu/bspa/f30cpu.F2010-SCREAMv1.ne30pg2_ne30pg2.bspa.gnu.n022a128x1cXi2.pk1.30d.wr
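A 30-day run with restarts every 5 days, as described above, would typically be set up with CIME's xmlchange along the following lines. The exact commands used for this case are not shown in the thread, so treat this as a sketch under that assumption rather than the actual configuration.

```shell
# Run from the case directory, where CIME places the xmlchange script.
if [ -x ./xmlchange ]; then
  # Assumed settings: 30-day run length, restart files written every 5 days.
  ./xmlchange STOP_OPTION=ndays,STOP_N=30
  ./xmlchange REST_OPTION=ndays,REST_N=5
else
  echo "xmlchange not found: run this from the case directory" >&2
fi
```

With these settings the job writes several restarts in one submission, which is the scenario where the hang on the second restart write was seen.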