E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

ELM hangs while writing output/restarts due to MPI_bcast flooding #1920

Closed ndkeen closed 2 months ago

ndkeen commented 2 years ago

I occasionally see (various) jobs hang on pm (pm-cpu or pm-gpu) and I've been trying to debug in general, but this one seems different. If I run a basic ne30 case asking for a restart at the end of the case, it seems to work. But if I ask for 2 restarts in the same job submission, it often hangs on the second restart write. The hang has always occurred while trying to write a file such as:

f30cpu.F2010-SCREAMv1.ne30pg2_ne30pg2.bspa.gnu.n022a128x1cXi2.pk1.30d.wr.elm.r.0001-01-11-00000.nc

I also encountered a hang on the first restart attempt, but I think that was after reading in a restart (i.e., using CONTINUE_RUN=TRUE).

I've tried a few different scenarios, but here is a case where I asked to run for 30 days, with restarts every 5th day.

/global/cfs/cdirs/e3sm/ndk/e3sm_scratch/pm-cpu/bspa/f30cpu.F2010-SCREAMv1.ne30pg2_ne30pg2.bspa.gnu.n022a128x1cXi2.pk1.30d.wr
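For reference, a run like the one described above (30 days, restarts every 5 days) would typically be configured from the case directory with CIME's `xmlchange`. This is only a sketch, not the exact commands used for this case: the values come from the description above, and the guard around `./xmlchange` is added just so the snippet is harmless outside a case directory.

```shell
#!/bin/sh
# Assumed values, taken from the description in this comment.
STOP_N=30   # total run length in days
REST_N=5    # restart interval in days

# Number of restart writes expected within one job submission.
expected_restarts=$(( STOP_N / REST_N ))
echo "Expecting ${expected_restarts} restart writes"

# Only modify the case if we are actually inside a CIME case directory.
if [ -x ./xmlchange ]; then
  ./xmlchange STOP_OPTION=ndays
  ./xmlchange STOP_N="${STOP_N}"
  ./xmlchange REST_OPTION=ndays
  ./xmlchange REST_N="${REST_N}"
else
  echo "No ./xmlchange here; skipping case changes"
fi
```

With these assumed settings, and assuming the run starts on 0001-01-01, the second restart falls at the end of day 10, which is consistent with the 0001-01-11-00000 timestamp in the restart file name above.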

bishtgautam commented 1 year ago

I think the land is so inexpensive in your SCREAM sims that you would hardly notice any degradation in the performance of the overall simulation.

sarats commented 1 year ago

Can someone confirm the source of these broadcast calls? If they're coming from pio_inq calls, perhaps revisit the logic to reduce the number of calls there?

If nobody knows, then using a tool like https://github.com/LLNL/mpiP could help.

https://github.com/E3SM-Project/E3SM/pull/5690: Looks like Gautam's new decomposition doesn't reduce the number of broadcast calls.

elynnwu commented 1 year ago

@PeterCaldwell, @elynnwu, I have #5690 that adds a new ELM domain decomposition algorithm and it can be activated by the following change:

cat >> user_nl_elm << EOF
domain_decomp_type = 'simple'
EOF

I have only tested the PR with one thread being used. Would you like to test the branch to see if it fixes the issue?

I ran this branch but the hang was still there.

I've also tried adding an mpi_barrier inside restartvar, which we think is where the bcast flooding comes from, but no luck either.

bishtgautam commented 1 year ago

Summarizing a few key points from a Slack conversation:

bishtgautam commented 1 year ago

@elynnwu Could you try cherry-picking my change in https://github.com/E3SM-Project/E3SM/pull/5699 to see if that fixes the issue?

elynnwu commented 1 year ago

It appears that hist_empty_htapes is the culprit here. I had it set to true and it was causing the hang. @whannah1 will open a separate issue addressing this.
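For anyone hitting the same hang, a quick way to see whether a case has this setting enabled is to scan the ELM namelist file. The sketch below is illustrative only: `check_hist_empty_htapes` is a hypothetical helper, and the temporary file stands in for a real `user_nl_elm`.

```shell
#!/bin/sh
# Hypothetical helper: report whether a namelist file sets
# hist_empty_htapes to true (e.g. ".true.", ".t.", "T").
check_hist_empty_htapes() {
  nl_file="$1"
  # Anchoring at the start of the line skips lines commented out with "!".
  if grep -Eiq '^[[:space:]]*hist_empty_htapes[[:space:]]*=[[:space:]]*\.?t(rue)?\.?' "$nl_file"; then
    echo "hist_empty_htapes is enabled in $nl_file"
    return 0
  fi
  echo "hist_empty_htapes is not enabled in $nl_file"
  return 1
}

# Demo on a temporary file standing in for user_nl_elm.
tmp_nl=$(mktemp)
printf 'hist_empty_htapes = .true.\n' > "$tmp_nl"
result=$(check_hist_empty_htapes "$tmp_nl")
echo "$result"
rm -f "$tmp_nl"
```

In a real case directory this would be called as `check_hist_empty_htapes user_nl_elm`. Based only on the observation above, a case that reports the setting as enabled is a candidate for this hang.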