E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

PSTRID process striding broken in EAMxx gpu cases #3019

Open amametjanov opened 3 weeks ago

amametjanov commented 3 weeks ago

I'm getting run-time property-check errors with a non-default PSTRID and am hoping someone can take a look.

./cime/scripts/create_test SMS_D_P32x1.ne4_ne4.F2000-SCREAMv1-AQP1.pm-gpu_gnugpu.scream-output-preset-2

runs fine by default on 8 nodes at 4 tasks/node. If I set the process stride to PSTRID=16 (still 4 tasks/node on 8 nodes) with

./preview_run && ./pelayout
./xmlchange MAX_MPITASKS_PER_NODE=64
./xmlchange PSTRID=16
./case.setup -r
./preview_run && ./pelayout

I get the errors below. A similar case works fine on CPUs:

/pscratch/sd/a/azamat/e3sm_scratch/pm-gpu/SMS_D_P32x1.ne4pg2_oQU480.WCYCLXX2010.pm-gpu_gnu.20240905/run-02-8x4x1-pstrid16-ok-2.624sypd/

Error:

  0: Using memory pool. Initial size: 4.92383GB ;  Grow size: 4.92383GB.
  0: NVIDIA A100-SXM4-40GB
  0: INFORM: Automatically inserting fence() after every parallel_for
  0: bfbhash>              0 8d32ee02e0000000 (Hommexx)
  0:
  0:  FAIL:
  0: false
  0: /global/u2/a/azamat/saul/scream/components/eamxx/src/share/atm_process/atmosphere_process.cpp:455
  0: Error! Failed post-condition property check (cannot be repaired).
  0:   - Atmosphere process name: p3
  0:   - Property check name: T_mid within interval [100, 500]
  0:   - Atmosphere process MPI Rank: 0
  0:   - Message: Check failed.
  0:   - check name: T_mid within interval [100, 500]
  0:   - field id: T_mid[Physics GLL] <double:ncol,lev>(30,72) [K]
  0:   - minimum:
  0:     - value: 1.46505e-09
  0:     - indices (w/ global column index): (106,16)
  0:     - lat/lon: (6.21885, 0)
  0:     - additional data (w/ local column index):
  0:
  0:      phis<ncol>(30)
  0:
  0:   phis(2)
  0:     0,
  0:
  0:      landfrac<ncol>(30)
  0:
  0:   landfrac(2)
  0:     0,
  0:
  0:     END OF ADDITIONAL DATA
  0:
  0:   - maximum:
  0:     - value: 0.017285
  0:     - indices (w/ global column index): (106,71)
  0:     - lat/lon: (6.21885, 0)
  0:     - additional data (w/ local column index):
  0:
  0:      phis<ncol>(30)
  0:
  0:   phis(2)
  0:     0,
  0:
  0:      landfrac<ncol>(30)
  0:
  0:   landfrac(2)
  0:     0,
  0:
  0:     END OF ADDITIONAL DATA

Path to that run-dir:

/pscratch/sd/a/azamat/e3sm_scratch/pm-gpu/SMS_D_P32x1.ne4_ne4.F2000-SCREAMv1-AQP1.pm-gpu_gnugpu.scream-output-preset-2.20240923/run-02-err-8x4x1-pstrid16/

This is with the Sep-4 version of master (42ab514913).
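
For reference, the layout arithmetic behind the two configurations above can be sketched as follows. This is only an illustration, not CIME's implementation: it assumes task i lands on global slot ROOTPE + i * PSTRID and that slots fill nodes in order, MAX_MPITASKS_PER_NODE slots per node. The function name `tasks_per_node` is hypothetical, and the default of 4 slots per node is an assumption inferred from "4 tasks/node".

```python
from collections import Counter

def tasks_per_node(ntasks, pstrid, slots_per_node, rootpe=0):
    """Count MPI tasks per node, assuming task i sits on slot rootpe + i*pstrid."""
    counts = Counter()
    for i in range(ntasks):
        slot = rootpe + i * pstrid
        counts[slot // slots_per_node] += 1
    return dict(counts)

# Assumed default layout: PSTRID=1 with 4 slots per node.
default = tasks_per_node(ntasks=32, pstrid=1, slots_per_node=4)

# Values from the xmlchange commands above: PSTRID=16, MAX_MPITASKS_PER_NODE=64.
strided = tasks_per_node(ntasks=32, pstrid=16, slots_per_node=64)

for name, layout in (("default", default), ("strided", strided)):
    print(name, len(layout), "nodes,", sorted(set(layout.values())), "tasks/node")
```

Under these assumptions both layouts put 4 tasks on each of 8 nodes, so the node-level placement is the same and only the slot numbering within each node differs.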

PeterCaldwell commented 11 hours ago

Context - we need to change PSTRID to interleave atm and ocn processes on the same nodes, which would allow us to do coupled k-scale runs almost as quickly as we can do atm-only F cases right now. Thus I see this as a moderately high priority task.
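
A rough sketch of what interleaving means here, under stated assumptions: task i of a component sits on global slot ROOTPE + i * PSTRID, and nodes hold 64 slots each. The ocean values (32 tasks, ROOTPE=1) are hypothetical and chosen only for illustration, and `slots` and `nodes` are made-up helper names.

```python
def slots(ntasks, pstrid, rootpe):
    """Global slots used by a component, assuming slot = rootpe + i*pstrid."""
    return [rootpe + i * pstrid for i in range(ntasks)]

def nodes(slot_list, slots_per_node):
    """Set of node indices touched by a list of global slots."""
    return {s // slots_per_node for s in slot_list}

SLOTS_PER_NODE = 64  # MAX_MPITASKS_PER_NODE from the issue

atm_slots = slots(ntasks=32, pstrid=16, rootpe=0)
ocn_slots = slots(ntasks=32, pstrid=16, rootpe=1)  # hypothetical ocean layout

shared_nodes = nodes(atm_slots, SLOTS_PER_NODE) & nodes(ocn_slots, SLOTS_PER_NODE)
overlap = set(atm_slots) & set(ocn_slots)

print("nodes hosting both components:", sorted(shared_nodes))
print("slots claimed by both components:", sorted(overlap))
```

With these numbers every node hosts tasks from both components while no slot is claimed twice, which is the kind of layout a non-default PSTRID is meant to enable.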

bartgol commented 11 hours ago

Since Sept 4th was a while ago, can we first confirm that the error still happens with current master?

Other thoughts: