E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Turn off xpmem in OFED 5.8 on Chrysalis #6359

Closed rljacob closed 2 weeks ago

rljacob commented 4 weeks ago

Add an environment variable on Chrysalis to turn off xpmem when using the new OFED 5.8 network drivers. A bug in xpmem can leave nodes stuck in an unkillable state after a model crash.

[BFB]

rljacob commented 4 weeks ago

@amametjanov Can you check that this doesn't slow runs down? It did not on my test with an ne30 production coupled case.

github-actions[bot] commented 4 weeks ago

PR Preview Action v1.4.7: :rocket: Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6359/ on branch gh-pages at 2024-04-23 03:07 UTC

rljacob commented 3 weeks ago

@amametjanov please try again with this new version that doesn't have the typo.

amametjanov commented 3 weeks ago

PFS.ne30pg2_r05_IcoswISC30E3r5.F2010.chrysalis_intel.bench-noio:

2024-04-19 20:09:14: MEMCOMP: Memory usage highwater changed by -3.54%: baseline=6373.210 MB, tolerance=5%, current=6147.640 MB
 ---------------------------------------------------
2024-04-19 20:09:14: TPUTCOMP: Throughput changed by 0.22%: baseline=1.791 sypd, tolerance=5%, current=1.787 sypd

PFS.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel.bench-noio:

2024-04-19 20:03:46: MEMCOMP: Memory usage highwater changed by -4.62%: baseline=4902.090 MB, tolerance=5%, current=4675.830 MB
 ---------------------------------------------------
2024-04-19 20:03:46: TPUTCOMP: Throughput changed by 0.60%: baseline=1.997 sypd, tolerance=5%, current=1.985 sypd
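
For readers checking the numbers, the percentages in the MEMCOMP and TPUTCOMP lines above can be reproduced from the baseline and current values. The following is a minimal sketch, not the test framework's own code. It assumes the memory change is `(current - baseline) / baseline` and the throughput change is `(baseline - current) / baseline`, which matches the signs in the log (a positive throughput change means the run got slower). The function names and the one-sided tolerance check are hypothetical.

```python
# Sketch only: reproduces the percentages reported in the log lines above.
# The sign conventions and the one-sided tolerance check are assumptions
# inferred from the reported values, not taken from the framework's source.

def memory_change_pct(baseline_mb: float, current_mb: float) -> float:
    """Relative change in memory highwater; negative means less memory used."""
    return 100.0 * (current_mb - baseline_mb) / baseline_mb


def throughput_change_pct(baseline_sypd: float, current_sypd: float) -> float:
    """Relative drop in throughput; positive means the run got slower."""
    return 100.0 * (baseline_sypd - current_sypd) / baseline_sypd


def within_tolerance(change_pct: float, tolerance_pct: float = 5.0) -> bool:
    """Assumed one-sided check: only a change above the tolerance fails."""
    return change_pct < tolerance_pct


# (memory baseline MB, memory current MB, sypd baseline, sypd current)
RESULTS = {
    "F2010": (6373.210, 6147.640, 1.791, 1.787),
    "WCYCL1850": (4902.090, 4675.830, 1.997, 1.985),
}

if __name__ == "__main__":
    for case, (mem_base, mem_cur, tput_base, tput_cur) in RESULTS.items():
        mem = memory_change_pct(mem_base, mem_cur)
        tput = throughput_change_pct(tput_base, tput_cur)
        print(
            f"{case}: memory {mem:.2f}% (ok={within_tolerance(mem)}), "
            f"throughput {tput:.2f}% (ok={within_tolerance(tput)})"
        )
```

Both cases stay well inside the 5% tolerance: throughput drops by well under 1% and the memory highwater goes down.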

rljacob commented 3 weeks ago

This is now set in the openmpi module by default, so we don't need to add it.

rljacob commented 3 weeks ago

Removed it from the module. It was in place from 2pm to 10pm on April 22.

rljacob commented 2 weeks ago

Revised the title and comment because this variable is needed all the time, not just with OpenMPI.