E3SM-Project / e3sm-unified

A metapackage for a unified anaconda environment for analyzing results from the Energy Exascale Earth System Model (E3SM).
BSD 3-Clause "New" or "Revised" License

slow on NERSC compute nodes #86

Closed wagmanbe closed 3 years ago

wagmanbe commented 3 years ago

Hi, my E3SM diagnostics jobs aren't running. Could the E3SM-Unified environment be bogging them down?

Interactive jobs on NERSC knl and haswell nodes slow to a crawl after I load the E3SM-Unified environment, e.g.:

`salloc --nodes=1 --partition=debug --time=00:30:00 -C knl`

`source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified.sh`

After this, everything slows down and my diagnostic script hangs on the import statements.

These problems do not occur on the login node.

chengzhuzhang commented 3 years ago

I suspect this might not be an e3sm-unified problem. I just ran e3sm-diags from e3sm-unified in an interactive job on haswell, and it ran well. However, knl has been problematic, which is a known issue: https://github.com/E3SM-Project/e3sm_diags/issues/314. @wagmanbe, would you try it again on haswell? If it still gives trouble, could you share your run script so I can try to reproduce the problem?

xylar commented 3 years ago

Are you not seeing these problems on knl when you use an E3SM_Diags development environment? I have always found Python packages to run slowly on knl, so I would be surprised if this were specific to E3SM-Unified, but I can investigate if it appears to be. I agree that haswell is the recommended option for all Python codes.

wagmanbe commented 3 years ago

It's affecting both knl and haswell. Maybe it's a NERSC issue?

`salloc --nodes=1 --partition=debug --time=00:20:00 -C haswell`

`source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified.sh` <-- hangs for minutes

`python` <-- slow

`import os`

`from acme_diags.parameter.core_parameter import CoreParameter` <-- hangs for minutes
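To put numbers on hangs like these, here is a minimal sketch that times individual imports so a login node and a compute node can be compared directly. The helper names `time_import` and `report` are hypothetical, made up for this illustration; they are not part of e3sm-unified or e3sm_diags.

```python
import importlib
import time


def time_import(module_name):
    """Import module_name and return the elapsed wall-clock seconds.

    A module that is already loaded comes straight from the sys.modules
    cache, so the number is only meaningful in a fresh interpreter.
    """
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


def report(module_names):
    """Return a list of (module_name, seconds) pairs, one per import."""
    return [(name, time_import(name)) for name in module_names]
```

Calling `report(["os", "acme_diags.parameter.core_parameter"])` in a fresh `python` session on each node type, after sourcing the load script, should show whether the delay is in the import itself. For a per-module breakdown without any extra code, Python 3.7 and later also accept `python -X importtime -c "import <module>"`.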

darincomeau commented 3 years ago

NERSC was having problems yesterday afternoon and evening with very slow compute node performance, which a few of us experienced. It was posted on their status page: https://www.nersc.gov/live-status/motd/. There's no notice now, so I'd recommend trying again.

wagmanbe commented 3 years ago

Thank you, but this problem is occurring just the same today.

chengzhuzhang commented 3 years ago

In this case, I suspect that the compute node problem is still there. I ran commands similar to the ones quoted below yesterday afternoon and saw the same behavior. When I tried again much later yesterday, everything looked fine.

> It's affecting both knl and haswell. Maybe it's a NERSC issue?
>
> `salloc --nodes=1 --partition=debug --time=00:20:00 -C haswell`
>
> `source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified.sh` <-- hangs for minutes
>
> `python` <-- slow
>
> `import os`
>
> `from acme_diags.parameter.core_parameter import CoreParameter` <-- hangs for minutes

wagmanbe commented 3 years ago

It's at least 10x faster this afternoon.