Libensemble / libensemble

A Python toolkit for coordinating asynchronous and dynamic ensembles of calculations.
https://libensemble.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Crash of example on Perlmutter head node #1388

Open n01r opened 2 months ago

n01r commented 2 months ago

I ran the first script from this optimas example on the Perlmutter head node, since it uses only one active worker thread. libEnsemble correctly detects that I am running on Perlmutter, but it then crashes because SLURM_JOB_PARTITION is not set: I was not running inside a job.

Either this could be fixed, or a warning should tell users to run inside a compute job instead. When I did that, everything worked.

Traceback (most recent call last):
  File "/global/cfs/cdirs/m4272/mgarten/sw/perlmutter/gpu/venvs/optimas-wake-t/lib/python3.11/site-packages/libensemble/resources/platforms.py", line 312, in known_system_detect
    name = detect_systems[domain_name]
           ~~~~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: 'chn.perlmutter.nersc.gov'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/pscratch/sd/m/mgarten/optimas/01_test_error_propagation/./run_test.py", line 40, in <module>
    exploration.run()
  File "/global/cfs/cdirs/m4272/mgarten/sw/perlmutter/gpu/venvs/optimas-wake-t/lib/python3.11/site-packages/optimas/explorations/base.py", line 212, in run
    history, persis_info, flag = libE(
                                 ^^^^^
  File "/global/cfs/cdirs/m4272/mgarten/sw/perlmutter/gpu/venvs/optimas-wake-t/lib/python3.11/site-packages/pydantic/validate_call_decorator.py", line 59, in wrapper_function
    return validate_call_wrapper(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m4272/mgarten/sw/perlmutter/gpu/venvs/optimas-wake-t/lib/python3.11/site-packages/pydantic/_internal/_validate_call.py", line 81, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m4272/mgarten/sw/perlmutter/gpu/venvs/optimas-wake-t/lib/python3.11/site-packages/libensemble/libE.py", line 247, in libE
    platform_info = get_platform(libE_specs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m4272/mgarten/sw/perlmutter/gpu/venvs/optimas-wake-t/lib/python3.11/site-packages/libensemble/resources/platforms.py", line 329, in get_platform
    name = libE_specs.get("platform") or os.environ.get("LIBE_PLATFORM") or known_system_detect()
                                                                            ^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m4272/mgarten/sw/perlmutter/gpu/venvs/optimas-wake-t/lib/python3.11/site-packages/libensemble/resources/platforms.py", line 314, in known_system_detect
    name = known_envs()
           ^^^^^^^^^^^^
  File "/global/cfs/cdirs/m4272/mgarten/sw/perlmutter/gpu/venvs/optimas-wake-t/lib/python3.11/site-packages/libensemble/resources/platforms.py", line 295, in known_envs
    if "gpu_" in os.environ.get("SLURM_JOB_PARTITION"):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: argument of type 'NoneType' is not iterable

shuds13 commented 2 months ago

This logic certainly needs to be more robust. If there is no SLURM_JOB_PARTITION, we could default to the "perlmutter_c" settings, or just "perlmutter", and give a warning.
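
A minimal sketch of what such a fallback might look like, assuming the check stays a simple substring test on the partition name. The helper name detect_perlmutter_platform and the "perlmutter_g" return value are hypothetical and chosen for illustration; this is not the change made in libEnsemble itself.

```python
import os
import warnings


def detect_perlmutter_platform(environ=None):
    """Hypothetical helper: pick a platform name without crashing outside a job."""
    environ = os.environ if environ is None else environ
    partition = environ.get("SLURM_JOB_PARTITION")

    if partition is None:
        # Not inside a Slurm job (e.g. on a head node): warn and fall back
        # to the CPU settings instead of raising a TypeError.
        warnings.warn(
            "SLURM_JOB_PARTITION is not set; falling back to 'perlmutter_c'. "
            "Consider running inside a compute job.",
            stacklevel=2,
        )
        return "perlmutter_c"

    # "perlmutter_g" is an assumed name for the GPU settings.
    return "perlmutter_g" if "gpu_" in partition else "perlmutter_c"
```

Passing the environment as an argument keeps the function easy to test without modifying os.environ.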

n01r commented 2 months ago

Right, of course, it is good practice to do any computation on a compute node. Often, one would still try to run a little test on the head node if not much computation is involved (especially since the head node has an internet connection and one can install missing packages, etc.).

So, a fallback option and a warning would make sure that users will not be confused. :)
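
Until a fallback is in place, the traceback hints at a possible workaround: get_platform consults libE_specs.get("platform") and the LIBE_PLATFORM environment variable before it calls known_system_detect(). The sketch below mirrors only that precedence. The function resolve_platform and its arguments are hypothetical stand-ins, and "perlmutter_c" is used simply because it is the name mentioned above.

```python
def resolve_platform(libE_specs, environ, detect):
    """Hypothetical mirror of the precedence shown in the traceback."""
    # `or` short-circuits, so detect() runs only if neither override is set.
    return libE_specs.get("platform") or environ.get("LIBE_PLATFORM") or detect()


def failing_detect():
    # Stands in for the auto-detection that crashes outside a Slurm job.
    raise TypeError("argument of type 'NoneType' is not iterable")


# With an override present, the failing detection is never reached.
from_env = resolve_platform({}, {"LIBE_PLATFORM": "perlmutter_c"}, failing_detect)
from_specs = resolve_platform({"platform": "perlmutter_c"}, {}, failing_detect)
print(from_env, from_specs)
```

If this precedence holds in the installed version, exporting LIBE_PLATFORM in the shell before launching the script should skip the failing detection on the head node.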

shuds13 commented 2 months ago

I've updated the logic in #1391. I will test it when Perlmutter is back from maintenance.

n01r commented 2 months ago

Thanks, Stephen!

jlnav commented 1 month ago

Was this addressed by the recent release?