facebookresearch / EGG

EGG: Emergence of lanGuage in Games
MIT License

Error with `core.init` when using Slurm #262

Closed vrmer closed 1 month ago

vrmer commented 1 month ago

Hi, I'm trying to run experiments with EGG on Slurm. When I call `core.init`, EGG tries to interact with Slurm directly through the `maybe_init_distributed` function in `egg/core/distributed.py`, using the following lines:

hostnames = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]]
)

However, the job is already assigned to and running on a separate node that does not have Slurm installed, so the `scontrol` call fails with an error.

Expected Behavior

I believe that code running inside a Slurm job shouldn't call Slurm commands directly, since the node running the job might not have Slurm installed.
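
One way the expected behavior could look, as a minimal sketch rather than EGG's actual code: check whether `scontrol` is on the PATH before calling it, and fall back when it is not. The helper name `slurm_hostnames` and the fallback rule (a node list with no brackets or commas is treated as a single hostname) are assumptions made for illustration.

```python
import shutil
import subprocess
from typing import List, Optional


def slurm_hostnames(nodelist: str) -> Optional[List[str]]:
    """Hypothetical helper: expand a Slurm node list, or return None if that is not possible."""
    if shutil.which("scontrol") is not None:
        # scontrol is available on this node, so let Slurm expand the list.
        output = subprocess.check_output(["scontrol", "show", "hostnames", nodelist])
        return output.decode("utf-8").split()
    # No scontrol here: a plain name such as "node07" needs no expansion.
    if "[" not in nodelist and "," not in nodelist:
        return [nodelist]
    # Compressed forms such as "node[01-04]" are not expanded by this sketch.
    return None
```

With a guard like this, a caller could treat `None` as "cannot determine the node list" and decide whether to continue in non-distributed mode or stop with a clearer error message.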

Current Behavior

I receive the following error message:

Traceback (most recent call last):
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/cs.aau.dk/ic18eg/multi_agent_diversity/main.py", line 27, in <module>
    opts = core.init(
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/site-packages/egg/core/util.py", line 175, in init
    common_opts = _get_params(arg_parser, params)
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/site-packages/egg/core/util.py", line 141, in _get_params
    args.distributed_context = maybe_init_distributed(args)
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/site-packages/egg/core/distributed.py", line 68, in maybe_init_distributed
    hostnames = subprocess.check_output(
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/subprocess.py", line 503, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/subprocess.py", line 1863, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'scontrol'
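
The last frames of the traceback show the exception is raised while `subprocess` is starting the child process, which suggests the `scontrol` executable was not found on the node rather than that Slurm itself reported an error. A small sketch that reproduces the same kind of failure outside EGG is below; `run_or_report` and the command name `definitely-not-installed-cmd` are made up for illustration and stand in for the real call.

```python
import subprocess


def run_or_report(command):
    """Run a command and return its output, or a short message if the binary is missing."""
    try:
        return subprocess.check_output(command).decode("utf-8")
    except FileNotFoundError as error:
        # Raised while launching the child process: the executable could not be found.
        return f"missing executable: {error.filename}"


# The command name below is hypothetical and only stands in for scontrol.
print(run_or_report(["definitely-not-installed-cmd", "show", "hostnames", "node07"]))
```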

Steps to Reproduce

  1. Provide parameters in a script using the core.init function, for instance:
    opts = core.init(
        params=[
            f"--random_seed={config.random_seed}",
            f"--lr={config.lr}",
            f"--batch_size={config.batch_size}",
            f"--optimizer={config.optimizer}",
        ]
    )
  2. Run the script as a Slurm job.

Detailed Description

Either I'm missing something, or the library fails on nodes that do not have Slurm installed.


Thanks for the help!