Hi, I'm trying to run experiments with EGG on Slurm, but when I call core.init, EGG tries to interact with Slurm directly: the maybe_init_distributed function in egg/core/distributed.py runs the scontrol command through subprocess.check_output.
However, the job has already been assigned to a separate compute node, and that node does not have Slurm installed, so the call fails with an error.
Expected Behavior
I believe a Slurm job should not need to invoke Slurm commands itself, since the node it runs on might not have Slurm installed.
Current Behavior
I receive the following error message:
Traceback (most recent call last):
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/cs.aau.dk/ic18eg/multi_agent_diversity/main.py", line 27, in <module>
    opts = core.init(
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/site-packages/egg/core/util.py", line 175, in init
    common_opts = _get_params(arg_parser, params)
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/site-packages/egg/core/util.py", line 141, in _get_params
    args.distributed_context = maybe_init_distributed(args)
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/site-packages/egg/core/distributed.py", line 68, in maybe_init_distributed
    hostnames = subprocess.check_output(
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/subprocess.py", line 503, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/cs.aau.dk/ic18eg/.conda/envs/multi_agent_diversity/lib/python3.10/subprocess.py", line 1863, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'scontrol'
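The last frame can be reproduced without EGG or Slurm. subprocess.check_output raises FileNotFoundError as soon as the executable is not on PATH, before any Slurm logic runs. The sketch below shows only that mechanism. The helper name missing_executable and the made-up command name are placeholders of mine, not anything from EGG.

```python
import subprocess


def missing_executable(cmd):
    """Run cmd and return the name of the executable if it could not be found.

    Returns None when the executable exists, whatever its exit status.
    """
    try:
        subprocess.check_output(cmd, stderr=subprocess.DEVNULL)
    except FileNotFoundError as exc:
        # Raised by Popen when the binary is not on PATH, which is
        # the same failure as the 'scontrol' error in the traceback.
        return exc.filename
    except subprocess.CalledProcessError:
        # The binary was found but exited with a non-zero status.
        return None
    return None


if __name__ == "__main__":
    # Hypothetical command name chosen so that it does not exist anywhere.
    print(missing_executable(["no-such-binary-egg-demo", "show", "hostnames"]))
```

On a node without the Slurm client tools, passing a command that starts with scontrol should take the same FileNotFoundError branch.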
Steps to Reproduce
Provide parameters in a script using the core.init function, for instance:
Detailed Description
Either I'm missing something, or the library fails on nodes that do not have Slurm installed on them.
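For illustration, a guard along the following lines would avoid the crash. This is only a sketch under assumptions, not EGG's actual code. The function name slurm_hostnames and its which parameter are hypothetical. The traceback shows only that scontrol is called, so I am assuming the call is "scontrol show hostnames" applied to the SLURM_JOB_NODELIST environment variable.

```python
import os
import shutil
import subprocess
from typing import Callable, List, Mapping, Optional


def slurm_hostnames(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[List[str]]:
    """Return the job's hostnames, or None if they cannot be resolved here.

    Hypothetical helper: `which` is injectable only so the guard can be
    tested on machines with or without Slurm.
    """
    nodelist = env.get("SLURM_JOB_NODELIST")
    if nodelist is None:
        # Not running inside a Slurm job.
        return None
    if which("scontrol") is None:
        # Slurm client tools are not installed on this node, so skip
        # the call instead of raising FileNotFoundError.
        return None
    # Assumed command: expands a compressed node list into one name per line.
    out = subprocess.check_output(["scontrol", "show", "hostnames", nodelist])
    return out.decode().split()


if __name__ == "__main__":
    hosts = slurm_hostnames()
    if hosts is None:
        print("scontrol unavailable or not a Slurm job; skipping host lookup")
    else:
        print(hosts)
```

With a guard like this, the caller would have to decide what to do when the result is None, for example falling back to single-process training or reading the master address from an environment variable set in the job script.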
Thanks for the help!