AgnostiqHQ / covalent-slurm-plugin

Executor plugin interfacing Covalent with Slurm
https://covalent.xyz
Apache License 2.0
27 stars · 5 forks

Slurm electrons fail when called within a Dask sublattice which is itself called from a Dask lattice. #69

Open jackbaker1001 opened 1 year ago

jackbaker1001 commented 1 year ago

Environment

What is happening?

When running a Slurm electron within a base (Dask) sublattice and dispatching that sublattice within a base (Dask) lattice, the dispatch runs on the remote cluster and the job finishes, but the workflow then fails while retrieving the job result. The traceback reported in the GUI is:

Traceback (most recent call last):
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent_dispatcher/_core/runner.py", line 251, in _run_task
output, stdout, stderr, status = await executor._execute(
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent/executor/base.py", line 628, in _execute
return await self.execute(
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent/executor/base.py", line 657, in execute
result = await self.run(function, args, kwargs, task_metadata)
File "/Users/jbaker/code/covalent/covalent-slurm-plugin/covalent_slurm_plugin/slurm.py", line 695, in run
result, stdout, stderr, exception = await self._query_result(
File "/Users/jbaker/code/covalent/covalent-slurm-plugin/covalent_slurm_plugin/slurm.py", line 577, in _query_result
async with aiofiles.open(stderr_file, "r") as f:
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/aiofiles/base.py", line 78, in __aenter__
self._obj = await self._coro
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/aiofiles/threadpool/__init__.py", line 80, in _open
f = yield from loop.run_in_executor(executor, cb)
File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/jbaker/.local/share/covalent/data/ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091/stdout-ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091-0.log'
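For context, the failing call in `_query_result` uses the common asyncio pattern of opening a log file off the event loop via a thread-pool executor (which is what `aiofiles` does under the hood). A minimal stdlib-only sketch, not the plugin's actual code, that reproduces the same `FileNotFoundError` surfacing through `run_in_executor` looks like:

```python
import asyncio
from pathlib import Path

async def read_log(path: Path) -> str:
    # Mimic aiofiles: run the blocking open/read in the default thread pool
    # so the event loop is never blocked on file I/O.
    loop = asyncio.get_running_loop()

    def _read() -> str:
        # Raises FileNotFoundError if the local log was never staged.
        with open(path, "r") as f:
            return f.read()

    return await loop.run_in_executor(None, _read)

async def main() -> str:
    missing = Path("/tmp/does-not-exist/stdout-0.log")  # hypothetical path
    try:
        return await read_log(missing)
    except FileNotFoundError as exc:
        return f"missing log: {exc.filename}"

result = asyncio.run(main())
print(result)
```

The point is that the exception originates in the worker thread but propagates through `await`, which is why the traceback above threads through `aiofiles` and `concurrent/futures/thread.py` before reaching `FileNotFoundError`.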

How can we reproduce the issue?

I am using the sshproxy extra req and have prepared my covalent config file as suggested in the root README.md.

Here's a simple workflow to reproduce the above:

import covalent as ct
import numpy as np

executor = ct.executor.SlurmExecutor(
       remote_workdir="<wdir>",
       options={
           "qos": "regular",
           "t": "00:05:00",
           "nodes": 1,
           "C": "gpu",
           "A": "<acc code>",
           "J": "bug_test",
           "ntasks-per-node": 4,
           "gpus-per-task": 1,
           "gpu-bind": "map_gpu:0,1,2,3"
       },
       prerun_commands=[
           "export COVALENT_CONFIG_DIR=<somewhere in scratch>",
           "export COVALENT_CACHE_DIR=<somewhere in scratch>",
           "export SLURM_CPU_BIND=\"cores\"",
           "export OMP_PROC_BIND=spread",
           "export OMP_PLACES=threads",
           "export OMP_NUM_THREADS=1",
       ],
       username="<username>",
       ssh_key_file="<key>",
       cert_file="<cert>",
       address="perlmutter-p1.nersc.gov",
       conda_env="<conda env>",
       use_srun=False
)

@ct.electron
def get_rand_sum_length(lo, hi):
    np.random.seed(1984)
    return np.random.randint(lo, hi)

# Slurm electron
@ct.electron(executor=executor)
def get_rand_num_slurm(lo, hi):
    np.random.seed(1984)
    return np.random.randint(lo, hi)  

@ct.electron
@ct.lattice
def add_n_random_nums(n, lo, hi):
    np.random.seed(1984)
    sum = 0
    for i in range(n):
        sum += get_rand_num_slurm(lo, hi)
    return sum

@ct.lattice
def random_num_workflow(lo, hi):
    n = get_rand_sum_length(lo, hi)
    sum = add_n_random_nums(n, lo, hi) # sublattice
    return sum

id = ct.dispatch(random_num_workflow)(1, 3)
ct_result = ct.get_result(dispatch_id=id, wait=True)
sum = ct_result.result
print(sum)

What should happen?

The code should run to completion, throwing no error in the GUI, and print an integer.

Any suggestions?

It seems to me that the interaction between the Dask and Slurm executors is not quite right. In any case, the file Covalent is looking for does exist in the remote directory at <wdir>/stdout-ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091-0.log but does not exist at the local path /Users/jbaker/.local/share/covalent/data/ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091/stdout-ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091-0.log. Indeed, in /Users/jbaker/.local/share/covalent/data/ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091/, the stdout files are contained within the node/ subdirs.
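To illustrate the mismatch described above (the `node_0` subdirectory name is a hypothetical example of the per-node subdirs, and the paths are taken from the traceback): the plugin reads the log from the dispatch root, while the local data directory actually nests the logs one level deeper:

```python
from pathlib import Path

dispatch_id = "ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091"  # from the traceback
data_dir = Path.home() / ".local/share/covalent/data" / dispatch_id
log_name = f"stdout-{dispatch_id}-0.log"

# Path the Slurm plugin tries to open (per the traceback)...
expected = data_dir / log_name

# ...versus where the logs actually land locally, inside a per-node
# subdirectory (name assumed for illustration).
actual = data_dir / "node_0" / log_name

print(expected)
print(actual)
```

The filenames agree; only the parent directory differs, which is consistent with the `FileNotFoundError` appearing only in the nested (lattice → sublattice → Slurm electron) case.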

santoshkumarradha commented 1 year ago

@cjao this seems like an edge case we need to look at, any recommended pattern for this ?

Andrew-S-Rosen commented 11 months ago

@jackbaker1001: Was this issue addressed by https://github.com/AgnostiqHQ/covalent/pull/1736?