AgnostiqHQ / covalent-slurm-plugin

Executor plugin interfacing Covalent with Slurm
https://covalent.xyz
Apache License 2.0

`self._remote_func_filename` is not defined when a SLURM job hits the walltime #49

Open Andrew-S-Rosen opened 1 year ago

Environment

What is happening?

I tried submitting a SLURM job and got the following traceback.

Traceback (most recent call last):
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 452, in execute
    result = await self.run(function, args, kwargs, task_metadata)
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 474, in run
    await self._poll_slurm(slurm_job_id, conn)
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 333, in _poll_slurm
    raise RuntimeError("Job failed with status:\n", status)
RuntimeError: ('Job failed with status:\n', '')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_dispatcher/_core/runner.py", line 293, in _run_task
    output, stdout, stderr, exception_raised = await executor._execute(
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 421, in _execute
    return await self.execute(
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 459, in execute
    await self.teardown(task_metadata=task_metadata)
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 505, in teardown
    remote_func_filename=self._remote_func_filename,
AttributeError: 'SlurmExecutor' object has no attribute '_remote_func_filename'

My guess (?) is that `self._remote_func_filename` is never set because the RuntimeError is raised first, so `teardown` then fails with the AttributeError when it tries to read it.
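A minimal sketch of the failure mode I suspect. `ToyExecutor` and the file name `func.pkl` are hypothetical stand-ins, not the plugin's actual code; the only point is that an attribute assigned after a step that raises is missing when cleanup runs.

```python
import asyncio


class ToyExecutor:
    """Hypothetical stand-in for an executor; not the plugin's real code."""

    async def _poll(self):
        # Simulate a job that died, e.g. because it hit the walltime.
        raise RuntimeError("Job failed with status:\n", "")

    async def run(self):
        await self._poll()
        # Never reached when _poll raises, so the attribute is never set.
        self._remote_func_filename = "func.pkl"

    async def teardown(self):
        # Raises AttributeError if run() failed before the assignment.
        return self._remote_func_filename

    async def execute(self):
        try:
            await self.run()
        except Exception:
            # Cleanup runs while the RuntimeError is still being handled.
            await self.teardown()
            raise


def demo():
    """Return the name of the original exception and the teardown error text."""
    try:
        asyncio.run(ToyExecutor().execute())
    except AttributeError as err:
        return type(err.__context__).__name__, str(err)


print(demo())
```

The AttributeError from `teardown` replaces the RuntimeError as the propagating exception, which matches the "During handling of the above exception" chain in the traceback.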

How can we reproduce the issue?

import time

import covalent as ct

executor = ct.executor.SlurmExecutor(<redacted>)


@ct.lattice
@ct.electron(executor=executor)
def add(val1, val2):
    # Sleep longer than the job's walltime so that Slurm kills the job.
    time.sleep(10000)
    return val1 + val2


dispatch_id = ct.dispatch(add)(1, 2)
result = ct.get_result(dispatch_id, wait=True)
print(result)

What should happen?

The covalent task should abort gracefully.

Any suggestions?

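I think this error happens any time the job dies unexpectedly (e.g. it hits the walltime or fails for some other reason). In those cases the task does not terminate gracefully: the AttributeError from `teardown` is raised on top of the original failure.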
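One possible direction, sketched under assumptions: give the attribute a default in `__init__`, record it as soon as the file name is known, and have `teardown` tolerate a missing value. `GuardedExecutor`, `func.pkl`, the `cleaned` list, and the `TIMEOUT` status are all hypothetical here; this is not a patch against the plugin.

```python
import asyncio


class GuardedExecutor:
    """Hypothetical executor whose teardown tolerates a failed run."""

    def __init__(self):
        # Default value, so teardown can always read the attribute.
        self._remote_func_filename = None
        self.cleaned = []

    async def run(self, fail):
        # Record the file name before any step that can raise.
        self._remote_func_filename = "func.pkl"
        if fail:
            raise RuntimeError("Job failed with status:\n", "TIMEOUT")

    async def teardown(self):
        # Only clean up what was actually recorded.
        if self._remote_func_filename is not None:
            self.cleaned.append(self._remote_func_filename)

    async def execute(self, fail):
        try:
            await self.run(fail)
        finally:
            await self.teardown()


def outcome(fail):
    """Run the executor and report what the caller sees plus what was cleaned."""
    ex = GuardedExecutor()
    try:
        asyncio.run(ex.execute(fail))
        return "ok", ex.cleaned
    except RuntimeError as err:
        return "failed: " + str(err.args[1]), ex.cleaned


print(outcome(True))
print(outcome(False))
```

With this shape the original RuntimeError reaches the caller unchanged, and cleanup still runs for whatever was recorded before the failure.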

Addendum

It seems that adding the `parsable: ""` option fixes the empty returned status, but otherwise the same issue arises: the AttributeError is still raised in `teardown`.
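For context on why `parsable` could matter: `sbatch` prints `Submitted batch job <id>` by default, while `sbatch --parsable` prints only the job ID, optionally followed by `;` and the cluster name. I have not checked how the plugin reads this output, so the helper below (`parse_job_id`, a hypothetical name) only illustrates handling both formats; whether a mis-parsed job ID is what produced the empty status is an assumption.

```python
import re


def parse_job_id(sbatch_stdout: str) -> str:
    """Extract the job ID from either sbatch output format (illustrative helper)."""
    text = sbatch_stdout.strip()
    # --parsable format: "<jobid>" or "<jobid>;<cluster>"
    if re.fullmatch(r"\d+(;.*)?", text):
        return text.split(";", 1)[0]
    # Default format: "Submitted batch job <jobid>"
    match = re.search(r"Submitted batch job (\d+)", text)
    if match is None:
        raise ValueError(f"Unrecognized sbatch output: {text!r}")
    return match.group(1)


print(parse_job_id("Submitted batch job 12345\n"))
print(parse_job_id("12345;mycluster\n"))
```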