AgnostiqHQ / covalent-slurm-plugin

Executor plugin interfacing Covalent with Slurm
https://covalent.xyz
Apache License 2.0
27 stars 6 forks source link

SSH connection problems with latest asyncssh version (2.15.0) #97

Open mlpgwdg opened 2 weeks ago

mlpgwdg commented 2 weeks ago

Environment

Installed with conda

What is happening?

Recently (2 days ago as of time of writing this), asyncssh updated to v. 2.15.0. Something in this update seems to have broken the Covalent SLURM plugin. In particular, attempts at submitting jobs error out at around line 524 of slurm.py, right after using scp to copy the pickle files over to the remote server. Pickle files can be found on the remote server, but no other files after this point manage to be copied, nor are any SLURM jobs started. Errors are rather cryptic and seem to change, from "SSH connection closed" to NoneType errors from a failed asyncssh conn object. Error stack trace confirms the location of the error and the source being the asyncssh library. After setting log level to debug in covalent's config option, and checking error.log for the failed SLURM executor node, this error trace appears:

Traceback (most recent call last):
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent_dispatcher/_core/runner.py", line 182, in _run_task
    output, stdout, stderr, status = await executor._execute(
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent/executor/base.py", line 695, in _execute
    return await self.execute(
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent/executor/base.py", line 724, in execute
    result = await self.run(function, args, kwargs, task_metadata)
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent_slurm_plugin/slurm.py", line 592, in run
    remote_paths = await self._copy_files(
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/covalent_slurm_plugin/slurm.py", line 537, in _copy_files
    await asyncssh.scp(temp_g.name, (conn, remote_py_script_filename))
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/scp.py", line 1041, in scp
    reader, writer = await _start_remote(
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/scp.py", line 190, in _start_remote
    writer, reader, _ = await conn.open_session(command, encoding=None)
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/connection.py", line 4198, in open_session
    chan, session = await self.create_session(
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/connection.py", line 4173, in create_session
    session = await chan.create(session_factory, command, subsystem,
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 1207, in create
    result = await self._make_request(b'exec', String(command))
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 740, in _make_request
    return await waiter
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/connection.py", line 1329, in data_received
    while self._inpbuf and self._recv_handler():
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/connection.py", line 1594, in _recv_packet
    processed = handler.process_packet(pkttype, seq, packet)
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/packet.py", line 237, in process_packet
    self._packet_handlers[pkttype](self, pkttype, pktid, packet)
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 656, in _process_request
    self._service_next_request()
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 416, in _service_next_request
    result = cast(Optional[bool], handler(packet))
  File "/home/myusername/miniconda3/envs/covalent_env/lib/python3.8/site-packages/asyncssh/channel.py", line 1246, in _process_exit_status_request
    self._session.exit_status_received(status)
AttributeError: 'NoneType' object has no attribute 'exit_status_received'

For a temporary fix: Revert to asyncssh v. 2.14.0 (restarting the covalent server and such, as needed)

For a more permanent fix: Some updates are needed in the plug in's code to be compatible with the latest version of asyncssh.

How can we reproduce the issue?

What should happen?

Job should run correctly. Instead, it will error out with an SSH connection closed or mentions of "NoneType has no attribute 'exit_status_received'"

Any suggestions?

For a temporary fix: Revert to asyncssh v. 2.14.0 (restarting the covalent server and such, as needed)

For a more permanent fix: Some updates are needed in the plug in's code to be compatible with the latest version of asyncssh.