Open dev-zero opened 3 years ago
Can you provide what aiida-core version you are running on
v1.5.2-128-g5b3e555c4 (develop), but also a git pull --rebase && pip install -e . && verdi daemon restart
did nothing. So, now it is: v1.5.2-136-gdc9ec6f91.
Is there a reason why an un-annotated tag is used for v1.6.0+? Note that git describe
will by default use only annotated tags.
maybe some more details, now that I have cancelled the verdi run
I got the following:
^C04/01/2021 03:32:38 PM <73540> paramiko.transport: [ERROR] Exception: Error reading SSH protocol banner
04/06/2021 03:26:22 PM <73540> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 884
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] Traceback (most recent call last):
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/paramiko/transport.py", line 2211, in _check_banner
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] buf = self.packetizer.readline(timeout)
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/paramiko/packet.py", line 380, in readline
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] buf += self._read_timeout(timeout)
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/paramiko/packet.py", line 622, in _read_timeout
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] raise socket.timeout()
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] socket.timeout
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR]
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] During handling of the above exception, another exception occurred:
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR]
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] Traceback (most recent call last):
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/paramiko/transport.py", line 2039, in run
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] self._check_banner()
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/paramiko/transport.py", line 2215, in _check_banner
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] raise SSHException(
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] paramiko.ssh_exception.SSHException: Error reading SSH protocol banner
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR]
04/01/2021 04:57:47 PM <73540> aiida.transport.SshTransport: [ERROR] Error connecting to 'tcopt13' through SSH: [SshTransport] No existing session, connect_args were: {'username': 'tiziano', 'port': 22, 'look_for_keys': True, 'timeout': 60, 'allow_agent': True, 'compress': True, 'gss_auth': False, 'gss_kex': False, 'gss_deleg_creds': False, 'gss_host': 'tcopt13', 'sock': <aiida.transports.util._DetachedProxyCommand object at 0x7f394b877730>}
04/06/2021 03:26:22 PM <73540> aiida.engine.transports: [ERROR] exception occurred while trying to open transport:
No existing session
04/06/2021 03:26:23 PM <73540> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
File "/home/tiziano/work/aiida/aiida_core/aiida/engine/transports.py", line 110, in request_transport
yield transport_request.future
File "/home/tiziano/work/aiida/aiida_core/aiida/engine/processes/calcjobs/manager.py", line 98, in _get_jobs_from_scheduler
transport = await request
File "/usr/lib/python3.9/asyncio/futures.py", line 284, in __await__
yield self # This tells Task to wait for completion.
File "/usr/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
future.result()
File "/usr/lib/python3.9/asyncio/futures.py", line 201, in result
raise self._exception
File "/home/tiziano/work/aiida/aiida_core/aiida/engine/transports.py", line 89, in do_open
transport.open()
File "/home/tiziano/work/aiida/aiida_core/aiida/transports/plugins/ssh.py", line 438, in open
self._client.connect(self._machine, **connection_arguments)
File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/paramiko/client.py", line 412, in connect
server_key = t.get_remote_server_key()
File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/paramiko/transport.py", line 834, in get_remote_server_key
raise SSHException("No existing session")
paramiko.ssh_exception.SSHException: No existing session
Traceback (most recent call last):
File "/home/tiziano/.local/share/virtualenvs/aiida/bin/verdi", line 12, in <module>
sys.exit(verdi())
File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/tiziano/work/aiida/aiida_core/aiida/cmdline/utils/decorators.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/home/tiziano/work/aiida/aiida_core/aiida/cmdline/commands/cmd_run.py", line 147, in run
exec(compile(handle.read(), scriptname, 'exec', dont_inherit=True), globals_dict) # yapf: disable # pylint: disable=exec-used
File "examples/workchains/example_test_workchain.py", line 156, in <module>
cli() # pylint: disable=no-value-for-parameter
File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "examples/workchains/example_test_workchain.py", line 152, in cli
example_test(code, mpiprocs_per_machine)
File "examples/workchains/example_test_workchain.py", line 136, in example_test
calc = run(builder)
File "/home/tiziano/work/aiida/aiida_core/aiida/engine/launch.py", line 41, in run
return runner.run(process, *args, **inputs)
File "/home/tiziano/work/aiida/aiida_core/aiida/engine/runners.py", line 266, in run
result, _ = self._run(process, *args, **inputs)
File "/home/tiziano/work/aiida/aiida_core/aiida/engine/runners.py", line 250, in _run
process_inited.execute()
File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/plumpy/processes.py", line 81, in func_wrapper
return func(self, *args, **kwargs)
File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/plumpy/processes.py", line 1166, in execute
return self.future().result()
plumpy.exceptions.KilledError: Process was killed because the runner received an interrupt
The culprit seems to be Paramiko, hanging indefinitely... yet it would already help a lot if there was a way to get more information from aiida on what is going on.
If it did check (probably didn't though because I should've gotten a key unlock popup dialog from the ssh agent), why did it conclude that the process is still running?
Do the keys that you have configured for this particular remote computer require a password or for you to unlock them occasionally through the agent as you say? Because that may very well be at the root of the problem. If the daemon cannot use the keys automatically to connect or it is through some session that can expire, then when it does expire, it can no longer connect. The only way of fixing it is to restart the daemon so it can refresh the session.
Then: information of a runner checking with the scheduler what the current status of a job is, is logged, however, the majority is done at either the INFO or DEBUG level. If you want to see them, use verdi config logging.aiida_loglevel INFO
to set them. Once you restart the daemon the changes will take effect. These log messages can be seen with verdi process report
where all log messages are displayed. The reason we do not log them by default is because for long running jobs, the scheduler state is updated very often and this information usually not needed and just bloats the database.
Alternatively, the result of the last query is stored in the attributes. So you can use verdi node attributes
to show them. Probably the keys last_job_info
and scheduler_lastchecktime
would be useful here.
@sphuber thanks a lot for the detailed analysis. Indeed I have an agent which after some time unloads the keys. And since I was running interactively/blocking via verdi run
it seems that the SSH connection was made in the verdi run
process, is that correct? Hence restarting the daemon did not help to get things running again.
What is also rather strange is that Paramiko was hanging in there forever and doesn't allow to setting a timeout for that part of the connection initialization.
Indeed, when using verdi run
everything happens in that Python interpreter and the daemon is not involved whatsoever. It seems that the agent unloading the keys is the problem as there is no automated mechanism within paramiko or aiida-core
to refresh the session. In this case that wouldn't even be possible anyway since you need to unlock the key through a passphrase.
Well, the agent unloading the keys may be cause, yes, but paramiko should definitely try to reconnect, at which point the agent should give me the "enter password" popup again. Or bailout completely (less optimal). And having the key not loaded even works at the beginning, it is only later where a key unload (which I think I have setup to happen automatically when going into sleep mode) triggers a hang. But anyway, this would be something for a different issue. This one is about not having any visible state (still have to try the suggestions you gave above) on what is going on.
I am currently trying different workchains on different hosts, some with a queuing system, some without. One problem I frequently see is that daemon processes wait or hang. Example:
The calculations finished within a day on the remote node (running without a queueing system), but the machine running AiiDA got disconnected and put into sleep in-between. I ran the workchain with
verdi run ...
. Pausing the process does not work, the spinner stops spinning after some time andverdi process pause
hangs. Restarting the daemon does work but doesn't change anything.verdi daemon logshow
doesn't show anything,What I can't figure out based on the messages shown in
status
orreport
is: what is it waiting for? When did it last check whether the process is still running on the remote host? If it did check (probably didn't though because I should've gotten a key unlock popup dialog from the ssh agent), why did it conclude that the process is still running? When is the next check? Can I force a re-check on the host and how?