aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
435 stars 188 forks source link

verdi process list/status: be more verbose #4847

Open dev-zero opened 3 years ago

dev-zero commented 3 years ago

I am currently trying different workchains on different hosts, some with a queuing system, some without. One problem I frequently see is that daemon processes wait or hang. Example:

$ verdi process list
  PK  Created    Process label      Process State    Process status
----  ---------  -----------------  ---------------  ----------------------------------------------------
 884  4D ago     Cp2kTestWorkChain   ⏵ Waiting        Waiting for child processes: 888, 892, 896, 900, 904
 888  4D ago     Cp2kBaseWorkChain  ⏵ Waiting        Waiting for child processes: 905
 892  4D ago     Cp2kBaseWorkChain  ⏵ Waiting        Waiting for child processes: 906
 896  4D ago     Cp2kBaseWorkChain  ⏵ Waiting        Waiting for child processes: 907
 900  4D ago     Cp2kBaseWorkChain  ⏵ Waiting        Waiting for child processes: 908
 904  4D ago     Cp2kBaseWorkChain  ⏵ Waiting        Waiting for child processes: 909
 905  4D ago     Cp2kCalculation    ⏵ Waiting        Monitoring scheduler: job state RUNNING
 906  4D ago     Cp2kCalculation    ⏵ Waiting        Monitoring scheduler: job state RUNNING
 907  4D ago     Cp2kCalculation    ⏵ Waiting        Monitoring scheduler: job state RUNNING
 908  4D ago     Cp2kCalculation    ⏵ Waiting        Monitoring scheduler: job state RUNNING
 909  4D ago     Cp2kCalculation    ⏵ Waiting        Monitoring scheduler: job state RUNNING

Total results: 11

Info: last time an entry changed state: 4D ago (at 12:16:45 on 2021-04-01)

$ verdi process status 888
Cp2kBaseWorkChain<888> Waiting [1:while_(should_run_process)]
    └── Cp2kCalculation<905> Waiting

$ verdi process status 905
Cp2kCalculation<905> Waiting

$ verdi process report 905
*** 905: CalcJobState.WITHSCHEDULER, scheduler state: JobState.RUNNING
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 0 LOG MESSAGES

$ verdi process report 888
2021-04-01 14:15:26 [465 | REPORT]: [888|Cp2kBaseWorkChain|run_process]: launching Cp2kCalculation<905> iteration #1

The calculations finished within a day on the remote node (running without a queueing system), but the machine running AiiDA got disconnected and put into sleep in-between. I ran the workchain with verdi run .... Pausing the process does not work, the spinner stops spinning after some time and verdi process pause hangs. Restarting the daemon does work but doesn't change anything. verdi daemon logshow doesn't show anything,

What I can't figure out based on the messages shown in status or report is: what is it waiting for? When did it last check whether the process is still running on the remote host? If it did check (probably didn't though because I should've gotten a key unlock popup dialog from the ssh agent), why did it conclude that the process is still running? When is the next check? Can I force a re-check on the host and how?

chrisjsewell commented 3 years ago

Can you provide what aiida-core version you are running on

dev-zero commented 3 years ago

v1.5.2-128-g5b3e555c4 (develop), but also a git pull --rebase && pip install -e . && verdi daemon restart did nothing. So, now it is: v1.5.2-136-gdc9ec6f91. Is there a reason why an un-annotated tag is used for v1.6.0+? Note that git describe will by default use only annotated tags.

dev-zero commented 3 years ago

maybe some more details, now that I have cancelled the verdi run I got the following:

^C04/01/2021 03:32:38 PM <73540> paramiko.transport: [ERROR] Exception: Error reading SSH protocol banner
04/06/2021 03:26:22 PM <73540> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 884
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] Traceback (most recent call last):
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR]   File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/paramiko/transport.py", line 2211, in _check_banner
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR]     buf = self.packetizer.readline(timeout)
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR]   File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/paramiko/packet.py", line 380, in readline
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR]     buf += self._read_timeout(timeout)
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR]   File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/paramiko/packet.py", line 622, in _read_timeout
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR]     raise socket.timeout()
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] socket.timeout
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] 
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] During handling of the above exception, another exception occurred:
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] 
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] Traceback (most recent call last):
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR]   File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/paramiko/transport.py", line 2039, in run
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR]     self._check_banner()
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR]   File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/paramiko/transport.py", line 2215, in _check_banner
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR]     raise SSHException(
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] paramiko.ssh_exception.SSHException: Error reading SSH protocol banner
04/06/2021 03:26:22 PM <73540> paramiko.transport: [ERROR] 
04/01/2021 04:57:47 PM <73540> aiida.transport.SshTransport: [ERROR] Error connecting to 'tcopt13' through SSH: [SshTransport] No existing session, connect_args were: {'username': 'tiziano', 'port': 22, 'look_for_keys': True, 'timeout': 60, 'allow_agent': True, 'compress': True, 'gss_auth': False, 'gss_kex': False, 'gss_deleg_creds': False, 'gss_host': 'tcopt13', 'sock': <aiida.transports.util._DetachedProxyCommand object at 0x7f394b877730>}
04/06/2021 03:26:22 PM <73540> aiida.engine.transports: [ERROR] exception occurred while trying to open transport:
 No existing session
04/06/2021 03:26:23 PM <73540> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/home/tiziano/work/aiida/aiida_core/aiida/engine/transports.py", line 110, in request_transport
    yield transport_request.future
  File "/home/tiziano/work/aiida/aiida_core/aiida/engine/processes/calcjobs/manager.py", line 98, in _get_jobs_from_scheduler
    transport = await request
  File "/usr/lib/python3.9/asyncio/futures.py", line 284, in __await__
    yield self  # This tells Task to wait for completion.
  File "/usr/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
    future.result()
  File "/usr/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/home/tiziano/work/aiida/aiida_core/aiida/engine/transports.py", line 89, in do_open
    transport.open()
  File "/home/tiziano/work/aiida/aiida_core/aiida/transports/plugins/ssh.py", line 438, in open
    self._client.connect(self._machine, **connection_arguments)
  File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/paramiko/client.py", line 412, in connect
    server_key = t.get_remote_server_key()
  File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/paramiko/transport.py", line 834, in get_remote_server_key
    raise SSHException("No existing session")
paramiko.ssh_exception.SSHException: No existing session

Traceback (most recent call last):
  File "/home/tiziano/.local/share/virtualenvs/aiida/bin/verdi", line 12, in <module>
    sys.exit(verdi())
  File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/tiziano/work/aiida/aiida_core/aiida/cmdline/utils/decorators.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/tiziano/work/aiida/aiida_core/aiida/cmdline/commands/cmd_run.py", line 147, in run
    exec(compile(handle.read(), scriptname, 'exec', dont_inherit=True), globals_dict)  # yapf: disable # pylint: disable=exec-used
  File "examples/workchains/example_test_workchain.py", line 156, in <module>
    cli()  # pylint: disable=no-value-for-parameter
  File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "examples/workchains/example_test_workchain.py", line 152, in cli
    example_test(code, mpiprocs_per_machine)
  File "examples/workchains/example_test_workchain.py", line 136, in example_test
    calc = run(builder)
  File "/home/tiziano/work/aiida/aiida_core/aiida/engine/launch.py", line 41, in run
    return runner.run(process, *args, **inputs)
  File "/home/tiziano/work/aiida/aiida_core/aiida/engine/runners.py", line 266, in run
    result, _ = self._run(process, *args, **inputs)
  File "/home/tiziano/work/aiida/aiida_core/aiida/engine/runners.py", line 250, in _run
    process_inited.execute()
  File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/plumpy/processes.py", line 81, in func_wrapper
    return func(self, *args, **kwargs)
  File "/home/tiziano/.local/share/virtualenvs/aiida/lib/python3.9/site-packages/plumpy/processes.py", line 1166, in execute
    return self.future().result()
plumpy.exceptions.KilledError: Process was killed because the runner received an interrupt

The culprit seems to be Paramiko, hanging indefinitely... yet it would already help a lot if there was a way to get more information from aiida on what is going on.

sphuber commented 3 years ago

If it did check (probably didn't though because I should've gotten a key unlock popup dialog from the ssh agent), why did it conclude that the process is still running?

Do the keys that you have configured for this particular remote computer require a password or for you to unlock them occasionally through the agent as you say? Because that may very well be at the root of the problem. If the daemon cannot use the keys automatically to connect or it is through some session that can expire, then when it does expire, it can no longer connect. The only way of fixing it is to restart the daemon so it can refresh the session.

Then: information of a runner checking with the scheduler what the current status of a job is, is logged, however, the majority is done at either the INFO or DEBUG level. If you want to see them, use verdi config logging.aiida_loglevel INFO to set them. Once you restart the daemon the changes will take effect. These log messages can be seen with verdi process report where all log messages are displayed. The reason we do not log them by default is because for long running jobs, the scheduler state is updated very often and this information usually not needed and just bloats the database.

Alternatively, the result of the last query is stored in the attributes. So you can use verdi node attributes to show them. Probably the keys last_job_info and scheduler_lastchecktime would be useful here.

dev-zero commented 3 years ago

@sphuber thanks a lot for the detailed analysis. Indeed I have an agent which after some time unloads the keys. And since I was running interactively/blocking via verdi run it seems that the SSH connection was made in the verdi run process, is that correct? Hence restarting the daemon did not help to get things running again. What is also rather strange is that Paramiko was hanging in there forever and doesn't allow to setting a timeout for that part of the connection initialization.

sphuber commented 3 years ago

Indeed, when using verdi run everything happens in that Python interpreter and the daemon is not involved whatsoever. It seems that the agent unloading the keys is the problem as there is no automated mechanism within paramiko or aiida-core to refresh the session. In this case that wouldn't even be possible anyway since you need to unlock the key through a passphrase.

dev-zero commented 3 years ago

Well, the agent unloading the keys may be cause, yes, but paramiko should definitely try to reconnect, at which point the agent should give me the "enter password" popup again. Or bailout completely (less optimal). And having the key not loaded even works at the beginning, it is only later where a key unload (which I think I have setup to happen automatically when going into sleep mode) triggers a hang. But anyway, this would be something for a different issue. This one is about not having any visible state (still have to try the suggestions you gave above) on what is going on.