aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

The daemon is running but does not pick up any process #6137

Closed unkcpz closed 8 months ago

unkcpz commented 1 year ago

I have calcjobs that have finished remotely, but they remain in the process list and the daemon does not advance any of them. I tried verdi devel rabbitmq tasks analyze --fix, which revived some processes, and it now reports "No inconsistencies detected between database and RabbitMQ".

I guess the daemon must be busy doing something, but I cannot see what, and the finished processes stay stuck in the running or waiting state forever. I also raised the log level to DEBUG, but no new information appears after the processes are loaded.

sphuber commented 1 year ago

Is there anything in verdi process report? Maybe there is a problem with the transport and it is cycling in the exponential backoff mechanism?
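
For readers unfamiliar with the term, an exponential backoff retries a failing operation and doubles the waiting time after every failure, giving up after a maximum number of attempts. The following is only a minimal sketch of that idea, not AiiDA's actual implementation: the function name `retry_with_backoff` and its parameters are hypothetical and chosen for illustration.

```python
import time


def retry_with_backoff(task, initial_interval=1.0, max_attempts=5, sleep=time.sleep):
    """Call ``task`` until it succeeds, doubling the wait after each failure.

    Hypothetical helper for illustration only. Returns the result of ``task``,
    or re-raises the last exception once ``max_attempts`` calls have failed.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: an engine would typically pause the process here.
                raise
            sleep(interval)
            interval *= 2  # wait twice as long before the next attempt
```

With the defaults above, a task that keeps failing is attempted five times, with waits of 1, 2, 4 and 8 seconds in between, before the last exception is re-raised. A process cycling in such a loop can look idle from the outside while it is only waiting for the next attempt.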

unkcpz commented 1 year ago

No, nothing there.

unkcpz commented 1 year ago

"I guess the daemon must be busy doing something"

I came to this conclusion because when I start 4 daemon workers and then stop them, two of them always time out while stopping.

sphuber commented 1 year ago

Can you stop the daemon, then run verdi devel rabbitmq tasks list and check that the pks of the processes that seem to be stuck are listed? Another thing you can do is run verdi config set logging.aiida_loglevel INFO and then verdi daemon worker. That should launch a single worker in the foreground, and you should see a message for each process that it starts running. Please check that the processes that seem stuck are logged there.
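
With many processes, checking the pks by eye gets tedious, so the comparison could be scripted. This is a rough sketch under an assumption that is not confirmed in this thread: that every integer in the output of verdi devel rabbitmq tasks list is a process pk. The helper name `missing_tasks` is hypothetical.

```python
import re


def missing_tasks(stuck_pks, tasks_list_output):
    """Return the stuck pks that do not appear in the task list output.

    Hypothetical helper. Assumes every integer found in ``tasks_list_output``
    is the pk of a process that still has a task in RabbitMQ.
    """
    queued = {int(token) for token in re.findall(r"\b\d+\b", tasks_list_output)}
    return sorted(set(stuck_pks) - queued)
```

The text to pass in could be captured with `subprocess.run(["verdi", "devel", "rabbitmq", "tasks", "list"], capture_output=True, text=True).stdout`. An empty result would mean that every stuck process still has a task in the queue.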

unkcpz commented 1 year ago

The processes are listed by verdi devel rabbitmq tasks list.

I changed the log level to INFO and ran verdi daemon worker. One hour passed and nothing showed up.

sphuber commented 1 year ago

That's really weird. How many processes are listed by verdi devel rabbitmq tasks list?

unkcpz commented 1 year ago
$ verdi devel rabbitmq tasks list | wc -l
206

Is this too many? The calcjobs froze about 2 days ago, around the time the CSCS token expired over the weekend. I expected that, when I came back this morning, some of the jobs running remotely would have been paused after hitting the maximum number of exponential backoff attempts. However, the stuck processes are not all remote calcjobs: calcjobs that run locally, which have no SSH connection problems, are stuck as well.

sphuber commented 8 months ago

Closing this for now, since it is unlikely that you can reproduce this exact problem.