unkcpz closed this 1 year ago
I just found that I encountered this before in https://github.com/aiidateam/aiida-core/issues/4596, and it was also reported by @sphuber in https://github.com/aiidateam/aiida-core/issues/1292
EDIT: According to what I reported in https://github.com/aiidateam/aiida-core/issues/4596, I need to restart not only the daemon but also the DB services. Anyway, it is a very annoying issue that prevents me from running "real" high-throughput calculations. I have to use a submission control script to make sure no more than 10 workchains run at the same time.
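For anyone needing the same workaround, here is a minimal sketch of what such a submission control loop could look like. It is not the script I actually use. The names `submit_throttled`, `count_active`, and `submit_one` are hypothetical. In a real setup `count_active` would presumably be backed by a query for unfinished workchains and `submit_one` by the engine's submit call; here they are plain callables, so the throttling logic stands on its own.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def submit_throttled(
    pending: Iterable[T],
    count_active: Callable[[], int],
    submit_one: Callable[[T], R],
    max_active: int = 10,
    poll_interval: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Submit items one by one, keeping at most `max_active` active at a time.

    `count_active` and `submit_one` are hypothetical hooks supplied by the
    caller; this function only implements the waiting logic.
    """
    submitted: List[R] = []
    for item in pending:
        # Wait until a slot is free before submitting the next item.
        while count_active() >= max_active:
            sleep(poll_interval)
        submitted.append(submit_one(item))
    return submitted
```

The loop polls instead of reacting to events, which is crude but keeps the script independent of the daemon. It also assumes that `count_active` already includes an item as soon as `submit_one` returns.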
I am pretty sure you only need to reset the daemon, not the DB service. But I agree, this needs to be fixed. Let's continue the discussion in the other issue.
The problem is that I run `verdi process play -a` (after restarting the daemon and DB service) and all paused processes restart, but they throw the same error after running for a while.
Did you run `verdi daemon restart --reset`? Also, can you make sure that you don't have any "rogue" daemon processes running? Stop the daemon, then run `ps aux | grep verdi` and make sure there are no daemon workers running. If there are, they might still be picking up jobs, and if they have the inconsistent session, they will produce the same error again.
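If you want to script that check, here is a rough Python sketch of the same `ps aux | grep verdi` idea. The function names are made up for illustration, and it simply matches the string "verdi" in the process list, so treat any hit as something to inspect by hand, not as proof of a rogue worker.

```python
import subprocess
from typing import List


def find_verdi_processes(ps_output: str) -> List[str]:
    """Return the lines of `ps aux` output that mention verdi.

    Lines belonging to a `grep` invocation are skipped, mirroring the usual
    manual habit of ignoring the grep process itself.
    """
    matches = []
    for line in ps_output.splitlines():
        if "verdi" in line and "grep" not in line:
            matches.append(line.strip())
    return matches


def list_running_verdi_processes() -> List[str]:
    """Run `ps aux` and return the lines that mention verdi."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return find_verdi_processes(result.stdout)
```

After stopping the daemon, `list_running_verdi_processes()` should ideally return an empty list. Anything left over is a candidate for a worker that did not shut down.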
@sphuber, I encountered it again and restarted the daemon cleanly; all the processes are back and working fine. Thanks! I guess you are correct that I had not made sure the daemon was fully restarted.
Describe the bug
When there are more than 500 calcjobs in the process list, some processes quickly run into the exceptions below. Running `verdi process play -a` does not help.
Steps to reproduce
Steps to reproduce the behavior:
This only happened when I submitted 40 of my pseudopotential workchains, each of which spawns 100 small pw.x calculations, so it is not easy to reproduce from scratch. Interestingly, since my process list contains many processes that are in the paused state after the maximum of 5 attempts, I can reproduce it by submitting 10 of my workchains.
Expected behavior
Your environment
Other relevant software versions, e.g. Postgres & RabbitMQ
Additional context