aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

Instance <DbAuthInfo> is not bound to a Session error #6024

Closed unkcpz closed 1 year ago

unkcpz commented 1 year ago

Describe the bug

When there are more than 500 calcjobs in the process list, some processes quickly run into the exception below, and verdi process play -a does not help.

+-> ERROR at 2023-05-17 00:08:14.484628+02:00
 | Traceback (most recent call last):
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
 |     result = await coro()
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 192, in do_update
 |     with job_manager.request_job_info_update(authinfo, job_id) as update_request:
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/contextlib.py", line 119, in __enter__
 |     return next(self.gen)
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 286, in request_job_info_update
 |     with self.get_jobs_list(authinfo).request_job_info_update(job_id) as request:
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/contextlib.py", line 119, in __enter__
 |     return next(self.gen)
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 167, in request_job_info_update
 |     self._ensure_updating()
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 195, in _ensure_updating
 |     self._get_next_update_delay(),
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 230, in _get_next_update_delay
 |     minimum_interval = self.get_minimum_update_interval()
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 79, in get_minimum_update_interval
 |     return self._authinfo.computer.get_minimum_job_poll_interval()
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/aiida/orm/authinfos.py", line 87, in computer
 |     return entities.from_backend_entity(computers.Computer, self._backend_entity.computer)
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/aiida/storage/psql_dos/orm/authinfos.py", line 74, in computer
 |     return self.backend.computers.ENTITY_CLASS.from_dbmodel(self.model.dbcomputer, self.backend)
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/aiida/storage/psql_dos/orm/utils.py", line 84, in __getattr__
 |     if self.is_saved() and self._is_mutable_model_field(item) and not self._in_transaction():
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/aiida/storage/psql_dos/orm/utils.py", line 110, in is_saved
 |     return self._model.id is not None
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 482, in __get__
 |     return self.impl.get(state, dict_)
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 942, in get
 |     value = self._fire_loader_callables(state, key, passive)
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 973, in _fire_loader_callables
 |     return state._load_expired(state, passive)
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/sqlalchemy/orm/state.py", line 712, in _load_expired
 |     self.manager.expired_attribute_loader(self, toload, passive)
 |   File "/home/jyu/micromamba/envs/aiida-sssp-unstable/lib/python3.9/site-packages/sqlalchemy/orm/loading.py", line 1369, in load_scalar_attributes
 |     raise orm_exc.DetachedInstanceError(
 | sqlalchemy.orm.exc.DetachedInstanceError: Instance <DbAuthInfo at 0x7f3982390640> is not bound to a Session; attribute refresh operation cannot proceed (Background on this error at: https://sqlalche.me/e/14/bhk3)
+-> WARNING at 2023-05-17 00:08:14.491510+02:00
 | maximum attempts 5 of calling do_update, exceeded

Steps to reproduce

Steps to reproduce the behavior:

This only happens when I submit 40 of my pseudopotential workchains, each of which spawns 100 small pw.x calculations. It is therefore not easy to reproduce from scratch. Interestingly, since my process list already contains many processes that were paused after the maximum of 5 attempts, I can now reproduce it by submitting just 10 of my workchains.

Expected behavior

Your environment

Other relevant software versions, e.g. Postgres & RabbitMQ

Additional context

unkcpz commented 1 year ago

I just found that I encountered this before in https://github.com/aiidateam/aiida-core/issues/4596, and it was also reported by @sphuber in https://github.com/aiidateam/aiida-core/issues/1292

EDIT: According to what I reported in https://github.com/aiidateam/aiida-core/issues/4596, I need to restart not only the daemon but also the DB services. In any case, this is a very annoying issue that prevents me from running "real" high-throughput calculations. I have to use a submission control script to make sure that no more than 10 workchains run at the same time.
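For reference, the kind of submission control mentioned above can be sketched as follows. This is only a minimal illustration, not the actual script: submit_one and count_active are hypothetical callables supplied by the caller. In a real setup they might wrap aiida.engine.submit and a query for the number of unfinished workchains, but that wiring is an assumption and is not shown here.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def submit_throttled(
    items: Iterable[T],
    submit_one: Callable[[T], R],      # hypothetical: submits one workchain
    count_active: Callable[[], int],   # hypothetical: number of unfinished workchains
    max_active: int = 10,
    poll_interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Submit items one by one, never exceeding max_active running at once."""
    submitted = []
    for item in items:
        # Block until at least one slot is free before submitting the next item.
        while count_active() >= max_active:
            sleep(poll_interval)
        submitted.append(submit_one(item))
    return submitted
```

The sleep function is injectable only so that the loop can be exercised without real waiting. The default limit of 10 matches the number I quoted above.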

sphuber commented 1 year ago

I am pretty sure you only need to reset the daemon, not the DB service. But I agree, this needs to be fixed. Let's continue the discussion in the other issue.

unkcpz commented 1 year ago

The problem is that I ran verdi process play -a (after restarting the daemon and the DB service) and all paused processes restarted, but they threw the same error after running for a while.

sphuber commented 1 year ago

Did you run verdi daemon restart --reset? Also, can you make sure that you don't have any "rogue" daemon processes running? Stop the daemon, then run ps aux | grep verdi and make sure no daemon workers are running. If there are, they might still be picking up the jobs, and if they have the inconsistent session, they will produce the same error again.
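If it helps, the check above could be scripted roughly like this. It is a minimal sketch that assumes a Unix-like system where ps aux is available. The function names are made up for illustration, and matching on the substring "verdi" is a simple heuristic rather than anything AiiDA provides.

```python
import subprocess
from typing import List


def find_verdi_processes(ps_output: str) -> List[str]:
    """Return the lines of ps output that mention verdi, ignoring grep itself."""
    matches = []
    for line in ps_output.splitlines():
        if "verdi" in line and "grep" not in line:
            matches.append(line.strip())
    return matches


def list_running_verdi_processes() -> List[str]:
    """Run ps aux and filter its output (assumes a Unix-like system)."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return find_verdi_processes(result.stdout)
```

After stopping the daemon, list_running_verdi_processes() should ideally return an empty list. Any remaining entries are candidates for leftover workers and are worth inspecting by hand before restarting.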

unkcpz commented 1 year ago

@sphuber, I encountered it again and restarted the daemon cleanly. All the processes are back and working fine. Thanks! I guess you are correct: I probably hadn't made sure the daemon was fully restarted.