Closed giovannipizzi closed 10 months ago
This is caused by the default user variable set on the UserCollection which, if not reset when the session/storage is closed, will then point to a detached SqlaUser model.
This is fixed by: https://github.com/aiidateam/aiida-core/blob/b57e1b66a8eb91b75cd0c2a8416e56874e28f814/aiida/storage/psql_dos/backend.py#L112-L117
(Note: you should never directly close the SQLA session of a PsqlDosBackend instance; always go through PsqlDosBackend.close.)
Thanks @chrisjsewell! Just to make sure I understood: is this already fixed in develop, or not yet? Also, thanks for the comment on not closing the session manually, but I don't think I was doing that "as a user"; it was probably some part of AiiDA doing it?
this is already fixed now in develop
It should be, yes; there is now nowhere that directly closes the sqlalchemy session, except for the actual PsqlDosBackend instance (when it is closed)
(well apart from in the REST API, but that's another matter)
Pre v2, it was interfacing with the session all over the place (see the diagrams in #5330), so I don't know exactly where it would have been expunged (which is what happens when it is closed)
But yeah, obviously we can re-test with aiida v2 to check for sure that this is no longer occurring
Closing this for now. Feel free to reopen if you encounter it with v2.0
Unfortunately, I encountered this again when launching > 400 quick calcjobs locally. Here is the traceback. Let me know how I can debug it further.
+-> ERROR at 2022-05-11 11:45:25.792422+02:00
| Traceback (most recent call last):
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
| result = await coro()
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 189, in do_update
| with job_manager.request_job_info_update(authinfo, job_id) as update_request:
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/contextlib.py", line 119, in __enter__
| return next(self.gen)
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 286, in request_job_info_update
| with self.get_jobs_list(authinfo).request_job_info_update(job_id) as request:
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/contextlib.py", line 119, in __enter__
| return next(self.gen)
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 167, in request_job_info_update
| self._ensure_updating()
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 195, in _ensure_updating
| self._get_next_update_delay(),
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 230, in _get_next_update_delay
| minimum_interval = self.get_minimum_update_interval()
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 79, in get_minimum_update_interval
| return self._authinfo.computer.get_minimum_job_poll_interval()
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/orm/authinfos.py", line 87, in computer
| return computers.Computer.from_backend_entity(self._backend_entity.computer)
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/storage/psql_dos/orm/authinfos.py", line 74, in computer
| return self.backend.computers.ENTITY_CLASS.from_dbmodel(self.model.dbcomputer, self.backend)
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/storage/psql_dos/orm/utils.py", line 84, in __getattr__
| if self.is_saved() and self._is_mutable_model_field(item) and not self._in_transaction():
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/storage/psql_dos/orm/utils.py", line 110, in is_saved
| return self._model.id is not None
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 481, in __get__
| return self.impl.get(state, dict_)
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 941, in get
| value = self._fire_loader_callables(state, key, passive)
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 972, in _fire_loader_callables
| return state._load_expired(state, passive)
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/sqlalchemy/orm/state.py", line 710, in _load_expired
| self.manager.expired_attribute_loader(self, toload, passive)
| File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/sqlalchemy/orm/loading.py", line 1369, in load_scalar_attributes
| raise orm_exc.DetachedInstanceError(
| sqlalchemy.orm.exc.DetachedInstanceError: Instance <DbAuthInfo at 0x7f75926d1a00> is not bound to a Session; attribute refresh operation cannot proceed (Background on this error at: https://sqlalche.me/e/14/bhk3)
I restarted the daemon and the Postgres backend server, and the issue no longer seems to show up.
I am not sure this is due to the User model being detached, because it is the DbAuthInfo that pops up in the error message. When this happens, whatever the reason, it is because the SQLAlchemy session is in an inconsistent state. The Python interpreter is still holding on to ORM instances that reference a database model that is no longer in the session.
The only thing that should help to remedy this is to restart the daemon, as restarting the daemon workers will recreate the session and it should be in a consistent state again.
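To make the "detached" state concrete, here is a minimal toy sketch in plain Python. It is not SQLAlchemy code: ToySession, ToyModel and DetachedError are hypothetical stand-ins that only mimic the behaviour described above, namely that an object with expired attributes can no longer refresh them once its session is gone, while a freshly created session works again.

```python
class DetachedError(RuntimeError):
    """Toy stand-in for sqlalchemy.orm.exc.DetachedInstanceError."""


class ToySession:
    """Hypothetical session: the only way for a model to reach the database."""

    def __init__(self, database):
        self.database = database  # maps primary key -> dict of column values
        self.closed = False

    def load(self, pk):
        return ToyModel(self, pk)

    def close(self):
        self.closed = True


class ToyModel:
    """Hypothetical model that loads column values lazily through its session."""

    def __init__(self, session, pk):
        self._session = session
        self._pk = pk
        self._loaded = {}

    def expire(self):
        # Drop cached values, so the next access has to hit the database again.
        self._loaded.clear()

    def __getattr__(self, name):
        # Only reached for names that are not regular attributes, i.e. columns.
        if name.startswith('_'):
            raise AttributeError(name)
        if name not in self._loaded:
            if self._session.closed:
                raise DetachedError(f'cannot refresh {name!r}: session is closed')
            self._loaded[name] = self._session.database[self._pk][name]
        return self._loaded[name]


database = {1: {'hostname': 'localhost'}}

old_session = ToySession(database)
authinfo = old_session.load(1)
assert authinfo.hostname == 'localhost'  # loaded while the session is open

authinfo.expire()    # values are dropped, e.g. after a commit
old_session.close()  # the Python object survives, its session does not

try:
    authinfo.hostname
except DetachedError as exc:
    error = str(exc)

# "Restarting the worker": a fresh session and a freshly loaded object work again.
fresh = ToySession(database).load(1)
```

The point of the sketch is only that the failure is tied to the old Python object, not to the database row itself.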
Now, as for why the session gets to this state, I am not sure. There used to be a similar bug related to the User model, as mentioned before in this thread. There, the User instance of the default user would be set on the Collection in memory. This wasn't cleared properly when the session got closed, and so when reopened, the same old instance would be used, but its database model was no longer attached to the new session. This problem has been solved by explicitly unsetting this default user in memory when the storage is closed.
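As a rough illustration of that earlier bug and its fix, here is a toy sketch. ToyBackend and ToyUser are hypothetical names, and the reset_cache flag exists only to show both behaviours side by side; this is not the actual aiida-core implementation.

```python
class ToyUser:
    """Hypothetical user object that remembers which session loaded it."""

    def __init__(self, email, session_id):
        self.email = email
        self.session_id = session_id


class ToyBackend:
    """Hypothetical backend that caches the default user in memory."""

    def __init__(self):
        self.session_id = 1
        self._default_user = None

    def get_default_user(self):
        # Load once, then keep returning the cached instance.
        if self._default_user is None:
            self._default_user = ToyUser('user@example.com', self.session_id)
        return self._default_user

    def close(self, reset_cache=True):
        # The fix described above: forget the cached instance on close.
        if reset_cache:
            self._default_user = None
        self.session_id += 1  # anything loaded from now on uses a new session

    def is_attached(self, user):
        return user.session_id == self.session_id


# Old behaviour: the cache survives the close and points to the old session.
buggy = ToyBackend()
stale = buggy.get_default_user()
buggy.close(reset_cache=False)
stale_is_reused = buggy.get_default_user() is stale
stale_is_attached = buggy.is_attached(buggy.get_default_user())

# Fixed behaviour: the cache is cleared, so the user is loaded again.
fixed = ToyBackend()
fixed.get_default_user()
fixed.close()
fixed_is_attached = fixed.is_attached(fixed.get_default_user())
```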
Here it seems to be about the AuthInfo though. Both in this report and in the duplicate #6024, the exception comes from the JobManager.request_job_info_update method that is called by the CalcJob in any of its tasks, for example task_update_job. This manager keeps a mapping of JobsList instances, one for each AuthInfo it manages. The error occurs when the JobsList calls get_minimum_update_interval, at which point it accesses the computer attribute of the AuthInfo. This raises the exception, since it needs to access the database at that point.
The behavior could be explained if the storage was closed and reopened during the life of a daemon worker: if the same Runner instance is kept, it still holds a reference to the JobManager, which still has the old _job_lists mapping, where each JobsList still holds the original AuthInfo instance. The only problem with this theory is that the daemon worker should never close the storage during its lifetime, so I don't see how this could happen.
Still, I think it probably has something to do with the JobManager keeping this mapping of JobsList instances that each hold on to an AuthInfo reference.
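The suspected reference chain can be sketched with toy classes. The Toy* names below are hypothetical simplifications of Runner, JobManager, JobsList and AuthInfo, and keying the mapping by pk is an assumption made for the sketch; the real classes do much more.

```python
class ToyAuthInfo:
    """Hypothetical AuthInfo that remembers which session loaded it."""

    def __init__(self, pk, session_id):
        self.pk = pk
        self.session_id = session_id


class ToyJobsList:
    def __init__(self, authinfo):
        self.authinfo = authinfo  # held for the lifetime of the list


class ToyJobManager:
    def __init__(self):
        self._job_lists = {}  # authinfo pk -> ToyJobsList

    def get_jobs_list(self, authinfo):
        # Create the list once per AuthInfo pk, then keep returning the same one.
        if authinfo.pk not in self._job_lists:
            self._job_lists[authinfo.pk] = ToyJobsList(authinfo)
        return self._job_lists[authinfo.pk]


class ToyRunner:
    def __init__(self):
        self.job_manager = ToyJobManager()


runner = ToyRunner()

# At worker startup, the AuthInfo is loaded in the first session.
first = ToyAuthInfo(pk=1, session_id=1)
runner.job_manager.get_jobs_list(first)

# Hypothetically the storage is reopened: the same row is loaded in session 2.
second = ToyAuthInfo(pk=1, session_id=2)
jobs_list = runner.job_manager.get_jobs_list(second)

# The runner still reaches the instance from the first session.
holds_stale_instance = jobs_list.authinfo is first
```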
@unkcpz could you please try this branch on your environment: https://github.com/sphuber/aiida-core/tree/fix/4596/db-authinfo-detached
It does not rely on the AuthInfo used when the JobsList gets constructed at startup of the daemon worker, but rather uses the AuthInfo that is actually used by the CalcJob when it calls request_job_info_update. Hopefully that instance is still attached to the session, so it should circumvent the old one by overwriting it.
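The idea behind the branch can be sketched roughly as follows. This is a toy illustration with hypothetical Toy* classes, not the code on the branch: the real request_job_info_update is a context manager and does considerably more, and job_id is unused here.

```python
class ToyAuthInfo:
    """Hypothetical AuthInfo that remembers which session loaded it."""

    def __init__(self, pk, session_id):
        self.pk = pk
        self.session_id = session_id


class ToyJobsList:
    def __init__(self, authinfo):
        self.authinfo = authinfo


class ToyJobManager:
    def __init__(self):
        self._job_lists = {}  # authinfo pk -> ToyJobsList

    def request_job_info_update(self, authinfo, job_id):
        # job_id is accepted only to mirror the call described above.
        jobs_list = self._job_lists.setdefault(authinfo.pk, ToyJobsList(authinfo))
        # Overwrite whatever instance was stored earlier with the caller's one.
        jobs_list.authinfo = authinfo
        return jobs_list


manager = ToyJobManager()
manager.request_job_info_update(ToyAuthInfo(pk=1, session_id=1), job_id=10)

# A later call passes an instance loaded in a newer session.
current = ToyAuthInfo(pk=1, session_id=2)
jobs_list = manager.request_job_info_update(current, job_id=11)
uses_current_instance = jobs_list.authinfo is current
```

Note that this only sidesteps the stale reference; it does not explain why the original instance became detached in the first place.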
@sphuber Sorry for the late reply, I didn't notice your message. I cannot reproduce the issue at the moment, but once it appears again, I'll check out your branch and try it. I am using this plugin this week and next, so there is a chance I may encounter the issue again.
Thanks, I will rebase the branch so it is up to date with main.
Describe the bug
Roughly half of my calculations are failing on a SQLA backend:
I'm quite sure this was a production AiiDA 1.4.2 environment where I hadn't changed anything and things were working fine until a few weeks ago. When running yesterday, many calculations were paused after 5 consecutive errors with the error above. I decided to stop the daemon, reinstall AiiDA 1.5.0 and replay them, but they fail again with the same error.
Any idea of what could be causing this?
@sphuber @CasperWA @chrisjsewell