Open dmitry-starodubov opened 2 months ago
@asfernandes , any chance to look at this issue sooner rather than later?
A test case would make easy to identify the problem and would be fundamental to validate a fix.
How can both threads of the same attachment be active in the engine at the same time?
Shouldn't one be blocked in the attachment lock of EngineContextHolder
?
The 2nd thread handle AST and uses other attachment's context, then it calls MET_index_pagespace() that call JRD_reschedule() that checks out from engine, releasing attachment's lock and allowing 1st thread to access attachment which internal structure (inactiveStatementList) is in "interesting", not consistent state.
The most suspicious as for me is the call of MET_index_pagespace() (that run some query) from Statement::release().
The most suspicious as for me is the call of MET_index_pagespace() (that run some query) from Statement::release().
And AFAIK, it's not valid to do it from AST handler.
Even without query execution, AST may call LCK_release/LCK_convert
that also checks out from the engine and thus may potentially cause races. We simply cannot guarantee that AST cannot interfere with the main attachment, it's just about higher or lower probability. If the current code is fragile about that, I'd say this deserves fixing anyway.
Lock manager checks out from the engine in wait_for_request
, blocking_action
and shutdownOwner
.
The only concern here is wait_for_request
- as blocking_action
is called by wait_for_request
and shutdownOwner
is called when attachment is released.
LCK_release
doesn't call wait_for_request
thus its never checks out from the engine.
LCK_convert
usually should not wait when converts to the lower lock level, also it not wait when called with LCK_NO_WAIT.
It is easy to control, IMO.
Am I wrong ?
And I still sure that AST routine should avoid calling expensive routines, such as query execution. Didn't check if Firebird already have such code, though ;)
Seems you're right, I was more pessimistic about the LM calls than necessary ;-)
And I still sure that AST routine should avoid calling expensive routines, such as query execution.
I'm not arguing with that, just thought it was not the only possible reason for a checkout.
We have several crashes of 5.0 with similar symptoms (and stacks). The stacks of the two threads are attached. In the first one, DsqlStatementCache::shrink() is called during statement release. It removes the statement from the inactiveStatementList. In the second thread transaction is committed (from the same attachment). This calls DsqlStatementCache::purgeAllAttachments(), which via AST calls DsqlStatementCache::purge(), which clears the inactiveStatementList too. As a result, both threads have DsqlStatementCache::StatementEntry::~StatementEntry() in their stacks at the same time. There is a call of MET_index_pagespace (part of the TableSpaces feature), which is not present in Firebird. I suspect it works here as a catalyst, because it increases the execution time of DsqlStatementCache::purge and increases the chances of a race in the list.