FirebirdSQL / firebird

Firebird server, client and tools
https://firebirdsql.org
1.26k stars 217 forks source link

Сrash when deleting a statement from the cache #8245

Open dmitry-starodubov opened 2 months ago

dmitry-starodubov commented 2 months ago

We have several crashes of 5.0 with similar symptoms (and stacks). The stacks of the two threads are attached. In the first one, DsqlStatementCache::shrink() is called during statement release. It removes the statement from the inactiveStatementList. In the second thread transaction is committed (from the same attachment). This calls DsqlStatementCache::purgeAllAttachments(), which via AST calls DsqlStatementCache::purge(), which clears the inactiveStatementList too. As a result, both threads have DsqlStatementCache::StatementEntry::~StatementEntry() in their stacks at the same time. There is a call of MET_index_pagespace (part of the TableSpaces feature), which is not present in Firebird. I suspect it works here as a catalyst, because it increases the execution time of DsqlStatementCache::purge and increases the chances of a race in the list.

dyemanov commented 1 month ago

@asfernandes , any chance to look at this issue sooner rather than later?

asfernandes commented 1 month ago

A test case would make easy to identify the problem and would be fundamental to validate a fix.

asfernandes commented 1 month ago

How can both threads of the same attachment be active in the engine at the same time? Shouldn't one be blocked in the attachment lock of EngineContextHolder?

hvlad commented 1 month ago

The 2nd thread handle AST and uses other attachment's context, then it calls MET_index_pagespace() that call JRD_reschedule() that checks out from engine, releasing attachment's lock and allowing 1st thread to access attachment which internal structure (inactiveStatementList) is in "interesting", not consistent state.

The most suspicious as for me is the call of MET_index_pagespace() (that run some query) from Statement::release().

asfernandes commented 1 month ago

The most suspicious as for me is the call of MET_index_pagespace() (that run some query) from Statement::release().

And AFAIK, it's not valid to do it from AST handler.

dyemanov commented 1 month ago

Even without query execution, AST may call LCK_release/LCK_convert that also checks out from the engine and thus may potentially cause races. We simply cannot guarantee that AST cannot interfere with the main attachment, it's just about higher or lower probability. If the current code is fragile about that, I'd say this deserves fixing anyway.

hvlad commented 1 month ago

Lock manager checks out from the engine in wait_for_request, blocking_action and shutdownOwner. The only concern here is wait_for_request - as blocking_action is called by wait_for_request and shutdownOwner is called when attachment is released.

LCK_release doesn't call wait_for_request thus its never checks out from the engine.

LCK_convert usually should not wait when converts to the lower lock level, also it not wait when called with LCK_NO_WAIT. It is easy to control, IMO. Am I wrong ?

And I still sure that AST routine should avoid calling expensive routines, such as query execution. Didn't check if Firebird already have such code, though ;)

dyemanov commented 1 month ago

Seems you're right, I was more pessimistic about the LM calls than necessary ;-)

And I still sure that AST routine should avoid calling expensive routines, such as query execution.

I'm not arguing with that, just thought it was not the only possible reason for a checkout.