simerplaha opened 2 years ago
The good old deadlock issue. From the error, I'm assuming it occurred because sbt runs each test class concurrently, so we got to witness this rare event. The error message mentions two processes, 1425 and 1426, so I'm assuming these are two tests run by sbt in two different threads.

Looking at the queries executed by `TokenSupplyServiceSpec` to figure out where this could happen, `BlockQueries.updateMainChainStatusAction` seems to be the most likely one to cause this.
To confirm, we can replicate a similar error message with the following debug test:

> The future returned an exception of type: org.postgresql.util.PSQLException, with message: ERROR: deadlock detected Detail: Process 60779 waits for ShareLock on transaction 61990937; blocked by process 60770. Process 60770 waits for ShareLock on transaction 61990946; blocked by process 60779.
"deadlock debug" in {
implicit val groupSettings: GroupSetting = GroupSetting(1)
val genBlock = //generate test data
for {
chainFrom <- groupIndexGen
chainTo <- groupIndexGen
entity <- blockEntityGen(chainFrom, chainTo, None)
} yield entity
val block = genBlock.sample.get //generate single block
BlockDao.insertAll(Seq(block)).futureValue //insert the block
val futures: Seq[Future[Unit]] = //concurrently invoke updateMainChainStatusAction
Array.fill(100)(run(BlockQueries.updateMainChainStatusAction(block.hash, true)))
Future.sequence(futures).futureValue //fails here with message: ERROR: deadlock detected
}
The query `BlockQueries.updateMainChainStatusAction` is called multiple times, recursively, within functions in `BlockDao` and `SanityChecker`. These functions are concurrent, recursive and non-transactional. I'm not fully sure whether this deadlock could occur in both of these functions in production. They are probably safe? That would be a @tdroxler question.
But `SanityChecker` might be problematic because the variables `running`, `i` and `totalNbOfBlocks` are not thread-safe. Two threads executing `SanityChecker.check()` might lead to this deadlock. This could be looked into within #242?
Following the hint in the error message:

> Hint: See server log for query details.

If we could get access to those server logs we'd know for sure which query it was. Is this possible @killerwhile?
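As an aside, if we do get access to those server logs, Postgres can be told to log the queries involved in lock waits. These are real `postgresql.conf` settings (the values shown are just a suggestion, not our current config):

```
# postgresql.conf
deadlock_timeout = 1s   # how long to wait on a lock before running the deadlock check
log_lock_waits = on     # log a message whenever a session waits longer than deadlock_timeout
```

With `log_lock_waits` enabled, the server log shows which statements were holding and waiting on the locks, which would answer the question above directly.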
Suggestions:

- These functions in `BlockDao` and `SanityChecker` need to be safe and non-concurrent. Transactional if needed?
- `BlockQueries.updateMainChainStatusAction` might need some sort of safety around it so it does not get executed concurrently, nor should it block threads.
- The variables `running`, `i` and `totalNbOfBlocks` in `SanityChecker` need to be thread-safe.
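To illustrate the last point, here is a minimal sketch (hypothetical names and structure, not the actual `SanityChecker`) of what making that state thread-safe could look like, using `java.util.concurrent.atomic`:

```scala
import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}

// Hypothetical sketch: SanityChecker's mutable state replaced with atomics.
// `running` acts as a guard so two threads can never execute check() at the
// same time; `i` and `totalNbOfBlocks` become counters that are safe to
// update from any thread.
object SanityCheckerState {
  private val running         = new AtomicBoolean(false)
  private val i               = new AtomicInteger(0)
  private val totalNbOfBlocks = new AtomicInteger(0)

  /** Runs `check` only if no other run is in progress; returns false otherwise. */
  def checkOnce(check: () => Unit): Boolean =
    if (running.compareAndSet(false, true)) {
      try {
        i.set(0) //reset progress for this run
        check()
        true
      } finally running.set(false)
    } else false

  /** Safe concurrent increment of the progress counter. */
  def blockChecked(): Int = i.incrementAndGet()
}
```

The `compareAndSet` guard means a second concurrent `check()` is rejected instead of racing the first one, which would also rule out the concurrent `updateMainChainStatusAction` calls coming from this code path.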
Thoughts?
Amazing analysis as always.
We could indeed tackle #242; we don't run `SanityChecker` that often, but it would be a shame if it crashed the app ^^

For the `BlockDao` function, we could tackle #155 at the same time, though I'm not sure it would bring more safety. As you mention, I think in production it's safe, but can we still do something to avoid a flaky test again? Maybe just try not to do what I did there and use it concurrently?
> we don't run `SanityChecker` that often but that would be a shame that it crashes the app ^^
Oh, I doubt it would crash the app. We would just see that deadlock, which would time out, and Postgres would terminate the transactions.

The worry would be data getting partially persisted, because some calls are not transactional. But like you said, it's safe, so this might not happen either. We should probably still aim for strong guarantees via types.

Since `Scheduler` does not reschedule on failure, whatever background service is running this query would stop.
> maybe just trying not do what I did there and use it concurrently?
Like change it to be concurrent?
> Like change it to be concurrent?

Or anything else so we don't have a flaky test :)
I guess we can't make it concurrent, because that's what is causing the problem.

If I'm right about sbt parallel execution causing this, then temporarily we can turn that off, but our builds are already very slow:

```scala
parallelExecution in Test := false
```
Not sure what else 🤔 ... We could wrap that query `updateMainChainStatusAction` in a `synchronized` block, but that's just criminal 💀. I'm not sure it would work anyway, because the query returns a `DBIOAction`.
I will start on #242 as a subtask of this task, should figure something out as I work on it.
FYI I got the issue in this run.
No need to tackle this now, it's still very rare
Deadlock occurred again.
Got it again, it's always on the `TokenSupplyServiceSpec`. I'll continue debugging this, as it's always painful to get a failed build due to a flaky test.
When moving from H2 to PSQL for our tests, I knew this could be an issue, as every test was using the same db.

I'm trying, in this branch, to use one database per test. I still need to tweak it a bit, but it looks promising.
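As a rough illustration of that "one database per test" idea (hypothetical helper, not the code from the branch), each spec could derive its own database name so concurrently running sbt suites never share tables:

```scala
import java.util.UUID

// Hypothetical sketch: a unique database name per test class. Two suites
// running concurrently under sbt then mutate different databases, so they
// can no longer deadlock against each other's row locks.
def testDatabaseName(specName: String): String =
  s"${specName.toLowerCase}_${UUID.randomUUID().toString.take(8)}"
```

The suite would create that database in `beforeAll` and drop it in `afterAll`; the random suffix also covers re-runs of the same spec against a shared Postgres instance.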
Got curious and wanted to see if there is an easy fix for this.

To narrow down the problematic tables, I used the following SQL query to replace this typed query. The deadlock occurs even if the queries are not transactional. Turns out the problem is not specific to one table; it's just how Postgres behaves for all tables: concurrent mutation of the same rows can always end in a deadlock.

Reading up, folks say there is no Postgres-level configuration that can fix this. Application-level code should manage the order of query execution to avoid these deadlocks.
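The usual application-level pattern is to acquire locks in one consistent global order. A minimal sketch of the ordering part (hypothetical helper, not from the codebase):

```scala
// Hypothetical sketch: if every transaction that mutates these tables
// touches rows in the same deterministic order (e.g. sorted by block hash),
// two transactions can still block each other, but their lock acquisitions
// can never form a cycle, so Postgres never has to abort one with
// "deadlock detected".
def orderForUpdate(blockHashes: Seq[String]): Seq[String] =
  blockHashes.distinct.sorted
```

Any two concurrent callers then issue their `UPDATE`s over the same hash sequence in the same order, which is exactly the cycle-free condition the deadlock error describes.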
```scala
def updateMainChainQuery2(mainChain: Boolean, blocks: ArraySeq[BlockHash]) = {
  def whereClause(columnName: String): String =
    Array.fill(blocks.size)(s"$columnName = ?").mkString(" OR ")

  val query =
    s"""
       |BEGIN;
       |UPDATE transactions SET main_chain = ? WHERE ${whereClause("block_hash")};
       |UPDATE outputs SET main_chain = ? WHERE ${whereClause("block_hash")};
       |UPDATE inputs SET main_chain = ? WHERE ${whereClause("block_hash")};
       |UPDATE block_headers SET main_chain = ? WHERE ${whereClause("hash")};
       |UPDATE transaction_per_addresses SET main_chain = ? WHERE ${whereClause("block_hash")};
       |UPDATE transaction_per_token SET main_chain = ? WHERE ${whereClause("block_hash")};
       |UPDATE token_tx_per_addresses SET main_chain = ? WHERE ${whereClause("block_hash")};
       |UPDATE token_outputs SET main_chain = ? WHERE ${whereClause("block_hash")};
       |COMMIT;
       |""".stripMargin

  val parameters: SetParameter[Unit] =
    (_: Unit, params: PositionedParameters) =>
      (1 to 8) foreach { _ => //one bind round per UPDATE statement above
        params >> mainChain
        blocks foreach (params >> _)
      }

  SQLActionBuilder(
    queryParts = query,
    unitPConv = parameters
  ).asUpdate
}
```
A not-so-good solution would be to simply retry the query, which is what people seem to suggest. We could hook this untested function to a specific query, or just into `Runner`, so that it retries a query `n` times with delays if a deadlock occurs (or with no delays, just retry on deadlock). Yeah, I know this sounds bad :(

Quoting this, I understand one transaction will still be allowed to go through, so in theory retrying would work and all deadlocked queries would eventually succeed on retries:

> if transaction 1 acquires an exclusive lock on table A and then tries to acquire an exclusive lock on table B, while transaction 2 has already exclusive-locked table B and now wants an exclusive lock on table A, then neither one can proceed. PostgreSQL automatically detects deadlock situations and resolves them by aborting one of the transactions involved, allowing the other(s) to complete.
```scala
def retryOnDeadlock[T](retryAttempts: Int = 10,
                       retryDelay: FiniteDuration = 2.seconds,
                       previousAttemptError: Option[PSQLException] = None)(f: => Future[T])(
    implicit ec: ExecutionContext,
    scheduler: Scheduler): Future[T] =
  if (retryAttempts <= 0) //ran out of attempts. Exit with the previous error
    previousAttemptError match {
      case Some(failure) => Future.failed(failure)
      case None          => Future.failed(new Exception(s"Invalid retryAttempts input $retryAttempts"))
    }
  else
    f recoverWith { //rerun/retry the query and recover if there is a deadlock
      case error: PSQLException if error.getMessage.contains("deadlock") =>
        scheduler.scheduleOnce(s"Deadlock Retry. Attempt $retryAttempts", retryDelay) {
          retryOnDeadlock(
            retryAttempts = retryAttempts - 1,
            previousAttemptError = Some(error)
          )(f)
        }
    }
```
It would be neat if sync initially performed only reads, built a state in-memory, and then persisted this new state, with no concurrency, to a single table to move sync forward. But this is a big change, so probably not doable right now.

In LSM-based storage engines these deadlocks don't occur, because of the different architecture. I don't have any experience playing with Postgres's transaction isolation levels, but using the `SERIALIZABLE` isolation level didn't work:

> ERROR: could not serialize access due to concurrent update.

I will see if other isolation levels can help at all. Still looking for a simpler solution.
Thx for the nice investigation.

I also thought of the retry solution, but as you said it's not a very nice one; it could be a temporary fix if we see it happening a lot in prod. Last time it happened, it was because there were multiple `ReadWrite` running. Otherwise I don't think we saw it happen, no? Only in tests.
You are right, I've also only seen this when running tests. But I wouldn't be surprised if it happens in production in the future. Some new code concurrently executing a function that mutates a table is all it takes to deadlock.

Temporarily un-assigning myself as I'm not currently working on this. But if we can, possibly in the future, guarantee that our code would never lead to a deadlock under any situation, that would be very nice. For now, I can confirm I've only seen it occur during tests.
`ERROR: deadlock detected` in build - `TokenSupplyServiceSpec` again.