alephium / explorer-backend

The explorer backend for Alephium protocol
https://explorer.alephium.org
GNU Lesser General Public License v3.0

`org.postgresql.util.PSQLException`: ERROR: deadlock detected #323

Open simerplaha opened 2 years ago

simerplaha commented 2 years ago

ERROR: deadlock detected in build - TokenSupplyServiceSpec

[info] TokenSupplyServiceSpec:
[info] org.alephium.explorer.service.TokenSupplyServiceSpec ABORTED
[info] org.postgresql.util.PSQLException: ERROR: deadlock detected
[info] Detail: Process 1425 waits for AccessExclusiveLock on relation 16750 of database 13757; blocked by process 1426.
[info] Process 1426 waits for RowExclusiveLock on relation 16728 of database 13757; blocked by process 1425.
[info] Hint: See server log for query details.
[info] at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2533)
[info] at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2268)
[info] at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:313)
[info] at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:448)
[info] at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:369)
[info] at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:159)
[info] at org.postgresql.jdbc.PgPreparedStatement.execute(PgPreparedStatement.java:148)
[info] at slick.jdbc.StatementInvoker.results(StatementInvoker.scala:39)
[info] at slick.jdbc.StatementInvoker.iteratorTo(StatementInvoker.scala:22)
[info] at slick.jdbc.Invoker.first(Invoker.scala:30)
[info] ...

simerplaha commented 2 years ago

Possible cause

The good old deadlock issue. From the error, I'm assuming it occurred because sbt runs each test class concurrently, so we got to witness this rare event. The error message mentions two processes, 1425 and 1426, so I'm assuming these are two tests run by sbt on two different threads.

Replicating issue locally

Looking at the queries executed by TokenSupplyServiceSpec to figure out where this could happen, BlockQueries.updateMainChainStatusAction seems to be the most likely one to cause this.

To confirm, we can replicate a similar error message with the debug test below, which fails with:

The future returned an exception of type: org.postgresql.util.PSQLException, with message:
ERROR: deadlock detected
Detail: Process 60779 waits for ShareLock on transaction 61990937; blocked by process 60770.
Process 60770 waits for ShareLock on transaction 61990946; blocked by process 60779.

"deadlock debug" in {
  implicit val groupSettings: GroupSetting = GroupSetting(1)

  val genBlock = //generate test data
    for {
      chainFrom <- groupIndexGen
      chainTo   <- groupIndexGen
      entity    <- blockEntityGen(chainFrom, chainTo, None)
    } yield entity

  val block = genBlock.sample.get //generate single block
  BlockDao.insertAll(Seq(block)).futureValue //insert the block

  val futures: Seq[Future[Unit]] = //concurrently invoke updateMainChainStatusAction
    Array.fill(100)(run(BlockQueries.updateMainChainStatusAction(block.hash, true)))

  Future.sequence(futures).futureValue //fails here with message: ERROR: deadlock detected
}

Is this a production problem?

The query BlockQueries.updateMainChainStatusAction is being called multiple times, recursively, within functions in BlockDao and SanityChecker.

These functions are concurrent, recursive and non-transactional. I'm not fully sure if this deadlock could occur in both these functions in production. They are probably safe? Would be a @tdroxler question.

But SanityChecker might be problematic, because the variables running, i and totalNbOfBlocks are not thread-safe. Two threads executing SanityChecker.check() might lead to this deadlock. This could be looked into within #242?

Finding the actual query

Following the error message's hint: if we could get access to those server logs, we'd know for sure which query it is. Is this possible @killerwhile?

Hint: See server log for query details.
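
If those server logs stay out of reach, a rough alternative (an assumption on my part, nothing like this exists in the codebase) is to query pg_locks from the test while the lock wait is pending, which would also resolve relation OIDs like 16750/16728 to table names:

// Hypothetical debugging query, assuming the Slick profile api (sql interpolator) is in scope.
// Lists ungranted locks together with the table each one is waiting on.
val pendingLocks =
  sql"""
    SELECT l.pid, l.locktype, l.mode, c.relname
    FROM pg_locks l
    LEFT JOIN pg_class c ON c.oid = l.relation
    WHERE NOT l.granted
  """.as[(Int, String, String, Option[String])]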

Possible solution

These functions in BlockDao and SanityChecker need to be safe and non-concurrent. Transactional if needed?

BlockQueries.updateMainChainStatusAction might need some sort of safety around it so that it does not get executed concurrently, and without blocking threads.
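
A rough sketch of one way to do that, purely as an illustration (none of this exists in the repo): chain calls on a single Future so they run one at a time, without parking a thread on the DB call itself; the lock below only guards the pointer swap.

import scala.concurrent.{ExecutionContext, Future}

// Hypothetical helper: runs submitted tasks strictly one after another.
final class SerialExecutor(implicit ec: ExecutionContext) {
  private var tail: Future[Any] = Future.unit

  def submit[T](task: => Future[T]): Future[T] = synchronized {
    // Start `task` only after every previously submitted task has completed,
    // whether it succeeded or failed.
    val next = tail.transformWith(_ => task)
    tail = next
    next
  }
}

// e.g. serial.submit(run(BlockQueries.updateMainChainStatusAction(hash, true)))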

These variables running, i and totalNbOfBlocks in SanityChecker need to be thread-safe.
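
For the flags, something along these lines (a sketch; the field names are taken from this issue, not the real SanityChecker code) would also stop two overlapping check() runs:

import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}

// Hypothetical sketch of thread-safe state for SanityChecker.
object SanityCheckerState {
  private val running         = new AtomicBoolean(false)
  private val i               = new AtomicInteger(0)
  private val totalNbOfBlocks = new AtomicInteger(0)

  // compareAndSet guarantees only one check() runs at a time; concurrent callers bail out.
  def check(body: () => Unit): Unit =
    if (running.compareAndSet(false, true)) {
      try {
        i.set(0)               // counters are reset per run...
        totalNbOfBlocks.set(0) // ...and advanced with incrementAndGet() inside `body`
        body()
      } finally running.set(false)
    }
}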

Thoughts?

tdroxler commented 2 years ago

Amazing analysis as always.

We could indeed tackle #242; we don't run SanityChecker that often, but it would be a shame if it crashed the app ^^

For the BlockDao function, we could tackle #155 at the same time, but I'm not sure it will bring more safety. As you mention, I think in production it's safe, but can we still do something to avoid a flaky test again? Maybe just trying not to do what I did there and use it concurrently?

simerplaha commented 2 years ago

we don't run SanityChecker that often, but it would be a shame if it crashed the app ^^

Oh, I doubt it would crash the app. We would just see that deadlock, which would time out, and Postgres would terminate the transactions.

The worry would be data getting partially persisted, because some calls are not transactional. But like you said, it's safe, so this might not happen either. We should probably still aim for strong guarantees via types.

Also, since Scheduler does not reschedule on failure, whatever background service is running this query would stop.

maybe just trying not to do what I did there and use it concurrently?

Like change it to be concurrent?

tdroxler commented 2 years ago

Like change it to be concurrent?

Or anything else so we don't have a flaky test :)

simerplaha commented 2 years ago

I guess we can't make it concurrent because that's causing the problem.

If I'm not wrong about sbt parallel execution causing this, then we can temporarily turn that off, but our builds are already very slow:

parallelExecution in Test := false
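
In newer sbt the same setting uses slash syntax; a softer option (just an idea, not something we've tried) would be to cap concurrently running test tasks instead of disabling parallelism entirely:

// build.sbt sketch (assumptions about our build, untested)
Test / parallelExecution := false                            // sbt 1.x slash syntax for the setting above

Global / concurrentRestrictions += Tags.limit(Tags.Test, 1)  // or: run test tasks one at a time, keep everything else parallel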

Not sure what else 🤔 ... We could wrap that query updateMainChainStatusAction in a synchronized block, but that's just criminal 💀. I'm not sure this would work anyway because the query is returning a DBIOAction.

I will start on #242 as a subtask of this task, should figure something out as I work on it.

tdroxler commented 2 years ago

FYI I got the issue in this run.

No need to tackle this now, it's still very rare

simerplaha commented 1 year ago

Deadlock occurred again.

tdroxler commented 1 year ago

Got it again; it's always on the TokenSupplyServiceSpec. I'll continue debugging this as it's always painful to get a failed build due to a flaky test.

tdroxler commented 1 year ago

When moving from H2 to PSQL for our tests, I knew it could be an issue as every test was using the same db. I'm trying in this branch to use one database per test; I still need to tweak it a bit, but it looks promising.

simerplaha commented 1 year ago

Documenting further investigation on this

Got curious. Wanted to see if there is an easy fix for this.

To narrow down the problematic tables, I used the following SQL query to replace this typed query. The deadlock occurs even if the queries are not transactional. It turns out the problem is not specific to one table; it's just how Postgres behaves for all tables. Concurrently mutating the same rows will always risk a deadlock.

Reading up, folks say there is no Postgres-level configuration that can fix this. Application-level code should manage the order of query execution to avoid these deadlocks (see the ordering sketch after the query below).

def updateMainChainQuery2(mainChain: Boolean, blocks: ArraySeq[BlockHash]) = {
  // e.g. "block_hash = ? OR block_hash = ? OR ..." (one placeholder per block)
  def whereClause(columnName: String): String =
    Array.fill(blocks.size)(s"$columnName = ?").mkString(" OR ")

  val query =
    s"""
       |BEGIN;
       |UPDATE transactions              SET main_chain = ? WHERE ${whereClause("block_hash")};
       |UPDATE outputs                   SET main_chain = ? WHERE ${whereClause("block_hash")};
       |UPDATE inputs                    SET main_chain = ? WHERE ${whereClause("block_hash")};
       |UPDATE block_headers             SET main_chain = ? WHERE ${whereClause("hash")};
       |UPDATE transaction_per_addresses SET main_chain = ? WHERE ${whereClause("block_hash")};
       |UPDATE transaction_per_token     SET main_chain = ? WHERE ${whereClause("block_hash")};
       |UPDATE token_tx_per_addresses    SET main_chain = ? WHERE ${whereClause("block_hash")};
       |UPDATE token_outputs             SET main_chain = ? WHERE ${whereClause("block_hash")};
       |COMMIT;
       |
       |""".stripMargin

  // one (mainChain, blocks) parameter group per UPDATE statement above
  val parameters: SetParameter[Unit] =
    (_: Unit, params: PositionedParameters) =>
      (1 to 8) foreach { _ =>
        params >> mainChain
        blocks foreach (params >> _)
      }

  SQLActionBuilder(
    queryParts = query,
    unitPConv = parameters
  ).asUpdate
}
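
As an illustration of that application-level ordering (assuming the cycle comes from concurrent callers touching the same block hashes in different orders, which the server logs would have to confirm): if every caller processes the hashes in one agreed total order, lock acquisition can no longer form a cycle. Roughly:

import scala.collection.immutable.ArraySeq

// Hypothetical pre-step before building the UPDATEs: impose one total order on the
// block hashes so all concurrent callers lock the same rows in the same order.
def orderedForUpdate(blocks: ArraySeq[BlockHash]): ArraySeq[BlockHash] =
  blocks.sortBy(_.toString) // assumes BlockHash has a stable textual form

// updateMainChainQuery2(mainChain, orderedForUpdate(blocks))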

A not-so-good solution

A not-so-good solution would be to simply retry the query, which people seem to suggest.

We could hook this untested function onto a specific query, or just into Runner, to retry a query n times with delays when a deadlock occurs (or no delays, just retry on deadlock). Yeah, I know this sounds bad :(

Quoting this, I understand that one transaction will still be allowed to go through, so in theory retrying would work and all deadlocked queries would eventually succeed on retries:

if transaction 1 acquires an exclusive lock on table A and then tries to acquire an exclusive lock on table B, while transaction 2 has already exclusive-locked table B and now wants an exclusive lock on table A, then neither one can proceed. PostgreSQL automatically detects deadlock situations and resolves them by aborting one of the transactions involved, allowing the other(s) to complete.

def retryOnDeadlock[T](retryAttempts: Int = 10,
                       retryDelay: FiniteDuration = 2.seconds,
                       previousAttemptError: Option[PSQLException] = None)
                      (f: => Future[T])
                      (implicit ec: ExecutionContext, scheduler: Scheduler): Future[T] =
  if (retryAttempts <= 0) //ran out of attempts. Exit with previous error
    previousAttemptError match {
      case Some(failure) => Future.failed(failure)
      case None          => Future.failed(new Exception(s"Invalid retryAttempts input $retryAttempts"))
    }
  else
    f recoverWith { //rerun/retry query and recover if there is a deadlock
      case error: PSQLException if error.getMessage.contains("deadlock") =>
        scheduler.scheduleOnce(s"Deadlock Retry. Attempt $retryAttempts", retryDelay) {
          retryOnDeadlock(
            retryAttempts = retryAttempts - 1,
            previousAttemptError = Some(error)
          )(f)
        }
    }
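
Hooking it up would look roughly like this (just a sketch, names assumed, nothing is wired in yet):

// Hypothetical usage: retry only this specific write when Postgres reports a deadlock,
// leaving every other failure untouched.
def updateMainChainStatusWithRetry(hash: BlockHash, mainChain: Boolean)(
    implicit ec: ExecutionContext,
    scheduler: Scheduler) =
  retryOnDeadlock() {
    run(BlockQueries.updateMainChainStatusAction(hash, mainChain))
  }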

A neat solution

It would be neat if sync just performed reads initially, built the new state in memory, and then persisted that state, with no concurrency, to a single table to move sync forward. But this is a big change, so probably not doable right now.

Changing transaction isolation level

In LSM-based storage engines these deadlocks don't occur because of the different architecture, and I don't have any experience playing with Postgres's transaction isolation levels.

But using the SERIALIZABLE isolation level didn't work:

ERROR: could not serialize access due to concurrent update.

I will see if other isolation levels can help at all.
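
For reference, the experiment amounts to roughly this (assuming the action is wrapped with Slick's transactionally; blockHash is a placeholder, not the exact test code):

import slick.jdbc.TransactionIsolation

// Sketch of the SERIALIZABLE experiment (requires the profile api in scope for
// .transactionally / .withTransactionIsolation): the deadlock disappears, but Postgres
// instead aborts one transaction with "could not serialize access due to concurrent
// update", so the application would still have to retry.
val serializableUpdate =
  BlockQueries
    .updateMainChainStatusAction(blockHash, true)
    .transactionally
    .withTransactionIsolation(TransactionIsolation.Serializable)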

Still looking for a simpler solution.

tdroxler commented 1 year ago

Thx for the nice investigation.

I also thought of the retry solution, but as you said it's not a very nice solution; it could be a temporary fix if we see it happening a lot in prod. Last time it happened, it was because there were multiple ReadWrite instances running; otherwise I don't think we saw it happen, no? Only in tests.

simerplaha commented 1 year ago

You are right. I've also only seen this when running tests.

But I wouldn't be surprised if it happens in production in the future. Some new code concurrently executing a function that mutates a table is all it needs to deadlock.

Temporarily un-assigning myself as I'm not currently working on this.

But if we could, possibly in the future, guarantee that our code never leads to a deadlock under any situation, that would be very nice.

For now, I can confirm: I've only seen it occur during tests.