depot / kysely-planetscale

A Kysely dialect for PlanetScale Serverless
https://depot.dev/blog/kysely-dialect-planetscale
MIT License
351 stars 15 forks source link

Throwing an error during a transaction doesn't allow the lock to roll back #20

Closed OultimoCoder closed 1 year ago

OultimoCoder commented 1 year ago

I am using the latest version.

Here is my code:

export const createOauthUser = async (
  providerUser: OauthUser,
  databaseConfig: Config['database']
) => {
  const db = getDBClient(databaseConfig)
  try {
    await db.transaction().execute(async (trx) => {
      const userId = await trx
        .insertInto('user')
        .values({
          name: providerUser.name,
          email: providerUser.email,
          is_email_verified: true,
          password: null,
          role: 'user'
        })
        .executeTakeFirstOrThrow()
      console.log(userId)
      await trx
        .insertInto('authorisations')
        .values({
          user_id: Number(userId.insertId),
          provider_type: providerUser.providerType,
          provider_user_id: providerUser.id.toString()
        })
        .executeTakeFirstOrThrow()
      console.log('ended')
      return userId
    })
  } catch (error) {
    try {
      await db.deleteFrom('user').where('user.email', '=', providerUser.email).execute()
    } catch (err) {
      console.log(err)
    }
    console.log('here')
    throw new ApiError(
      httpStatus.FORBIDDEN,
      `Cannot signup with ${providerUser.providerType}, user already exists with that email`
    )
  }
  const user = await getUserByProviderIdType(
    providerUser.id.toString(), providerUser.providerType, databaseConfig
  )
  return User.convert(user)
}

I added the delete statement in the try catch purely to handle an error I am having. If I throw an error during my transaction it should rollback the transaction and release the locks but I instead get this error intermittently (the other times it completes in under 20 seconds):

DatabaseError: target: dictionary-api.-.primary: vttablet: rpc error: code = DeadlineExceeded desc = Lock wait timeout exceeded; try restarting transaction (errno 1205) (sqlstate HY000) (CallerID: ihkeoohzfw2oodprmc73): Sql: "delete from `user` where `user`.email = :user_email", BindVars: {REDACTED}
        at Connection.execute (/backend/node_modules/@planetscale/database/dist/index.js:78:19)
        at processTicksAndRejections (node:internal/process/task_queues:95:5)
        at _PlanetScaleConnection.executeQuery (/backend/node_modules/kysely-planetscale/dist/index.js:92:21)
        at /backend/node_modules/kysely/dist/esm/query-executor/query-executor-base.js:35:28
        at DefaultConnectionProvider.provideConnection (/backend/node_modules/kysely/dist/esm/driver/default-connection-provider.js:10:20)
        at DefaultQueryExecutor.executeQuery (/backend/node_modules/kysely/dist/esm/query-executor/query-executor-base.js:34:16)
        at DeleteQueryBuilder.execute (/backend/node_modules/kysely/dist/esm/query-builder/delete-query-builder.js:392:24)
        at Module.createOauthUser (/backend/src/services/user.service.ts:65:7)
        at Module.loginOrCreateUserWithOauth (/backend/src/services/auth.service.ts:93:19)
        at oauthCallback (/backend/src/controllers/auth/oauth/oauth.controller.ts:27:16) {
      status: 400,
      body: {
        message: 'target: dictionary-api.-.primary: vttablet: rpc error: code = DeadlineExceeded desc = Lock wait timeout exceeded; try restarting transaction (errno 1205) (sqlstate HY000) (CallerID: ihkeoohzfw2oodprmc73): Sql: "delete from `user` where `user`.email = :user_email", BindVars: {REDACTED}',
        code: 'UNKNOWN'
      }
    }

This is the sole test I am running and the sole code executing.

OultimoCoder commented 1 year ago

I found this upgrading from 1.1 to 1.3

jacobwgillespie commented 1 year ago

@OultimoCoder this used to work for you with version 1.1? I'm not seeing anything in the diff that would directly change the behavior of transactions: https://github.com/depot/kysely-planetscale/compare/v1.1.0...v1.3.0.

Did you upgrade anything else at the same time (kysely or @planetscale/database)? It still could be something here, but if not, the other thing to check would be if another request was holding a lock that conflicted with the DELETE, I'm not super familiar with what Vitess locks on delete.

OultimoCoder commented 1 year ago

Yeah 1.2 works fine but going to 1.3 breaks it, nothing else changes. I tested just now only incrementing this package version until it broke.

jacobwgillespie commented 1 year ago

Interesting - are you setting the useSharedConnection config option? That's the thing that 1.3 added, but it's supposed to be disabled by default.

OultimoCoder commented 1 year ago

I haven't made any changes from 1.1, just upgraded the package to 1.3.

OultimoCoder commented 1 year ago

Could it be related to the changes made that fixed this other issue I reported with transactions in 1.3 #17

jacobwgillespie commented 1 year ago

Well v1.2.0 should have that fix included, so if you're not able to reproduce on 1.2, then it seems unrelated to that.

OultimoCoder commented 1 year ago
import { Kysely } from 'kysely'
import { PlanetScaleDialect } from 'kysely-planetscale'
import { ApiKeyTable } from '../models/apiKey.model'
import { AuthProviderTable } from '../models/authProvider.model'
import { EntriesTable } from '../models/entry.model'
import { FieldSourcesTable } from '../models/fieldSource.model'
import { SentencesTable } from '../models/sentence.model'
import { SourcesTable } from '../models/source.model'
import { UserTable } from '../models/user.model'
import { Config } from './config'

let dbClient: Kysely<Database>

export interface Database {
  user: UserTable,
  authorisations: AuthProviderTable
  api_keys: ApiKeyTable,
  sentences: SentencesTable,
  entries: EntriesTable,
  sources: SourcesTable,
  field_sources: FieldSourcesTable
}

export const getDBClient = (databaseConfig: Config['database']) => {
  dbClient =
    dbClient ||
    new Kysely<Database>({
      dialect: new PlanetScaleDialect({
        username: databaseConfig.username,
        password: databaseConfig.password,
        host: databaseConfig.host
      })
    })
  return dbClient
}

This is how my client is instantiated, I have done 0 code changes when testing upgrades.

OultimoCoder commented 1 year ago

And I did a repeat test just to check, 100% something introduced between 1.2 and 1.3 which causes this error.

jacobwgillespie commented 1 year ago

Oh actually 1.2.1 included the transaction fixes from #17, 1.2.0 did not. Did you try it with 1.2.1, or just 1.2.0?

OultimoCoder commented 1 year ago

Fails with 1.2.1!

OultimoCoder commented 1 year ago

I actually had another project that was using this package and I created a support ticket on planetscale as two queries at the same time were exceeding the transaction deadline of 20 seconds and they were doing like 1 read and 2 updates in the transaction and averaged a second total and I was very confused.

I would bet its the exact same issue as I was running version 1.3.

jacobwgillespie commented 1 year ago

Yeah this is interesting, it's almost as if the rollback is not called? 1.2.1 updated the logic to await the rollback (and everything else), but I'm not sure how that would cause the issue you're seeing.

@koskimas do you see anything I'm doing incorrectly here? v1.2.0...v1.2.1#diff-a2a171449d. I'm not super familiar with what Kysely internals do when an error is thrown inside a transaction.

sromic commented 1 year ago

@OultimoCoder @jacobwgillespie I was able to replicate this issue but only when I put delete statement into catch block within the transaction scope, but not like it's in the original code.

Now haven't find the root cause for this 20 sec query execution time so far, but there is something what could help @OultimoCoder to mitigate this problem.

Basically, I found that v1.2.0 doesn't perform rollbacks in case of an error within the transaction - I guess that's why in the given code there is a delete statement in case of the error. Starting with v1.2.1 this seems to be fixed, and rollbacks are performed - inserts aren't committed.

So with rollbacks truly supported now, @OultimoCoder you should be able to get rid of delete statement in catch block and have rollback do it for you.

OultimoCoder commented 1 year ago

I originally found this when running tests using jest and a delete test would run after a test that intentionally errored in the catch block.

The delete code in the catch block is just an easy way of replicating it. I only put it there when I want to replicate the issue.

My issue is the 20 second timeout with the locks not being released after an error in the catch block.

OultimoCoder commented 1 year ago

Any update on this? :(

jacobwgillespie commented 1 year ago

I haven't had a chance to investigate further yet - the thing that was fixed in v1.2.1 (#10) was that transaction commands (BEGIN, COMMIT, ROLLBACK) weren't actually being awaited, so they could fail silently.

What I don't know here is if there's an issue inside kysely-planetscale or @planetscale/database that's preventing the normal transaction semantics from working, or if there's a legitimate lock contention in PlanetScale / Vitess that is only now showing up for you because previously it would have silently failed.

OultimoCoder commented 1 year ago

Does the demo code not prove there is no legitimate lock contention?

A failed transaction followed by a delete should always work, no?

I might have an investigate into this too then see if I can find out anymore.

jacobwgillespie commented 1 year ago

A failed transaction followed by a delete should always work, no?

Possibly - how much traffic is this app getting? If you're getting this with a single request, then I'd expect it to work, but if there's concurrent requests / queries then it gets more complicated.

OultimoCoder commented 1 year ago

Tested locally. Repeatable and reproducible with a single request or integration test. 0 concurrency. About 75% of the time it times out, other 25% it completes in about 18-19 seconds.

In a separate app, I had the same issue in prod if there were 2 requests at the same time and one transaction failed.

The demo code was just added (I never normally delete in the try catch) to make it reproducible with one request.

OultimoCoder commented 1 year ago

Ill write a reproducible script for this.

OultimoCoder commented 1 year ago

@jacobwgillespie hey!

I have a repo here which reproduces the issue pretty consistently. Very minimal code with a quick readme on how to run it (you just need a fresh planetscale db/branch).

https://github.com/OultimoCoder/kysely-planetscale-issue-20

jacobwgillespie commented 1 year ago

@OultimoCoder thank you for that reproduction, that's super helpful - I ran it with quite a bit of logging inside the kysely-planetscale code, and it appeared to be issuing the expected queries, it correctly created a separate connection for the transaction, etc.

I adapted it to use @planetscale/database directly, with raw SQL queries, and it's still timing out due to a lock. You can see the updated code here: https://github.com/jacobwgillespie/kysely-planetscale-issue-20.

So, I think this is either an issue inside @planetscale/database, or some misunderstanding of how locks are working with PlanetScale. I would assume the former, since it feels like a single transaction, with ROLLBACK shouldn't affect anything like you said.

OultimoCoder commented 1 year ago

Thanks so much for looking into this! Should I raise this issue on the @planetscale/database repo and point to both of our codebases?

jacobwgillespie commented 1 year ago

I think that makes sense, yeah, they're pretty responsive over there 👍

jacobwgillespie commented 1 year ago

This was fixed in @planetscale/database v1.10.0 🎉

https://github.com/planetscale/database-js/releases/tag/v1.10.0