aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0
11.65k stars 3.91k forks source link

aws-redshift: Granting privileges to multiple users on a single table can cause concurrency errors #22852

Open ricky-escobar opened 1 year ago

ricky-escobar commented 1 year ago

Describe the bug

With Redshift, concurrency errors are possible when running multiple GRANT statements for different users on the same table at the same time; for example, GRANT SELECT ON Table TO User1 and GRANT SELECT ON Table TO User2. However, the Redshift Table/User CDK constructs do not account for this. During CloudFormation deployment, multiple TablePrivileges resources can create/update at the same time, which can cause concurrency errors and stack update failures.

Expected Behavior

When creating/updating multiple users with privileges for common tables, the CloudFormation stack update always succeeds with no errors.

Current Behavior

When creating/updating multiple users with privileges for common tables, the CloudFormation stack update sometimes fails, with the following error for a TablePrivileges resource:

Received response status [FAILED] from custom resource. Message returned: Statement status was FAILED: ERROR: could not complete because of conflict with concurrent transaction 
Logs: /aws/lambda/MyStack-QueryRedshiftDatabase3de-0JlACCPDaXyQ at waitForStatementComplete (/var/task/util.js:34:15) 
at processTicksAndRejections (internal/process/task_queues.js:95:5) 
at async Object.executeStatement (/var/task/util.js:18:5) 
at async Promise.all (index 5) at async grantPrivileges (/var/task/privileges.js:34:5) 
at async updatePrivileges (/var/task/privileges.js:52:9) 
at async handler (/var/task/privileges.js:18:29) (RequestId: XXX)

This is sensitive to the exact times that the concurrent custom resource Lambdas execute the statements. So the failure does not always. The failure is more likely to happen the more users and tables are involved. Reproducing this was very inconsistent on our end. We got 10 failures in a row when doing deploying to our production account, but struggled to reproduce it in our test accounts. We have 6 tables and 2 users.

Reproduction Steps

const users = Array.from(Array(10).keys()).map(n => new User(this, `User${n}`, {
  adminUser: clusterAdminUserSecret,
  cluster: redshiftCluster,
  databaseName: databaseName,
  encryptionKey: kmsKey,
  username: `User${n}`,
}));

const tables = Array.from(Array(10).keys()).map(n => new Table(this, `Table${n}`, {
  cluster: redshiftCluster,
  adminUser: clusterAdminUserSecret,
  databaseName: databaseName,
  tableName: `Table${n}`,
  tableColumns: [{name: 'field', dataType: 'int'}],
}));

for (const user of users) {
  for (const table of tables) {
    table.grant(user, TableAction.SELECT, TableAction.INSERT);
  }
}

Possible Solution

We worked around this on our end by introducing an arbitrary dependency between the Users, forcing them to be updated sequentially rather than concurrently. Something similar could potentially be done internally. In case this is difficult to fix, the documentation can be updated to clearly call out this known issue and suggest this workaround to the user.

user2.node.addDependency(user1);

Additional Information/Context

No response

CDK CLI Version

1.125.0 (build 67b4921)

Framework Version

No response

Node.js Version

12.x

OS

Amazon Linux 2

Language

Typescript

Language Version

No response

Other information

No response

manmartgarc commented 7 months ago

I tried the work around and got cyclic dependency errors.

Same here

manmartgarc commented 6 months ago

Is this because each user creates its own Lambda that runs concurrently as opposed to all of the users being creates/assigned to a same lambda that calls a BatchExecuteStatement call to the Redshift Data API?

For example, here is where the table.grant method is defined. It calls the user.addTablePrivileges method, which in turn calls the UserTablePrivileges custom construct.

It looks like every call to table.grant will actually create a separate construct and execute the GRANT statement concurrently, which is what's causing the error. Ideally all of the changes associated with a particular construct (a table) should be part of the same transaction.

manmartgarc commented 6 months ago

Can we control the concurrency limit of the lambda function itself? What if we set it up to one concurrent execution maximum?

This is probably not the ideal solution, but I think it's way easier than figuring out a way to accumulate all the statements and then execute them all in a single call to the the API

manmartgarc commented 2 months ago

I am not sure if that would help. I think the simple solution is to run all these statements in a single transaction block; that way we avoid concurrent attempts to get a hold of the mutex.

But I am not sure if this is trivial. For example, if through CDK we generate 100 statements (100 different users getting read access to the same table) then this can be put into the same transaction. However, if there are 100 statements, where 50 give 50 users read access to table A and the other 50 statements give the same users read access to table B we would have to collect the statements per resource, i.e. table. So that in the latter case two transactions are sent to the Redshift Data API: 1 for table A and one for table B.

manmartgarc commented 2 months ago

Sorry but I am not sure how the lazy eval would help here? It seems the main issue with the current approach is due to concurrently attempting to change the same Redshift resource (my guess is it's the table) in different transactions. I think that this will keep happening as long as we have concurrent executions of the same Lambda trying to change the same object.