Open tillprochaska opened 2 months ago
SQL query hint:
select group_id, member_id, count(*) from role_membership group by group_id, member_id having count(*) > 1
It looks like a UNIQUE constraint on (group_id, member_id)
would avoid these situations (but I'm not sure of more underlying issues).
For future reference: While looking a little closer at logs when testing a fix for this, I noticed that in one case, there were two overlapping OAuth requests happening at the same time, processed by different pods. Both requests had a separate auth code and state, i.e. both requests were part of separate OAuth flows.
In order to sync groups, Aleph first clears all current group memberships, then inserts new group memberships based on the data in the OAuth token retrieved from the ID provider. However, Aleph does this in a single transaction, so I wouldn’t expect this to be the issue.
For future reference: While looking a little closer at logs when testing a fix for this, I noticed that in one case, there were two overlapping OAuth requests happening at the same time, processed by different pods. Both requests had a separate auth code and state, i.e. both requests were part of separate OAuth flows.
Some additional context:
I think the issue is that while deleting and inserting the role memberships is part of a single transaction, determining which role memberships to delete is not. When clearing role memberships for a given user, SQLAlchemy doesn’t construct a query like this:
DELETE FROM role_membership WHERE member_id = 123
Instead, it deletes specific rows based on data read before:
DELETE FROM role_membership WHERE member_id = 123 AND group_id = 456
DELETE FROM role_membership WHERE member_id = 123 AND group_id = 789
If a user is added to a new group, the following race condition can occur when two overlapping OAuth requests are handled:
t | Request 1 | Request 2 |
---|---|---|
1 | Begin transaction | |
2 | Fetch role memberships for user (Group A, Group B) | |
3 | End transaction | |
4 | Begin transaction | |
5 | Fetch role memberships for user (Group A, Group B) | |
6 | End transaction | |
7 | Begin transaction | |
8 | Delete role memberships for user (Group A, Group B) | |
9 | Insert new role memberships (Group A, Group B, Group C) | |
10 | End transaction | |
11 | Begin transaction | |
12 | Delete role memberships for user (Group A, Group B) | |
13 | Insert new role memberships (Group A, Group B, Group C) | |
14 | End transaction |
As a result, the role membership for Group C is stored twice.
Describe the bug Some users experience a 500 error when signing in using OAuth.
The
role_membership
table stores which groups a user is part of. It has two columns,group_id
androle_id
(migration). However, it doesn’t define any constraints (composite primary key or a unique constraint, so it is possible to add duplicate rows to the table.Duplicate rows in this table can cause errors during the OAuth callback. After authenticating a user, Aleph "syncs" the groups the user is part of. It does this by first deleting all rows in the
role_membership
for the user, then adding new rows for all of the groups the users is currently part of. Groups are encoded in the OAuth token returned by the identity provider, e.g. Keycloak.In case of duplicate rows, there’s a mismatch between the number of rows SQLAlchemy expects to delete and the number of rows that are actually deleted. This results in the following error:
To Reproduce Steps to reproduce the behavior:
role_membership
table, you should see one row representing the group membership set up in step 2.Aleph version 3.17.0 and 4.0.0-rc33
Additional context