alephdata / aleph

Search and browse documents and data; find the people and companies you look for.
http://docs.aleph.occrp.org
MIT License
2k stars 267 forks source link

BUG: Error during OAuth callback / missing constraints for `role_membership` table #3811

Open tillprochaska opened 2 months ago

tillprochaska commented 2 months ago

Describe the bug Some users experience a 500 error when signing in using OAuth.

The role_membership table stores which groups a user is part of. It has two columns, group_id and role_id (migration). However, it doesn’t define any constraints (composite primary key or a unique constraint, so it is possible to add duplicate rows to the table.

Duplicate rows in this table can cause errors during the OAuth callback. After authenticating a user, Aleph "syncs" the groups the user is part of. It does this by first deleting all rows in the role_membership for the user, then adding new rows for all of the groups the users is currently part of. Groups are encoded in the OAuth token returned by the identity provider, e.g. Keycloak.

In case of duplicate rows, there’s a mismatch between the number of rows SQLAlchemy expects to delete and the number of rows that are actually deleted. This results in the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask_cors/extension.py", line 176, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/aleph/aleph/views/sessions_api.py", line 123, in oauth_callback
    role = handle_oauth(oauth.provider, oauth_token)
  File "/aleph/aleph/oauth.py", line 100, in handle_oauth
    role.clear_roles()
  File "/aleph/aleph/model/role.py", line 109, in clear_roles
    db.session.flush()
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/scoping.py", line 897, in flush
    return self._proxied.flush(objects=objects)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/session.py", line 4179, in flush
    self._flush(objects)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/session.py", line 4315, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/session.py", line 4275, in _flush
    flush_context.execute()
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/unitofwork.py", line 466, in execute
    rec.execute(self)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/unitofwork.py", line 591, in execute
    self.dependency_processor.process_saves(uow, states)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/dependency.py", line 1195, in process_saves
    self._run_crud(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/dependency.py", line 1220, in _run_crud
    raise exc.StaleDataError(
sqlalchemy.orm.exc.StaleDataError: DELETE statement on table 'role_membership' expected to delete 1 row(s); Only 2 were matched.

To Reproduce Steps to reproduce the behavior:

  1. Set up a local Keycloak instance for development as explained in this guide.
  2. Create a user and add the user to a user group as explained in the second section of the group.
  3. Open Aleph and sign in using OAuth.
  4. You should be able to sign in successfully.
  5. When inspecting the contents of the role_membership table, you should see one row representing the group membership set up in step 2.
  6. Manually add another duplicate row to the table.
  7. Sign out of Aleph and sign in again using OAuth.
  8. You should see an internal server error. The API logs should contain the error from above.

Aleph version 3.17.0 and 4.0.0-rc33

Additional context

stchris commented 2 months ago

SQL query hint:

select group_id, member_id, count(*) from role_membership group by group_id, member_id having count(*) > 1

It looks like a UNIQUE constraint on (group_id, member_id) would avoid these situations (but I'm not sure of more underlying issues).

tillprochaska commented 1 month ago

For future reference: While looking a little closer at logs when testing a fix for this, I noticed that in one case, there were two overlapping OAuth requests happening at the same time, processed by different pods. Both requests had a separate auth code and state, i.e. both requests were part of separate OAuth flows.

In order to sync groups, Aleph first clears all current group memberships, then inserts new group memberships based on the data in the OAuth token retrieved from the ID provider. However, Aleph does this in a single transaction, so I wouldn’t expect this to be the issue.

tillprochaska commented 1 month ago

For future reference: While looking a little closer at logs when testing a fix for this, I noticed that in one case, there were two overlapping OAuth requests happening at the same time, processed by different pods. Both requests had a separate auth code and state, i.e. both requests were part of separate OAuth flows.

Some additional context:

I think the issue is that while deleting and inserting the role memberships is part of a single transaction, determining which role memberships to delete is not. When clearing role memberships for a given user, SQLAlchemy doesn’t construct a query like this:

DELETE FROM role_membership WHERE member_id = 123

Instead, it deletes specific rows based on data read before:

DELETE FROM role_membership WHERE member_id = 123 AND group_id = 456
DELETE FROM role_membership WHERE member_id = 123 AND group_id = 789

If a user is added to a new group, the following race condition can occur when two overlapping OAuth requests are handled:

t Request 1 Request 2
1 Begin transaction
2 Fetch role memberships for user (Group A, Group B)
3 End transaction
4 Begin transaction
5 Fetch role memberships for user (Group A, Group B)
6 End transaction
7 Begin transaction
8 Delete role memberships for user (Group A, Group B)
9 Insert new role memberships (Group A, Group B, Group C)
10 End transaction
11 Begin transaction
12 Delete role memberships for user (Group A, Group B)
13 Insert new role memberships (Group A, Group B, Group C)
14 End transaction

As a result, the role membership for Group C is stored twice.