fief-dev / fief

Users and authentication management SaaS
https://www.fief.dev
Other
488 stars 42 forks source link

500 Internal Server Error while using Admin REST API #181

Open Kh-Oleg opened 1 year ago

Kh-Oleg commented 1 year ago

Describe the bug

Fief responds with "500 Internal Server Error" while using different Admin REST API from a script.

To Reproduce

Steps to reproduce the behavior:

  1. Start Fief instance with one default tenant, one default client, one Admin API Key
  2. Execute a script doing the following:
    • GET /tenants/ # Get default tenant id
    • POST /permissions/ # Create permission1
    • POST /permissions/ # Create permission2
    • POST /users/ # Create user1
    • POST /users/ # Create user2
    • POST /users/{user_id1}/permissions # Assign permission1 to user1
    • POST /users/{user_id2}/permissions # Assign permission2 to user2 Assuming that requests have all needed and correct POST payloads, headers and Authorization Bearer.

Expected behavior

No "500 Internal server error" responses from Fief.

Configuration

self-hosted, Fief version: 0.24.2

Additional context

The error is non-deterministic, it's not related to any particular request, any request can return 500. When I added 1 second delay between requests, everything went OK. You can see the successful log below.

xxx@xxx:/opt/src# fief run-server --port 8001 2>&1
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
Main Fief workspace already exists
Main Fief user already exists
2023-05-02 15:08:24.725 | INFO     | fief.lifespan:lifespan:31 - Fief Server started - {"version": "0.24.2"}
INFO:     127.0.0.1:50620 - "GET /tenants/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:50630 - "POST /permissions/ HTTP/1.1" 201 Created
2023-05-02 15:08:31.188 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "trigger_webhooks"}
2023-05-02 15:08:31.274 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "trigger_webhooks"}
INFO:     127.0.0.1:50632 - "POST /permissions/ HTTP/1.1" 201 Created
2023-05-02 15:08:31.636 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "trigger_webhooks"}
2023-05-02 15:08:31.647 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "trigger_webhooks"}
INFO:     127.0.0.1:50644 - "POST /users/ HTTP/1.1" 201 Created
2023-05-02 15:08:32.796 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "on_after_register"}
2023-05-02 15:08:32.871 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "on_after_register"}
2023-05-02 15:08:32.877 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "trigger_webhooks"}
2023-05-02 15:08:32.889 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "trigger_webhooks"}
INFO:     127.0.0.1:50650 - "POST /users/ HTTP/1.1" 201 Created
2023-05-02 15:08:33.847 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "on_after_register"}
2023-05-02 15:08:33.900 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "on_after_register"}
2023-05-02 15:08:33.921 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "trigger_webhooks"}
2023-05-02 15:08:33.936 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "trigger_webhooks"}
INFO:     127.0.0.1:50662 - "POST /users/40284c05-9b18-431d-a2ef-b9f1707d76d3/permissions HTTP/1.1" 201 Created
2023-05-02 15:08:34.871 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "trigger_webhooks"}
2023-05-02 15:08:34.886 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "trigger_webhooks"}
INFO:     127.0.0.1:50668 - "POST /users/90e4cecd-7898-43a3-84d6-feacdeebaab6/permissions HTTP/1.1" 201 Created
2023-05-02 15:08:35.884 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "trigger_webhooks"}
2023-05-02 15:08:35.912 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "trigger_webhooks"}
^C

Without delays, however, I could not execute my script fully, not getting 500 at some of the requests. The log in this case looks like this:

xxx@xxx:/opt/src# fief run-server --port 8001 2>&1
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
Main Fief workspace already exists
Main Fief user already exists
2023-05-02 15:12:02.562 | INFO     | fief.lifespan:lifespan:31 - Fief Server started - {"version": "0.24.2"}
INFO:     127.0.0.1:60742 - "GET /tenants/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:52424 - "POST /permissions/ HTTP/1.1" 201 Created
INFO:     127.0.0.1:52434 - "POST /permissions/ HTTP/1.1" 201 Created
INFO:     127.0.0.1:52450 - "POST /users/ HTTP/1.1" 201 Created
2023-05-02 15:12:08.245 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "trigger_webhooks"}
INFO:     127.0.0.1:52456 - "POST /users/ HTTP/1.1" 500 Internal Server Error
2023-05-02 15:12:08.344 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "trigger_webhooks"}
2023-05-02 15:12:08.351 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "trigger_webhooks"}
2023-05-02 15:12:08.381 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "trigger_webhooks"}
2023-05-02 15:12:08.402 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "trigger_webhooks"}
2023-05-02 15:12:08.437 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "trigger_webhooks"}
2023-05-02 15:12:08.443 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "on_after_register"}
2023-05-02 15:12:08.519 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "on_after_register"}
^C

xxx@xxx:/opt/src# fief run-server --port 8001 2>&1
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
Main Fief workspace already exists
Main Fief user already exists
2023-05-02 15:36:08.331 | INFO     | fief.lifespan:lifespan:31 - Fief Server started - {"version": "0.24.2"}
INFO:     127.0.0.1:44862 - "GET /tenants/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:44870 - "POST /permissions/ HTTP/1.1" 201 Created
INFO:     127.0.0.1:44874 - "POST /permissions/ HTTP/1.1" 500 Internal Server Error
2023-05-02 15:36:12.437 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "trigger_webhooks"}
2023-05-02 15:36:12.511 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "trigger_webhooks"}
^C

It looks like, there is a race related to webhooks somewhere.

frankie567 commented 1 year ago

That's strange indeed.

Would it be possible for you to share the full script?

Meanwhile, a few things to try, would be glad to see the effect on your issue:

  1. Set the environment variable WEB_CONCURRENCY to 1
  2. Set the environment variable TELEMETRY_ENABLED to False
Kh-Oleg commented 1 year ago

TELEMETRY_ENABLED was set to 0 before. I tried also with WEB_CONCURRENCY=1 - nothing has changed. My real script does more, but I could reproduce the bug with a smaller one:

import requests
import json
import time

fief_url = 'http://localhost:8001/admin/api'

# Assuming Admin API Key exists in Fief's database
API_KEY='*************************************'

def post(route, data):
    # time.sleep(1) # Required, since Fief 0.24.2 has a race condition somewhere
    headers = {"Content-Type": "application/json", 
               "accept": "application/json",
               "Authorization": f"Bearer {API_KEY}"}

    response = requests.post(f"{fief_url}/{route}", headers=headers, data=json.dumps(data), timeout=0.5)
    if response.status_code >= 400:
        print(f"Error! data: {json.dumps(data)}")
        raise Exception(response.content)
    return response.json()

def create_permission(data):
    response = post('permissions/', data)
    return response['id']

# Create permissions
for i in range(1, 10):
    permission_id = create_permission({'name': f"Create Castle{i}", 'codename': f"createCastle{i}"})
    print(f"permission created, id: {permission_id}")

Fief log output:

INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
Main Fief workspace already exists
Main Fief user already exists
2023-05-02 19:46:15.955 | INFO     | fief.lifespan:lifespan:31 - Fief Server started - {"version": "0.24.2"}
xxx@xxx:/opt/src# INFO:     127.0.0.1:40426 - "POST /permissions/ HTTP/1.1" 201 Created
INFO:     127.0.0.1:40436 - "POST /permissions/ HTTP/1.1" 201 Created
INFO:     127.0.0.1:40450 - "POST /permissions/ HTTP/1.1" 201 Created
INFO:     127.0.0.1:40462 - "POST /permissions/ HTTP/1.1" 201 Created
INFO:     127.0.0.1:40476 - "POST /permissions/ HTTP/1.1" 500 Internal Server Error
2023-05-02 19:46:43.756 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "trigger_webhooks"}
2023-05-02 19:46:43.825 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "trigger_webhooks"}
2023-05-02 19:46:43.831 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "trigger_webhooks"}
2023-05-02 19:46:43.844 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "trigger_webhooks"}
2023-05-02 19:46:43.878 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "trigger_webhooks"}
2023-05-02 19:46:43.903 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "trigger_webhooks"}
2023-05-02 19:46:43.909 | INFO     | fief.tasks.base:__call__:117 - Start task - {"task": "trigger_webhooks"}
2023-05-02 19:46:43.937 | INFO     | fief.tasks.base:__call__:119 - Done task - {"task": "trigger_webhooks"}

I see, that each time, when the problem doesn't occur, both 'Start task' and 'Done task' are executed before the next request arrives. The opposite is also true: when the 500 occurs, web hooks appear in the log after one or more following requests. This makes me thinking that there is, probably, some kind of deferred web hook trigger, which doesn't handle properly the case, when many requests arrive one after another.

500 most likely means some exception inside Fief, but log output doesn't show it. I would capture the Exception stack trace as well, if you'll tell me how to enable it.

frankie567 commented 1 year ago

I was able to reproduce it (especially when creating users, not when creating permissions). Here is the error I got:

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked

It seems there is a lock on the SQLite database, probably because several processes try to access it concurrently. I'll try to see what can be done about it.

frankie567 commented 1 year ago

I was able to track down the problem: it comes from the AuditLogger sink, which makes requests to the DB in a background thread.

My intuition is that, sometimes, when a subsequent request comes to create a user/permission, the AuditLog from the previous request is currently being written to the DB, which raises the database lock error.

I'll see how to mitigate this. Maybe by deferring the AuditLogger completely in the background worker.

Kh-Oleg commented 1 year ago

The bug still exists, but is only less frequent now.

I'd suggest reopening the bug and investigating it further, GitHub doesn't allow me to reopen it.

matteius commented 7 months ago

I am getting a similar error I think just trying to login to admin now -- I was recently able to login though, still trying to debug it at the moment.