cookiecutter / cookiecutter-django

Cookiecutter Django is a framework for jumpstarting production-ready Django projects quickly.
https://cookiecutter-django.readthedocs.io
BSD 3-Clause "New" or "Revised" License

PostgreSQL - connection failed: FATAL: remaining connection slots are reserved for non-replication superuser connections #5071

Closed agusmakmun closed 2 months ago

agusmakmun commented 3 months ago

What happened?

I'm getting this PostgreSQL error after upgrading Cookiecutter Django for my project.

connection failed: FATAL: remaining connection slots are reserved for non-replication superuser connections

My project uses many Celery tasks and websocket connections, and is deployed on k8s.

Similar issues are mentioned on Stack Overflow:

One of those threads mentions this package, but I'm not sure whether it actually works: https://github.com/altairbow/django-db-connection-pool

Details

(env-my-labs) ➜  env-my-labs cookiecutter https://github.com/cookiecutter/cookiecutter-django
  [1/27] project_name (My Awesome Project): My Project Labs
  [2/27] project_slug (practical_devsecops_cloud_labs):
  [3/27] description (Behold My Awesome Project!): My Project Labs
  [4/27] author_name (Daniel Roy Greenfeld): My Project Developer
  [5/27] domain_name (example.com): my-domain.com
  [6/27] email (practical-devsecops-developer@my-domain.com): info@my-domain.com
  [7/27] version (0.1.0): 3.7.0
  [8/27] Select open_source_license
    1 - MIT
    2 - BSD
    3 - GPLv3
    4 - Apache Software License 2.0
    5 - Not open source
    Choose from [1/2/3/4/5] (1): 5
  [9/27] Select username_type
    1 - username
    2 - email
    Choose from [1/2] (1): 1
  [10/27] timezone (UTC):
  [11/27] windows (n): n
  [12/27] Select editor
    1 - None
    2 - PyCharm
    3 - VS Code
    Choose from [1/2/3] (1): 3
  [13/27] use_docker (n): y
  [14/27] Select postgresql_version
    1 - 16
    2 - 15
    3 - 14
    4 - 13
    5 - 12
    Choose from [1/2/3/4/5] (1): 5
  [15/27] Select cloud_provider
    1 - AWS
    2 - GCP
    3 - Azure
    4 - None
    Choose from [1/2/3/4] (1): 1
  [16/27] Select mail_service
    1 - Mailgun
    2 - Amazon SES
    3 - Mailjet
    4 - Mandrill
    5 - Postmark
    6 - Sendgrid
    7 - SendinBlue
    8 - SparkPost
    9 - Other SMTP
    Choose from [1/2/3/4/5/6/7/8/9] (1): 5
  [17/27] use_async (n): y
  [18/27] use_drf (n): y
  [19/27] Select frontend_pipeline
    1 - None
    2 - Django Compressor
    3 - Gulp
    4 - Webpack
    Choose from [1/2/3/4] (1): 1
  [20/27] use_celery (n): y
  [21/27] use_mailpit (n): y
  [22/27] use_sentry (n): y
  [23/27] use_whitenoise (n): n
  [24/27] use_heroku (n): n
  [25/27] Select ci_tool
    1 - None
    2 - Travis
    3 - Gitlab
    4 - Github
    5 - Drone
    Choose from [1/2/3/4/5] (1): 3
  [26/27] keep_local_envs_in_vcs (y): y
  [27/27] debug (n): n
 [SUCCESS]: Project initialized, keep up the good work!

foarsitter commented 3 months ago

This is a PostgreSQL configuration issue and is not related to this project.

agusmakmun commented 3 months ago

But the problem is that this wasn't happening on Django 3.2; it only appeared after migrating to Django 4. Is there any difference between the versions? I checked CONN_MAX_AGE and it uses the same 60s value.

foarsitter commented 3 months ago

If that is the problem, why don't you report it as such?

agusmakmun commented 3 months ago

I'm using Django Channels for the websocket connections, and upgraded to these packages:

django==4.2.11  # pyup: < 5.0  # https://www.djangoproject.com/

# Django Async
channels==4.1.0  # https://github.com/django/channels/
channels-redis==4.2.0  # https://github.com/django/channels_redis/

# asgi_redis (async to sync)
asgiref==3.8.1  # https://github.com/django/asgiref/

max_connections is 100, the default for Postgres 12.
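
For reference, one quick way to confirm the live value from a Django shell (a generic sketch, nothing project-specific assumed):

```python
# e.g. inside `python manage.py shell`
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("SHOW max_connections;")
    print(cursor.fetchone()[0])  # "100" on a stock Postgres 12
```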

> Did you use async before Django 4?

I'm only using async for the websocket handling, plus Celery tasks to clean up the sandbox machines.

foarsitter commented 3 months ago

Increasing max_connections to 400 would be a simple solution.

agusmakmun commented 3 months ago
  1. Nope, I'm using a single application, but it's deployed using k8s.

  2. 12 workers, as I manually configured in compose/production/django/start:

    exec /usr/local/bin/gunicorn config.asgi --bind 0.0.0.0:5000 --chdir=/app -k uvicorn.workers.UvicornWorker --workers=${UVICORN_TOTAL_OF_WORKERS:=12}
  3. I'm not sure about that, but we're running a single celeryworker pod:

    kubectl --kubeconfig=.vscode/development/kubeconfig get pods
    NAME                           READY   STATUS    RESTARTS   AGE
    celerybeat-7dccfd8965-mbjt9    2/2     Running   0          4h31m
    celeryworker-fc7d59c8c-9bgcv   2/2     Running   0          4h31m
    django-5d54d9ddbf-dmpk6        2/2     Running   0          4h31m
    flower-78c4b8f74f-8h8hx        2/2     Running   0          4h31m
    next-849ff6c88b-95tgn          1/1     Running   0          4h31m
    websocket-768b6cd964-gj9j4     2/2     Running   0          4h31m
  4. Yup, we're using Uvicorn workers, as shown above.

agusmakmun commented 3 months ago

Let's say many users are using the websocket at the same time. Will increasing max_connections to 400 solve the problem?

foarsitter commented 3 months ago

How many workers does celery use? 12?

Do the math: you are using two Django pods, resulting in at least 24 connections. I assume the same for Celery, which makes at least 48 in total. You are also running beat twice, which will result in double scheduling. Why are you running flower twice? Do next and websocket read the database?

However, my guess is that the catch is in worker-connections. Each worker handles many HTTP connections, and each HTTP connection gets its own database connection. So with your 24 workers and the default worker-connections of 1000, you would need up to 24,000 database connections for Django alone. Here is some more info: https://stackoverflow.com/questions/63471960/gunicorn-uvicorn-worker-py-how-to-honor-limit-concurrency-setting
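
To make that arithmetic concrete, here is a back-of-envelope sketch using figures quoted in this thread (treating the 1000 worker-connections default as an assumption about the gunicorn/uvicorn setup):

```python
# Worst-case demand on Postgres, using numbers from this thread
django_pods = 2            # pods running the ASGI app (django + websocket)
workers_per_pod = 12       # UVICORN_TOTAL_OF_WORKERS from the start script
worker_connections = 1000  # assumed default per-worker concurrency limit
celery_concurrency = 8     # "max-concurrency" from `celery inspect stats`

worst_case = django_pods * workers_per_pod * worker_connections + celery_concurrency
print(worst_case)  # 24008 -- against a max_connections of 100
```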

agusmakmun commented 3 months ago

> How many workers does celery use? 12?

How do I check how many Celery workers there are?

This is what I've got so far:

kubectl --kubeconfig=kubeconfig exec --stdin --tty celeryworker-6dc48c545d-p879n -- /bin/bash

Defaulted container "celeryworker" out of: celeryworker, vault-agent, vault-agent-init (init)
django@celeryworker-6dc48c545d-p879n:/app$ source /entrypoint
PostgreSQL is available
django@celeryworker-6dc48c545d-p879n:/app$
django@celeryworker-6dc48c545d-p879n:/app$
django@celeryworker-6dc48c545d-p879n:/app$ celery -A config.celery_app status
->  celery@celeryworker-6dc48c545d-p879n: OK

1 node online.

And for more detail:

django@celeryworker-6dc48c545d-p879n:/app$ celery -A config.celery_app inspect stats
->  celery@celeryworker-6dc48c545d-p879n: OK
    {
        "broker": {
            "alternates": [],
            "connect_timeout": 4,
            "failover_strategy": "round-robin",
            "heartbeat": 120.0,
            "hostname": "redis-master.redis.svc.cluster.local.",
            "insist": false,
            "login_method": null,
            "port": 6379,
            "ssl": false,
            "transport": "redis",
            "transport_options": {},
            "uri_prefix": null,
            "userid": null,
            "virtual_host": "0"
        },
        "clock": "107892",
        "pid": 8,
        "pool": {
            "implementation": "celery.concurrency.prefork:TaskPool",
            "max-concurrency": 8,
            "max-tasks-per-child": "N/A",
            "processes": [
                16,
                17,
                18,
                19,
                20,
                21,
                22,
                23
            ],
            "put-guarded-by-semaphore": false,
            "timeouts": [
                300,
                600
            ],
            "writes": {
                "all": "0.67, 0.17, 0.17",
                "avg": "0.33",
                "inqueues": {
                    "active": 0,
                    "total": 8
                },
                "raw": "4, 1, 1",
                "strategy": "fair",
                "total": 6
            }
        },
        "prefetch_count": 32,
        "rusage": {
            "idrss": 0,
            "inblock": 40,
            "isrss": 0,
            "ixrss": 0,
            "majflt": 3,
            "maxrss": 166536,
            "minflt": 187490,
            "msgrcv": 0,
            "msgsnd": 0,
            "nivcsw": 833,
            "nsignals": 0,
            "nswap": 0,
            "nvcsw": 1597,
            "oublock": 56,
            "stime": 1.8547509999999998,
            "utime": 10.366434
        },
        "total": {
            "my_project.base.tasks.send_announcement_broadcast": 1,
            "my_project.webssh.tasks.cleanup_expired_durations": 1,
            "my_project.webssh.tasks.cleanup_machines": 3,
            "my_project.webssh.tasks.sync_cloud_drive_account": 1
        },
        "uptime": 533
    }

1 node online.

foarsitter commented 3 months ago

"max-concurrency": 8

agusmakmun commented 3 months ago

Is there anything I can do apart from increasing max_connections?

foarsitter commented 3 months ago

You have at least these 4 options as far as I can tell:

  1. Increase max_connections on the PostgreSQL side.
  2. Limit the concurrency of your workers, e.g. with a custom UvicornWorker.
  3. Use a connection pool such as django-db-connection-pool.
  4. Lower CONN_MAX_AGE so idle connections are closed sooner.

agusmakmun commented 3 months ago

Thank you so much @foarsitter, we will try your suggestions :+1:

foarsitter commented 3 months ago

If you resolve your issue by implementing a custom UvicornWorker we would like to receive a PR :)
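
For anyone who lands here first, a minimal sketch of such a worker (the module path, class name, and limit value are illustrative, not from this project; limit_concurrency is a standard uvicorn Config option):

```python
# config/workers.py (hypothetical module)
from uvicorn.workers import UvicornWorker


class LimitedUvicornWorker(UvicornWorker):
    # Merged into uvicorn's Config: caps in-flight requests per worker,
    # so one worker can no longer hold an unbounded number of DB connections.
    CONFIG_KWARGS = {**UvicornWorker.CONFIG_KWARGS, "limit_concurrency": 50}
```

You would then point gunicorn at it with `-k config.workers.LimitedUvicornWorker` instead of `-k uvicorn.workers.UvicornWorker`.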

agusmakmun commented 3 months ago

Sure, I'll be happy to do that. As a quick solution we will try increasing the Postgres max_connections to 1000 first.

agusmakmun commented 3 months ago

Finally my issue was solved. It happened because I was using both shared_task() and @celery_app.task(), which led to duplicate database connections. After rolling everything back to use @celery_app.task() only, the problem was resolved.

Note: I had changed it to @shared_task() because I was trying to follow users/tasks.py for non-executable cronjob tasks.

https://github.com/cookiecutter/cookiecutter-django/blob/beba4c177d6c0c40e51ec71fa4447966c10a4297/%7B%7Bcookiecutter.project_slug%7D%7D/%7B%7Bcookiecutter.project_slug%7D%7D/users/tasks.py#L1-L9

### Before

```python
# myapp/tasks.py
from celery import shared_task

from config import celery_app


@shared_task()
def task_1():
    ...


@celery_app.task()
def task_2_that_called_for_cronjob():
    ...
```

### After

```python
# myapp/tasks.py
from config import celery_app


@celery_app.task()
def task_1():
    ...


@celery_app.task()
def task_2_that_called_for_cronjob():
    ...
```
agusmakmun commented 3 months ago

The issue has appeared again.

foarsitter commented 3 months ago

shared_task is just another way of registering tasks, for situations where there is no app available yet. So that can't be the solution.
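
A minimal standalone illustration of that point (not from this project):

```python
from celery import Celery, shared_task

app = Celery("demo", broker="memory://")


@shared_task
def add(x, y):
    # Registered lazily against the current app; once the app finalizes,
    # it lands in app.tasks exactly as @app.task would have put it there.
    return x + y


app.finalize()
print(add.name in app.tasks)  # True
```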

agusmakmun commented 3 months ago

I found some related issues caused by the upgrade to Django 4 combined with sync_to_async, which I'm currently using to handle the async processing for the websocket.

  1. https://stackoverflow.com/questions/75848263/too-many-db-connections-django-4-x-asgi
  2. https://code.djangoproject.com/ticket/33497
  3. https://code.djangoproject.com/ticket/32889
  4. https://github.com/django/django/commit/36fa071d6ebd18a61c4d7f1b5c9d17106134bd44

> This may be due to Django 4.0 having per-request contexts for the thread sensitivity of sync_to_async() — See #32889. If so, that's kind-of a good thing, in that too many open resources is what you'd expect in async code, and up to now, we've not been hitting that, as we've essentially been running serially. — https://code.djangoproject.com/ticket/33497#comment:4
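
In other words (a sketch of the behavior those tickets describe, inside a configured Django project; the handler name is made up):

```python
from asgiref.sync import sync_to_async
from django.contrib.auth import get_user_model


async def on_websocket_message():
    # Under Django >= 4.0, thread-sensitive sync_to_async calls run in a
    # per-request context instead of one shared thread, so N concurrent
    # websocket handlers can each hold their own database connection.
    User = get_user_model()
    return await sync_to_async(User.objects.count)()
```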

foarsitter commented 3 months ago

So you set CONN_MAX_AGE to 0 and that solved the issue?

agusmakmun commented 3 months ago

Nope. I had changed it to 0, and the error still occurred.

agusmakmun commented 3 months ago

The issue was finally resolved after implementing django-db-connection-pool and setting CONN_MAX_AGE=0:

  1. settings.py
DATABASES = {"default": env.db("DATABASE_URL")}
DATABASES["default"]["ATOMIC_REQUESTS"] = True
# https://github.com/altairbow/django-db-connection-pool?tab=readme-ov-file#postgresql
DATABASES["default"]["ENGINE"] = "dj_db_conn_pool.backends.postgresql"
DATABASES["default"]["POOL_OPTIONS"] = {
    "POOL_SIZE": 10,
    "MAX_OVERFLOW": 10,
    "RECYCLE": 24 * 60 * 60,
}

# Set to 0 to disable long(er) living connections
# https://docs.djangoproject.com/en/dev/ref/settings/#std-setting-CONN_MAX_AGE
# https://github.com/cookiecutter/cookiecutter-django/issues/5071
# "connection failed: FATAL: remaining connection slots are reserved for non-replication superuser connections"  # noqa: ERA001
DATABASES["default"]["CONN_MAX_AGE"] = env.int("CONN_MAX_AGE", default=0)
  2. requirements.txt
psycopg[pool]==3.1.18  # https://github.com/psycopg/psycopg
django-db-connection-pool[psycopg3]==1.2.5  # https://github.com/altairbow/django-db-connection-pool

(screenshot attached: 2024-05-17 20:05)
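
Note that each of these pools is per process, so across the deployment the ceiling is still well above the default limit. A rough check with figures from earlier in this thread:

```python
pool_size, max_overflow = 10, 10  # POOL_OPTIONS above
processes = 2 * 12 + 8            # 2 ASGI pods x 12 workers + celery concurrency
print(processes * (pool_size + max_overflow))  # 640 -- still > max_connections=100
```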

agusmakmun commented 3 months ago

The next error that occurs is ProgrammingError: "the last operation didn't produce a result", even after changing the ORM call from .exists() to .count() > 0.

Related issue: https://github.com/psycopg/psycopg/discussions/417

foarsitter commented 3 months ago

@agusmakmun is CONN_MAX_AGE required when a pool is involved? That shouldn't be necessary, I suppose.

agusmakmun commented 3 months ago

For this case, it was resolved after downgrading the Python image from docker.io/python:3.12.3-slim-bookworm to python:3.10.10-slim-bullseye.

I also consulted ChatGPT with this query:

> For a database server with 2 cores and 2GB of memory, how many POOL_SIZE, MAX_OVERFLOW and RECYCLE. I need numbers only.

DATABASES["default"]["ENGINE"] = "dj_db_conn_pool.backends.postgresql"
DATABASES["default"]["POOL_OPTIONS"] = {
    "POOL_SIZE": 3,
    "MAX_OVERFLOW": 3,
    "RECYCLE": 1800,  # 30 minutes
}

DATABASES["default"]["CONN_MAX_AGE"] = env.int("CONN_MAX_AGE", default=0)

Since deploying to production, no issues so far. I'm not sure which fix is the correct one: decreasing the pool size to a proper value, or downgrading Python.

foarsitter commented 3 months ago

You can certainly remove CONN_MAX_AGE since it defaults to 0.

Saurav-Paul commented 3 months ago

> The issue was finally resolved after implementing django-db-connection-pool and setting CONN_MAX_AGE=0 [...]

Thank you so much, I was suffering from this issue.

agusmakmun commented 3 months ago

Guys, FYI, I'm somehow still facing this issue.

foarsitter commented 3 months ago

And you did increase max_connections to 1000?

agusmakmun commented 2 months ago

Guys, FYI, our issue was finally resolved after increasing PostgreSQL's max_connections from the default 100 to 300, changing to DATABASES["default"]["CONN_MAX_AGE"] = env.int("CONN_MAX_AGE", default=0), and removing the additional database pool setup in favor of the default backend.

In the end, my issue was caused by max_connections not being set properly.
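
For readers skimming to the end, a sketch of the final state described above (assembled from the comments, so treat it as an assumption rather than a verified config; `env` is the django-environ instance used in the snippets above):

```python
# settings.py -- back to Django's default database backend, no external pool
DATABASES = {"default": env.db("DATABASE_URL")}
DATABASES["default"]["ATOMIC_REQUESTS"] = True
DATABASES["default"]["CONN_MAX_AGE"] = env.int("CONN_MAX_AGE", default=0)

# PostgreSQL side (run as a superuser, then restart the server):
#   ALTER SYSTEM SET max_connections = 300;
```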