obendidi opened this issue 3 months ago
Are you using a Postgres database backend? If so, how have you configured it, and do you see any errors in its logs?
And how are you starting your Prefect server?
I have had strange errors like this when the VM hosting both services is under heavy load from my flows.
When starting my Postgres server in an Apptainer container, I used:
```shell
APPTAINERENV_POSTGRES_PASSWORD="$POSTGRES_PASS" \
APPTAINERENV_POSTGRES_DB="$POSTGRES_DB" \
APPTAINERENV_PGDATA="$POSTGRES_SCRATCH/pgdata" \
apptainer run --cleanenv --bind "$POSTGRES_SCRATCH":/var postgres_latest.sif \
  -c max_connections=2096 \
  -c shared_buffers=8000MB \
  -c min_wal_size=8096 \
  -c max_wal_size=32384 \
  -c synchronous_commit=off \
  -c wal_buffers=16MB
```
Setting `-c synchronous_commit=off` was by far the biggest improvement to my Prefect server's stability.
I also found that setting these in the environment that runs the Prefect server helped:
```shell
export WEB_CONCURRENCY=32
export PREFECT_SQLALCHEMY_POOL_SIZE=30
export PREFECT_SQLALCHEMY_MAX_OVERFLOW=40
export PREFECT_API_DATABASE_TIMEOUT=60
export PREFECT_API_DATABASE_CONNECTION_TIMEOUT=60
#export PREFECT_SERVER_API_KEEPALIVE_TIMEOUT=15
```
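A quick way to confirm the server process actually picked these up is `prefect config view` (assuming the Prefect 2.x CLI; the grep is just to narrow the output):

```shell
# Show the settings the Prefect server will use, filtered to the
# database/pool values set above
prefect config view | grep -E "SQLALCHEMY|API_DATABASE"
```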
Thanks for the help,
I'm using a Postgres database (AWS RDS with Aurora Serverless, PG version 14 to be exact), and I'm using the default PG 14 configuration.
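For completeness, the server is pointed at RDS through the standard connection setting, roughly like this (the endpoint and credentials below are placeholders, not my real values):

```shell
# Point the Prefect server at the external Postgres instance;
# the endpoint and credentials here are placeholders
export PREFECT_API_DATABASE_CONNECTION_URL="postgresql+asyncpg://prefect:<password>@<aurora-cluster-endpoint>:5432/prefect"
```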
> Setting `-c synchronous_commit=off` was by far the biggest improvement to my Prefect server's stability.
From what I've read, AWS doesn't recommend disabling that (here), as it could potentially lead to losing transactions. Have you noticed any lost transactions when running with `synchronous_commit=off`?
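In the meantime, one option I'm considering is scoping that setting to just the Prefect database instead of the whole cluster, which should limit the blast radius. A minimal sketch, assuming the database is named `prefect`:

```shell
# Disable synchronous commits only for the Prefect database,
# leaving other databases on the cluster fully durable;
# host/user and the database name "prefect" are assumptions
psql -h "$POSTGRES_HOST" -U postgres \
  -c "ALTER DATABASE prefect SET synchronous_commit = off;"
```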
Thanks for the env vars, I'm doing more or less the same:
```dockerfile
ENV PREFECT_LOGGING_HANDLERS_CONSOLE_FORMATTER=json \
    PREFECT_LOGGING_EXTRA_LOGGERS=fcv \
    PREFECT_SQLALCHEMY_POOL_SIZE=100 \
    PREFECT_SQLALCHEMY_MAX_OVERFLOW=100 \
    PREFECT_API_DATABASE_TIMEOUT=60 \
    PREFECT_API_DATABASE_CONNECTION_TIMEOUT=60 \
    PREFECT_API_REQUEST_TIMEOUT=120
```
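One thing worth cross-checking with values like these: SQLAlchemy can open up to pool_size + max_overflow connections per server process, so with several web workers the total can exceed the database's max_connections. A quick comparison (a sketch; the endpoint and user are placeholders):

```shell
# Compare the configured connection ceiling against live usage;
# host/user/database are placeholders for your RDS endpoint
psql -h "$RDS_HOST" -U prefect -d prefect \
  -c "SHOW max_connections;" \
  -c "SELECT count(*) AS open_connections FROM pg_stat_activity;"
```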
Bug summary
Hello everyone,
I'm reporting a bug that I've noticed in our Prefect server that only happens under relatively high load (even though the container peaks at only 40% CPU and 30% RAM).
During high load the UI is empty, and in the network tab the API calls all return:
```
{"exception_message":"Service Unavailable"}
```
When checking the logs of the containers, I see errors like the ones below.
It probably fails on other queries too.
Any clue as to what I might be doing wrong here, or how I can mitigate these kinds of errors?
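For anyone trying to reproduce this, a simple way to catch the failures while load is high is to poll the server's health endpoint and log non-200 responses (this assumes the default local deployment on port 4200; adjust the URL for your setup):

```shell
# Poll the Prefect server health endpoint every 5 seconds and
# log any non-200 response, to correlate failures with load spikes
while true; do
  code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:4200/api/health)
  if [ "$code" != "200" ]; then
    echo "$(date -Is) health check returned HTTP $code"
  fi
  sleep 5
done
```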
Version info (`prefect version` output)

Additional context