Open kgorszczyk opened 3 weeks ago
The good news is that 3.0 avoids a lot of this database traffic during execution, so flow run execution will be less impacted by DB performance on versions 3.0+.
That said, this is something we should be better about alerting on from within the server. We plan to focus on performance tuning in the next few months, and I will be sure to include this issue in the scope of that effort.
Bug summary

About two months ago, I successfully completed the migration to Prefect 2. Initially, all flow runs were logged very verbosely to ensure thorough error analysis. Afterward, all flows ran daily with mostly normal logging.
Two days ago, I noticed that after a new deployment, scheduled jobs started disappearing one after the other and were not being rescheduled despite an active schedule.
In the logs of the Prefect server container, I found the following traceback:
After reviewing the Postgres database, I discovered that the "task_run_state" and "flow_run_state" tables had grown to over 40GB in size.
As a test, I truncated both tables, and the scheduler was able to plan new jobs again and ran without any timeouts.
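For anyone wanting to reproduce the check before resorting to the same workaround, here is a minimal sketch of the decision logic. The table names come from this report; the 40 GB threshold and the choice to TRUNCATE (which discards all state history) are illustrative assumptions, not official Prefect guidance, and the sizes would in practice come from a query such as `pg_total_relation_size`:

```python
# Sketch: given table sizes in bytes, flag Prefect state tables that have
# grown past a threshold and emit the corresponding cleanup statements.
# The threshold and the TRUNCATE approach are illustrative assumptions.

GIB = 1024 ** 3


def oversized_tables(sizes: dict[str, int], limit_bytes: int = 40 * GIB) -> list[str]:
    """Return the names of tables whose size exceeds limit_bytes."""
    return [name for name, size in sizes.items() if size > limit_bytes]


def truncate_statements(tables: list[str]) -> list[str]:
    """Build TRUNCATE statements as a last-resort cleanup (destroys history!)."""
    return [f"TRUNCATE TABLE {table};" for table in tables]


# Example with sizes like those in this report (state tables over 40 GB):
sizes = {
    "task_run_state": 45 * GIB,
    "flow_run_state": 42 * GIB,
    "flow_run": 2 * GIB,
}
print(truncate_statements(oversized_tables(sizes)))
```

The helper names and the example sizes are hypothetical; the point is only that the two `*_state` tables were the outliers.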
I suspect that, given the fixed loop interval (5 seconds), Postgres on slower or heavily loaded systems may not return results in time, causing asyncio to hit a timeout. As a result, no new jobs are scheduled.
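The suspected failure mode can be simulated in a few lines: if the database call outlives the loop's deadline, `asyncio.wait_for` raises a timeout and that scheduler tick produces nothing. This is a sketch of the hypothesis, not Prefect's actual scheduler code; the 5-second interval is from this report, and `slow_query` is a stand-in for the real Postgres call:

```python
import asyncio

LOOP_INTERVAL = 5.0  # the scheduler's loop interval, per this report


async def slow_query(duration: float) -> list[str]:
    """Stand-in for a Postgres query; sleeps instead of hitting the DB."""
    await asyncio.sleep(duration)
    return ["scheduled-run-1"]


async def scheduler_tick(query_duration: float,
                         timeout: float = LOOP_INTERVAL) -> list[str]:
    """One loop iteration: give the query at most `timeout` seconds."""
    try:
        return await asyncio.wait_for(slow_query(query_duration), timeout=timeout)
    except asyncio.TimeoutError:
        # On an overloaded database the query outlives the deadline,
        # so no new runs get scheduled on this tick.
        return []


# Small values keep the demo fast; the ratio is what matters:
print(asyncio.run(scheduler_tick(0.01, timeout=0.5)))  # query finishes in time
print(asyncio.run(scheduler_tick(0.5, timeout=0.05)))  # query misses the deadline
```

Once every tick times out, the schedule stays active but no runs are ever created, which matches the "jobs disappearing and not being rescheduled" symptom above.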
Version info (`prefect version` output)

Additional context
Interestingly, the runtime of my MAIN_EXECUTION flow was reduced from nearly 3 hours to 1 hour after truncating the tables.