Open dhiaayachi opened 3 weeks ago
You're seeing schema update jobs fail in production environments with heavy workloads and large databases. The CREATE INDEX statements in your SQL scripts block writes to the table while the index is being built, so on busy tables they hold up other queries long enough to cause deadlocks or timeouts.
To resolve this, we can use PostgreSQL's CREATE INDEX CONCURRENTLY statement, which builds the index without blocking concurrent writes to the table.
Here's how to implement the solution:
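Update SQL Scripts: Modify your SQL scripts to use CREATE INDEX CONCURRENTLY instead of CREATE INDEX. A concurrent build does not block inserts, updates, or deletes on the table, which reduces the chance of deadlocks. Note that it cannot run inside a transaction block and usually takes longer than a regular build.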
For example, in schema/postgresql/v12/visibility/versioned/v1.6/fix_root_workflow_info.sql, you would change:
CREATE INDEX by_root_workflow_id ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
CREATE INDEX by_root_run_id ON executions_visibility (namespace_id, root_run_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
to:
CREATE INDEX CONCURRENTLY by_root_workflow_id ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
CREATE INDEX CONCURRENTLY by_root_run_id ON executions_visibility (namespace_id, root_run_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
Update Your Deployment Process: Ensure that the schema update jobs in your deployment process are using these updated SQL scripts.
Test Thoroughly: Once the changes are in place, perform comprehensive tests in your staging or testing environments to verify that schema updates work correctly without causing deadlocks.
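One caveat worth testing for: if a concurrent build is interrupted (for example, terminated by a query killer), PostgreSQL leaves an invalid index behind, and a later run with the same index name will fail or be skipped. The following is a minimal sketch of how such a leftover could be detected and rebuilt by hand. It assumes the statements are run outside a transaction block (for example, from psql with autocommit); whether Temporal's schema tool does so is not confirmed here.

```sql
-- List indexes on the visibility table that were left INVALID
-- by an interrupted concurrent build.
SELECT c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'executions_visibility'::regclass
  AND NOT i.indisvalid;

-- Only if the query above lists the index: drop the leftover
-- without blocking writes, then rebuild it.
-- Neither statement may run inside a transaction block.
DROP INDEX CONCURRENTLY IF EXISTS by_root_workflow_id;
CREATE INDEX CONCURRENTLY IF NOT EXISTS by_root_workflow_id ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
```

IF NOT EXISTS alone is not enough here, because an invalid index still counts as existing; that is why the sketch drops it first.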
Additional Tips
Consider alternative solutions: If the CREATE INDEX CONCURRENTLY statement doesn't fully resolve the issue, consider other approaches, such as:
Understand Temporal's Versioning: Review Temporal's versioning guide to ensure seamless updates in production environments.
Monitor and Analyze: Keep a close eye on your system's performance and the health of your Temporal Service.
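As a rough illustration of the monitoring tip, the queries below use PostgreSQL's built-in statistics views to watch an index build and to see which sessions are waiting on locks. This is a sketch for ad hoc inspection, not part of Temporal's tooling; the column selection is just one reasonable choice.

```sql
-- Progress of running index builds (view available in PostgreSQL 12 and later).
SELECT pid, phase, blocks_done, blocks_total, lockers_done, lockers_total
FROM pg_stat_progress_create_index;

-- Sessions currently waiting on a lock, with the PIDs that block them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```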
By implementing these steps and following best practices, you should be able to mitigate deadlocks during schema update jobs in your production Temporal instances. Remember to test thoroughly before deploying any changes to your production environment.
Thank you for reporting this issue!
We understand that schema updates can be challenging, especially in production environments with heavy workloads. Your proposed solution of using CREATE INDEX CONCURRENTLY is a great approach to minimize blocking and deadlocks during index creation.
We would be happy to help you test and implement this solution. Could you please provide the following information:
... CREATE INDEX CONCURRENTLY syntax.
If you are using Temporal Cloud, we may be able to assist you directly through our Support team.
We run schema update jobs during our deployment process. The command looks like the following:
Expected Behavior
In the testing/staging environment, where our Temporal instances don't have much workload and where the persistence and visibility databases are both relatively small, migrations run without any issues and with the desired effect:
Actual Behavior
For our production instances, however, which are quite active and where the databases are in constant use (~1 TB for persistence and ~120 GB for visibility), index creation fails constantly. Part of the equation is that we have a deadlock detection mechanism, and our PostgreSQL instances terminate certain blocking queries that run longer than a configured time limit. This is often the case when manipulating indices that reference busy tables:
The error executing statement: driver: bad connection error is just an indication that the query killer was engaged to terminate a query that blocked for longer than the allowed limit.
Specifications
Proposed Solution
The proposed solution is to create the indices concurrently, so that the CREATE INDEX requests are not blocking. For example, schema/postgresql/v12/visibility/versioned/v1.6/fix_root_workflow_info.sql would have the following changes: