
PostgreSQL schema update fails on busy Temporal instances #109

Open dhiaayachi opened 3 weeks ago

dhiaayachi commented 3 weeks ago

We run schema update jobs during our deployment process. The command looks like the following:

command: ["temporal-sql-tool", "--plugin", "postgres", "update-schema", "-d", "{{ $storeConfig.schemaPath }}"]

Expected Behavior

In the testing/staging environment, where our Temporal instances don't have much workload and where the persistence and visibility databases are both relatively small, migrations run without any issues and with the desired effect:

2024-07-11T13:38:03.670Z    INFO    UpdateSchemaTask started    {"config": {"DBName":"","TargetVersion":"","SchemaDir":"/etc/temporal/schema/postgresql/v12/visibility/versioned","SchemaName":"","IsDryRun":false}, "logging-call-at": "updatetask.go:104"}
2024-07-11T13:38:03.675Z    DEBUG   Schema Dirs: [v1.6] {"logging-call-at": "updatetask.go:212"}
2024-07-11T13:38:03.675Z    INFO    Processing schema file: v1.6/fix_root_workflow_info.sql {"logging-call-at": "updatetask.go:256"}
2024-07-11T13:38:03.675Z    DEBUG   running 1 updates for current version 1.5   {"logging-call-at": "updatetask.go:138"}
2024-07-11T13:38:03.675Z    DEBUG   ---- Executing updates for version 1.6 ---- {"logging-call-at": "updatetask.go:157"}
2024-07-11T13:38:03.675Z    DEBUG   DROP INDEX by_root_workflow_id; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.068Z    DEBUG   DROP INDEX by_root_run_id;  {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.794Z    DEBUG   ALTER TABLE executions_visibility DROP COLUMN root_workflow_id; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.798Z    DEBUG   ALTER TABLE executions_visibility DROP COLUMN root_run_id;  {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.799Z    DEBUG   ALTER TABLE executions_visibility ADD COLUMN root_workflow_id VARCHAR(255) NOT NULL DEFAULT ''; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.802Z    DEBUG   ALTER TABLE executions_visibility ADD COLUMN root_run_id VARCHAR(255) NOT NULL DEFAULT '';  {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.804Z    DEBUG   CREATE INDEX by_root_workflow_id ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);    {"logging-call-at": "updatetask.go:159"}
...
2024-07-11T13:38:49.703Z    INFO    UpdateSchemaTask done   {"logging-call-at": "updatetask.go:127"}

Actual Behavior

For our production instances, however, which are quite active and whose databases are in constant use (~1 TB for persistence and ~120 GB for visibility), index creation fails constantly. Part of the equation is that we have a deadlock detection mechanism, and our PostgreSQL instances terminate blocking queries that run over a configured time limit. This often happens when manipulating indices that reference busy tables:

2024-07-11T13:38:03.670Z    INFO    UpdateSchemaTask started    {"config": {"DBName":"","TargetVersion":"","SchemaDir":"/etc/temporal/schema/postgresql/v12/visibility/versioned","SchemaName":"","IsDryRun":false}, "logging-call-at": "updatetask.go:104"}
2024-07-11T13:38:03.675Z    DEBUG   Schema Dirs: [v1.6] {"logging-call-at": "updatetask.go:212"}
2024-07-11T13:38:03.675Z    INFO    Processing schema file: v1.6/fix_root_workflow_info.sql {"logging-call-at": "updatetask.go:256"}
2024-07-11T13:38:03.675Z    DEBUG   running 1 updates for current version 1.5   {"logging-call-at": "updatetask.go:138"}
2024-07-11T13:38:03.675Z    DEBUG   ---- Executing updates for version 1.6 ---- {"logging-call-at": "updatetask.go:157"}
2024-07-11T13:38:03.675Z    DEBUG   DROP INDEX by_root_workflow_id; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.068Z    DEBUG   DROP INDEX by_root_run_id;  {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.794Z    DEBUG   ALTER TABLE executions_visibility DROP COLUMN root_workflow_id; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.798Z    DEBUG   ALTER TABLE executions_visibility DROP COLUMN root_run_id;  {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.799Z    DEBUG   ALTER TABLE executions_visibility ADD COLUMN root_workflow_id VARCHAR(255) NOT NULL DEFAULT ''; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.802Z    DEBUG   ALTER TABLE executions_visibility ADD COLUMN root_run_id VARCHAR(255) NOT NULL DEFAULT '';  {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.804Z    DEBUG   CREATE INDEX by_root_workflow_id ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);    {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:41.793Z    ERROR   Unable to update SQL schema.    {"error": "error executing statement: driver: bad connection", "logging-call-at": "handler.go:78"}

The error executing statement: driver: bad connection message is simply an indication that the query killer was engaged to terminate a query that blocked for longer than the allowed limit.
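
For reference, something like the following sketch approximates our query killer; it is illustrative only (the 30-second threshold is a made-up value, and our actual mechanism runs externally):

-- Illustrative query-killer sketch: terminate any backend whose current
-- statement has been running for more than 30 seconds (threshold made up).
-- Terminating the backend drops its connection, which the Go driver then
-- surfaces as "driver: bad connection".
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '30 seconds'
  AND query ILIKE 'CREATE INDEX%';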


Proposed Solution

The proposed solution is to create the indices concurrently, so that the CREATE INDEX statements are not blocking. For example, schema/postgresql/v12/visibility/versioned/v1.6/fix_root_workflow_info.sql would have the following changes:

--- a/schema/postgresql/v12/visibility/versioned/v1.6/fix_root_workflow_info.sql
+++ b/schema/postgresql/v12/visibility/versioned/v1.6/fix_root_workflow_info.sql
 ALTER TABLE executions_visibility ADD COLUMN root_workflow_id VARCHAR(255) NOT NULL DEFAULT '';
 ALTER TABLE executions_visibility ADD COLUMN root_run_id      VARCHAR(255) NOT NULL DEFAULT '';
-CREATE INDEX by_root_workflow_id  ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
-CREATE INDEX by_root_run_id       ON executions_visibility (namespace_id, root_run_id,      (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
+CREATE INDEX CONCURRENTLY by_root_workflow_id  ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
+CREATE INDEX CONCURRENTLY by_root_run_id       ON executions_visibility (namespace_id, root_run_id,      (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
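
A caveat worth noting with this approach (a PostgreSQL fact, not something specific to Temporal): CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so the migration tool has to execute it as a standalone statement. Also, if a concurrent build is interrupted, the index is left behind marked INVALID and must be dropped before retrying. Leftovers can be detected with a query like:

-- List indexes left INVALID by an interrupted concurrent build.
SELECT c.relname AS invalid_index
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE NOT i.indisvalid;

-- Drop the leftover, then re-run the migration:
DROP INDEX IF EXISTS by_root_workflow_id;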
dhiaayachi commented 1 week ago

Solution for Temporal Schema Update Deadlock Issue

You're experiencing failing schema update jobs, particularly in production environments with heavy workloads and large databases. This is because the CREATE INDEX statements within your SQL scripts are blocking: against busy tables they queue behind in-flight traffic, hold locks that block writes, and eventually get terminated by your query killer.

To resolve this, we can leverage PostgreSQL's CREATE INDEX CONCURRENTLY statement, which builds the index without locking out concurrent writes to the table.

Here's how to implement the solution:

  1. Update SQL Scripts: Modify your SQL scripts to use CREATE INDEX CONCURRENTLY instead of CREATE INDEX. This allows index creation without locking the table, reducing the chance of deadlocks.

    For example, in schema/postgresql/v12/visibility/versioned/v1.6/fix_root_workflow_info.sql, you would change:

    CREATE INDEX by_root_workflow_id  ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
    CREATE INDEX by_root_run_id       ON executions_visibility (namespace_id, root_run_id,      (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);

    to:

    CREATE INDEX CONCURRENTLY by_root_workflow_id  ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
    CREATE INDEX CONCURRENTLY by_root_run_id       ON executions_visibility (namespace_id, root_run_id,      (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
  2. Update Your Deployment Process: Ensure that the schema update jobs in your deployment process are using these updated SQL scripts.

  3. Test Thoroughly: Once the changes are in place, perform comprehensive tests in your staging or testing environments to verify that schema updates complete correctly without blocking; a monitoring query is sketched after this list.
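
As a sketch for step 3: PostgreSQL 12 and later expose the pg_stat_progress_create_index view, which lets you watch a concurrent index build while it runs (an empty result means no build is in flight):

    -- Monitor an in-flight index build (PostgreSQL 12+).
    SELECT c.relname AS index_name, p.phase, p.blocks_done, p.blocks_total
    FROM pg_stat_progress_create_index p
    JOIN pg_class c ON c.oid = p.index_relid;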

Additional Tips

  1. CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so make sure the migration tool executes each statement standalone.
  2. Concurrent builds do more total work and take longer than regular index builds, so schedule migrations for low-traffic windows where possible.
  3. If a concurrent build is interrupted, drop the leftover INVALID index before retrying (see the detection query in the issue description above).

By implementing these steps and following best practices, you should be able to mitigate blocking-related failures during schema update jobs in your production Temporal instances. Remember to test thoroughly before deploying any changes to your production environment.

dhiaayachi commented 1 week ago

Thank you for reporting this issue!

We understand that schema updates can be challenging, especially in production environments with heavy workloads. Your proposed solution of using CREATE INDEX CONCURRENTLY is a great approach to minimize blocking and deadlocks during index creation.

We would be happy to help you test and implement this solution. Could you please provide the following information:

  1. What is the specific version of PostgreSQL you are using? This will help us ensure compatibility with the CREATE INDEX CONCURRENTLY syntax (a quick way to check is sketched below this list).
  2. Are you using a self-hosted Temporal Service or Temporal Cloud? This will help us determine the best way to assist you.
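
If you're unsure about the version, this query reports the full server version string:

    -- Report the PostgreSQL server version.
    SELECT version();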

If you are using Temporal Cloud, we may be able to assist you directly through our Support team.