dhiaayachi / temporal

Temporal service
https://docs.temporal.io
MIT License
0 stars 0 forks source link

​ Version v1.20.1 had slow query in temporal_visibility #400

Open dhiaayachi opened 2 months ago

dhiaayachi commented 2 months ago

After upgrading temporal to version 1.20.1, I found that there was a slow log when querying temporal_visibility. After running for a while, "temporal-history" started to frequently report the error "context deadline exceeded", andMySQL started to report SlowLogs.

{"level":"error","ts":"2023-04-07T07:12:33.752Z","msg":"Operation failed with an error.","error":"context deadline exceeded","logging-call-at":"visiblity_manager_metrics.go:258","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:150\ngo.temporal.io/server/common/persistence/visibility.(*visibilityManagerMetrics).updateErrorMetric\n\t/home/builder/temporal/common/persistence/visibility/visiblity_manager_metrics.go:258\ngo.temporal.io/server/common/persistence/visibility.(*visibilityManagerMetrics).RecordWorkflowExecutionClosed\n\t/home/builder/temporal/common/persistence/visibility/visiblity_manager_metrics.go:102\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).recordCloseExecution\n\t/home/builder/temporal/service/history/visibilityQueueTaskExecutor.go:443\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).processCloseExecution\n\t/home/builder/temporal/service/history/visibilityQueueTaskExecutor.go:392\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).Execute\n\t/home/builder/temporal/service/history/visibilityQueueTaskExecutor.go:120\ngo.temporal.io/server/service/history/queues.(*executableImpl).Execute\n\t/home/builder/temporal/service/history/queues/executable.go:211\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:231\ngo.temporal.io/server/common/backoff.ThrottleRetry.func1\n\t/home/builder/temporal/common/backoff/retry.go:175\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:199\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:176\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:241\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:217"}

This SQL statement was obtained from “show processlist”.

SELECT ev.namespace_id, ev.run_id, ev.workflow_type_name, ev.workflow_id, ev.start_time, ev.execution_time, ev.status, ev.close_time, ev.history_length, ev.memo, ev.encoding, ev.task_queue, ev.search_attributes FROM executions_visibility ev LEFT JOIN custom_search_attributes USING (namespace_id, run_id) WHERE namespace_id = 'f458436c-e0b6-47d6-ac5d-a633ef7d84bb' AND TemporalNamespaceDivision is null ORDER BY coalesce(close_time, cast('9999-12-31 23:59:59' as datetime)) DESC, start_time DESC, run_id LIMIT 1000

There are approximately 3 million rows of data under this namespace.

Expected Behavior

​ Based on the index involvement in the table structure, MySQL should use the "default_idx" index. and this is the execution result after I forced the use of the "default_idx" index.

mysql> explain analyze SELECT ev.namespace_id, ev.run_id, ev.workflow_type_name, ev.workflow_id, ev.start_time, ev.execution_time, ev.status, ev.close_time, ev.history_length, ev.memo, ev.encoding, ev.task_queue, ev.search_attributes FROM executions_visibility ev USE INDEX(default_idx) LEFT JOIN custom_search_attributes USING (namespace_id, run_id) WHERE namespace_id = 'f458436c-e0b6-47d6-ac5d-a633ef7d84bb' AND TemporalNamespaceDivision is null ORDER BY coalesce(close_time, cast('9999-12-31 23:59:59' as datetime)) DESC, start_time DESC, run_id LIMIT 1000\G;
*************************** 1. row ***************************
EXPLAIN: -> Limit: 1000 row(s)  (cost=115154.45 rows=1000) (actual time=0.073..18.884 rows=1000 loops=1)
    -> Nested loop left join  (cost=115154.45 rows=154992) (actual time=0.072..18.796 rows=1000 loops=1)
        -> Filter: ((ev.namespace_id = 'f458436c-e0b6-47d6-ac5d-a633ef7d84bb') and (ev.TemporalNamespaceDivision is null))  (cost=60907.21 rows=154992) (actual time=0.057..12.063 rows=1000 loops=1)
            -> Index lookup on ev using default_idx (namespace_id='f458436c-e0b6-47d6-ac5d-a633ef7d84bb')  (cost=60907.21 rows=1549921) (actual time=0.054..11.499 rows=1000 loops=1)
        -> Filter: (custom_search_attributes.namespace_id = 'f458436c-e0b6-47d6-ac5d-a633ef7d84bb')  (cost=0.25 rows=1) (actual time=0.007..0.007 rows=0 loops=1000)
            -> Single-row covering index lookup on custom_search_attributes using PRIMARY (namespace_id='f458436c-e0b6-47d6-ac5d-a633ef7d84bb', run_id=ev.run_id)  (cost=0.25 rows=1) (actual time=0.006..0.006 rows=0 loops=1000)

1 row in set (0.02 sec)

mysql> explain SELECT ev.namespace_id, ev.run_id, ev.workflow_type_name, ev.workflow_id, ev.start_time, ev.execution_time, ev.status, ev.close_time, ev.history_length, ev.memo, ev.encoding, ev.task_queue, ev.search_attributes FROM executions_visibility ev USE INDEX(default_idx) LEFT JOIN custom_search_attributes USING (namespace_id, run_id) WHERE namespace_id = 'f458436c-e0b6-47d6-ac5d-a633ef7d84bb' AND TemporalNamespaceDivision is null ORDER BY coalesce(close_time, cast('9999-12-31 23:59:59' as datetime)) DESC, start_time DESC, run_id LIMIT 1000\G;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ev
   partitions: NULL
         type: ref
possible_keys: default_idx
          key: default_idx
      key_len: 256
          ref: const
         rows: 1549921
     filtered: 10.00
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: custom_search_attributes
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY,by_bool_01,by_bool_02,by_bool_03,by_datetime_01,by_datetime_02,by_datetime_03,by_double_01,by_double_02,by_double_03,by_int_01,by_int_02,by_int_03,by_keyword_01,by_keyword_02,by_keyword_03,by_keyword_04,by_keyword_05,by_keyword_06,by_keyword_07,by_keyword_08,by_keyword_09,by_keyword_10,by_keyword_list_01,by_keyword_list_02,by_keyword_list_03
          key: PRIMARY
      key_len: 512
          ref: const,temporal_visibility.ev.run_id
         rows: 1
     filtered: 100.00
        Extra: Using where; Using index
2 rows in set, 1 warning (0.01 sec)

Actual Behavior

MySQL did not use the "default_idx" index as I forced, instead, it used the PrimaryKey as index


mysql> explain SELECT ev.namespace_id, ev.run_id, ev.workflow_type_name, ev.workflow_id, ev.start_time, ev.execution_time, ev.status, ev.close_time, ev.history_length, ev.memo, ev.encoding, ev.task_queue, ev.search_attributes FROM executions_visibility ev LEFT JOIN custom_search_attributes USING (namespace_id, run_id) WHERE namespace_id = 'f458436c-e0b6-47d6-ac5d-a633ef7d84bb' AND TemporalNamespaceDivision is null ORDER BY coalesce(close_time, cast('9999-12-31 23:59:59' as datetime)) DESC, start_time DESC, run_id LIMIT 1000\G;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ev
   partitions: NULL
         type: ref
possible_keys: PRIMARY,default_idx,by_execution_time,by_workflow_id,by_workflow_type,by_status,by_history_length,by_task_queue,by_temporal_change_version,by_binary_checksums,by_batcher_user,by_temporal_scheduled_start_time,by_temporal_scheduled_by_id,by_temporal_schedule_paused,by_temporal_namespace_division
          key: PRIMARY
      key_len: 256
          ref: const
         rows: 1549921
     filtered: 10.00
        Extra: Using where; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: custom_search_attributes
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY,by_bool_01,by_bool_02,by_bool_03,by_datetime_01,by_datetime_02,by_datetime_03,by_double_01,by_double_02,by_double_03,by_int_01,by_int_02,by_int_03,by_keyword_01,by_keyword_02,by_keyword_03,by_keyword_04,by_keyword_05,by_keyword_06,by_keyword_07,by_keyword_08,by_keyword_09,by_keyword_10,by_keyword_list_01,by_keyword_list_02,by_keyword_list_03
          key: PRIMARY
      key_len: 512
          ref: const,temporal_visibility.ev.run_id
         rows: 1
     filtered: 100.00
        Extra: Using where; Using index
2 rows in set, 1 warning (0.01 sec)

mysql> explain analyze SELECT ev.namespace_id, ev.run_id, ev.workflow_type_name, ev.workflow_id, ev.start_time, ev.execution_time, ev.status, ev.close_time, ev.history_length, ev.memo, ev.encoding, ev.task_queue, ev.search_attributes FROM executions_visibility ev LEFT JOIN custom_search_attributes USING (namespace_id, run_id) WHERE namespace_id = 'f458436c-e0b6-47d6-ac5d-a633ef7d84bb' AND TemporalNamespaceDivision is null ORDER BY coalesce(close_time, cast('9999-12-31 23:59:59' as datetime)) DESC, start_time DESC, run_id LIMIT 1000\G;
*************************** 1. row ***************************
EXPLAIN: -> Limit: 1000 row(s)  (cost=210064.29 rows=1000) (actual time=20665.083..20675.554 rows=1000 loops=1)
    -> Nested loop left join  (cost=210064.29 rows=1549921) (actual time=20665.081..20675.468 rows=1000 loops=1)
        -> Sort: coalesce(close_time,cast(_utf8mb4'9999-12-31 23:59:59' as datetime)) DESC, ev.start_time DESC, ev.run_id  (cost=16324.17 rows=1549921) (actual time=20665.033..20665.540 rows=1000 loops=1)
            -> Filter: ((ev.namespace_id = 'f458436c-e0b6-47d6-ac5d-a633ef7d84bb') and (ev.TemporalNamespaceDivision is null))  (actual time=0.038..8314.438 rows=3205388 loops=1)
                -> Index lookup on ev using PRIMARY (namespace_id='f458436c-e0b6-47d6-ac5d-a633ef7d84bb')  (actual time=0.035..6668.868 rows=3205388 loops=1)
        -> Filter: (custom_search_attributes.namespace_id = 'f458436c-e0b6-47d6-ac5d-a633ef7d84bb')  (cost=0.25 rows=1) (actual time=0.010..0.010 rows=0 loops=1000)
            -> Single-row covering index lookup on custom_search_attributes using PRIMARY (namespace_id='f458436c-e0b6-47d6-ac5d-a633ef7d84bb', run_id=ev.run_id)  (cost=0.25 rows=1) (actual time=0.010..0.010 rows=0 loops=1000)

1 row in set (20.96 sec)

It took 20 seconds to execute the SQL query.

Steps to Reproduce the Problem

  1. Upgrade to v1.20.1
  2. About 300w rows in temporal_visibility
  3. Run temporal
  4. Execute show processlist in MySQL

Specifications