Open marcocitus opened 1 year ago
The failure seems expected, but the error should definitely be part of the output of citus_rebalance_status. @hanefi can you check why it's not shown there?
@marcocitus Why does your output from citus_rebalance_status not contain any tasks with error status? Are you sure you ran it after these errors happened?
I have 0 context on how this feature works, but I assume it only shows running tasks. pg_dist_background_task does show the error.
In this scenario there are runnable tasks that retry up to 32 times. For example, here is such a task record from pg_dist_background_task:
+-[ RECORD 4 ]+-------------------------------------------------------------------------------------------------------------------------------+
| job_id | 2 |
| task_id | 3 |
| owner | hanefi |
| pid | (null) |
| status | runnable |
| command | SELECT pg_catalog.citus_move_shard_placement(102008,2,1,'auto') |
| retry_count | 9 |
| not_before | 2023-01-31 15:09:55.716904+03 |
| message | ERROR: Moving shards to a node that shouldn't have a shard is not supported +|
| | HINT: Allow shards on the target node via SELECT * FROM citus_set_node_property('localhost', 9701, 'shouldhaveshards', true);+|
| | CONTEXT: Citus Background Task Queue Executor: hanefi/hanefi for (2/3) +|
| | |
+-------------+-------------------------------------------------------------------------------------------------------------------------------+
We used to only report details on tasks in error state. After #6683, we will also report errors in the details jsonb column when there are runnable tasks with non-zero retry counts.
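A minimal sketch of the kind of catalog filter this corresponds to (column names taken from the record above; not necessarily the exact query used by the PR):

SELECT task_id, status, retry_count, message
FROM pg_dist_background_task
WHERE status = 'error'
   OR (status = 'runnable' AND retry_count > 0);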
Here is how we generate the report details after #6683:
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| jsonb_pretty |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| { +|
| "tasks": [ +|
| { +|
| "state": "runnable", +|
| "command": "SELECT pg_catalog.citus_move_shard_placement(102008,2,1,'auto')", +|
| "message": "ERROR: Moving shards to a node that shouldn't have a shard is not supported\nHINT: Allow shards on the target node via SELECT * FROM citus_set_node_property('localhost', 9701, 'shouldhaveshards', true);\nCONTEXT: Citus Background Task Queue Executor: hanefi/hanefi for (2/3)\n",+|
| "retried": 12, +|
| "task_id": 3 +|
| } +|
| ], +|
| "task_state_counts": { +|
| "blocked": 1, +|
| "runnable": 1 +|
| } +|
| } |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
(1 row)
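For completeness, a pretty-printed report like the one above can be produced with something along these lines (assuming details is the jsonb column returned by citus_rebalance_status()):

SELECT jsonb_pretty(details) FROM citus_rebalance_status();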
> I have 0 context on how this feature works, but I assume it only shows running tasks. pg_dist_background_task does show the error.
Let me clarify this. citus_rebalance_status() and citus_job_status() show details not only on running tasks, but also on tasks in error state. This particular problem happens when the task is stuck in the runnable state: every time we try to run it, we get an error message. If the user is lucky enough to run citus_rebalance_status() in the small time frame when the task is actually running, they can see the details. Similarly, they can see the details once the 32 retries are exhausted and the task ends up in error state.
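As a rough illustration (assuming citus_job_status() accepts the job_id shown in pg_dist_background_job and returns the same columns as citus_rebalance_status()), the details for a specific job can be checked like this:

SELECT jsonb_pretty(details) FROM citus_job_status(2);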
After changing the shouldhaveshards property of a node that was receiving shard moves, the moves kept failing.
According to citus_rebalance_status(), the rebalance remained stuck at the same number of tasks after at least 6 tries.
It would also be nice to reflect errors from pg_dist_background_task in the rebalance status.
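A hedged sketch of the sequence that appears to reproduce this (host and port are placeholders taken from the hint shown earlier):

SELECT citus_rebalance_start();                                               -- start a background rebalance
SELECT citus_set_node_property('localhost', 9701, 'shouldhaveshards', false); -- exclude a node that moves were targeting
SELECT * FROM citus_rebalance_status();                                        -- moves to that node now fail and keep retrying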