inviscid opened this issue 1 month ago
When a query pod is terminated unexpectedly, the SQL client receives specific error messages such as:
{"error":{"code":400,"message":"query id QID canceled"}}
{ "code": 1043, "message": "canceled by client" }
{ "code": 2006, "message": "MySQL server has gone away" }
{ "code": 2013, "message": "Lost connection to MySQL server during query" }
[1][2].The query cluster handles the loss of a query pod without causing errors to the SQL client by using a heartbeat mechanism and dynamically discovering and managing nodes. The ClusterDiscovery
struct and its methods play a crucial role in this process. Here is how it works:
Heartbeat Mechanism: The ClusterHeartbeat struct sends periodic heartbeats to signal that the node is still active. If a node fails to send a heartbeat, it can be considered lost and removed from the cluster.
Dynamic Node Discovery: The ClusterDiscovery struct has methods like discover and drop_invalid_nodes to dynamically discover active nodes and remove invalid ones, keeping the cluster aware of the current state of its nodes.
Graceful Shutdown: The unregister_to_metastore method ensures that a node is properly unregistered from the metastore during shutdown, preventing stale entries [3].
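Dosu names these structs and methods but does not show their code. As a rough, hypothetical illustration of the heartbeat-plus-expiry idea only (not Databend's actual implementation, which registers nodes through the meta-service), a simplified in-memory version might look like this:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Simplified, in-memory stand-in for the metastore-backed node registry.
/// This only illustrates the heartbeat/expiry idea; names mirror the ones
/// Dosu mentions, not the real Databend types.
struct ClusterDiscovery {
    ttl: Duration,
    nodes: HashMap<String, Instant>, // node id -> time of last heartbeat
}

impl ClusterDiscovery {
    fn new(ttl: Duration) -> Self {
        Self { ttl, nodes: HashMap::new() }
    }

    /// Called periodically by each node's heartbeat task.
    fn heartbeat(&mut self, node_id: &str) {
        self.nodes.insert(node_id.to_string(), Instant::now());
    }

    /// Drop nodes whose last heartbeat is older than the TTL.
    fn drop_invalid_nodes(&mut self) {
        let ttl = self.ttl;
        self.nodes.retain(|_, last| last.elapsed() < ttl);
    }

    /// Return the ids of the nodes currently considered alive.
    fn discover(&mut self) -> Vec<String> {
        self.drop_invalid_nodes();
        self.nodes.keys().cloned().collect()
    }

    /// Graceful shutdown: remove this node immediately instead of
    /// waiting for its heartbeat to expire.
    fn unregister(&mut self, node_id: &str) {
        self.nodes.remove(node_id);
    }
}

fn main() {
    let mut cluster = ClusterDiscovery::new(Duration::from_secs(60));
    cluster.heartbeat("query-node-1");
    cluster.heartbeat("query-node-2");

    // query-node-2 is terminated by the autoscaler and unregisters itself.
    cluster.unregister("query-node-2");

    // Only the surviving node is returned to the scheduler.
    assert_eq!(cluster.discover(), vec!["query-node-1".to_string()]);
}
```

The interaction to note is between the heartbeat, drop_invalid_nodes, and the unregister step: a node that shuts down cleanly disappears from discovery immediately, while a node that is killed abruptly only disappears after its heartbeat expires, and queries running on it during that window are what fail with the errors quoted above.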
However, there is no specific mechanism mentioned for rolling back and retrying transactions when a node is lost. This might be an area for improvement or a feature request [3].
Additionally, there are configuration options available to control the behavior of the cluster during auto-scaling events, such as auto_compaction_imperfect_blocks_threshold and auto_compaction_segments_limit [4][5].
This case was anticipated. We kindly request that the application handle the retry process.
Is it possible to handle this at the driver level, i.e. in bendSQL?
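For reference, below is a minimal sketch of the application-level retry the maintainers suggest, which is also roughly what a driver-level retry inside bendSQL would have to wrap. It is written against a plain closure rather than the real bendsql or databend-driver API; the QueryError type is invented for illustration, and the retryable codes are simply the ones quoted earlier in this thread. Blindly replaying is only safe for idempotent statements or when the whole transaction is restarted from the beginning:

```rust
use std::{thread, time::Duration};

/// Error codes observed in this issue when a query pod disappears.
const RETRYABLE_CODES: &[u32] = &[1043, 2006, 2013];

#[derive(Debug)]
struct QueryError {
    code: u32,
    message: String,
}

/// Retry `run_query` when the failure looks like a lost query node;
/// any other error is returned to the caller unchanged.
fn run_with_retry<T>(
    max_attempts: u32,
    mut run_query: impl FnMut() -> Result<T, QueryError>,
) -> Result<T, QueryError> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        match run_query() {
            Ok(v) => return Ok(v),
            Err(e) if attempt < max_attempts && RETRYABLE_CODES.contains(&e.code) => {
                // Back off briefly, then replay the query via the closure.
                eprintln!("attempt {attempt} failed ({}); retrying", e.message);
                thread::sleep(Duration::from_millis(200 * attempt as u64));
            }
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Simulate a pod being scaled in on the first attempt.
    let mut calls = 0;
    let result = run_with_retry(3, || {
        calls += 1;
        if calls == 1 {
            Err(QueryError {
                code: 2013,
                message: "Lost connection to MySQL server during query".to_string(),
            })
        } else {
            Ok("query result")
        }
    });
    assert_eq!(result.unwrap(), "query result");
}
```

A driver-level implementation would additionally need to re-establish the connection before replaying, which is part of why handling the failover inside the cluster itself would be more transparent to clients.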
Search before asking
Version
v1.2.635-nightly
What's Wrong?
This is probably a combination of bug and feature request. The problem is that the loss of a query pod/machine while it is running a query causes an error to be raised to the SQL client. This is particularly prevalent when a cluster is auto-scaled up to handle load and then begins scaling back in once the load subsides. As pods are shut down, any client running a query on one of them receives a broken-connection error.
Since one of the important reasons to use a cluster rather than larger machines is resilience to the loss of compute resources, it seems the query cluster should be able to handle the loss of a cluster member by routing the query to another available member, transparently to the external client. I think the expectation is that the previous transaction would be rolled back and a new transaction started.
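To make that expectation concrete, here is a purely conceptual sketch of the requested behavior; the QueryNode trait and ExecError type are invented for illustration and this is not how Databend currently works. The idea is that the coordinator catches the "node lost" failure, rolls back, and replays the statement on another live member instead of surfacing a broken connection to the client:

```rust
// Conceptual sketch only: shows the behavior requested in this issue,
// not Databend's current implementation.

#[derive(Debug)]
enum ExecError {
    NodeLost(String), // the pod executing the query disappeared (e.g. scale-in)
    Other(String),    // genuine query errors that should reach the client
}

trait QueryNode {
    fn id(&self) -> &str;
    fn execute(&self, sql: &str) -> Result<String, ExecError>;
}

/// Try each live node in turn; on NodeLost, move on to the next candidate
/// instead of returning a broken connection to the SQL client.
fn execute_with_failover(nodes: &[Box<dyn QueryNode>], sql: &str) -> Result<String, ExecError> {
    let mut last_err = ExecError::Other("no live query nodes".to_string());
    for node in nodes {
        match node.execute(sql) {
            Ok(rows) => return Ok(rows),
            Err(ExecError::NodeLost(id)) => {
                // A real coordinator would roll back the lost node's work here;
                // this sketch simply retries on the next candidate.
                eprintln!("node {id} lost mid-query; re-dispatching");
                last_err = ExecError::NodeLost(id);
            }
            Err(other) => return Err(other), // real query errors still propagate
        }
    }
    Err(last_err)
}

struct LostNode;
struct HealthyNode;

impl QueryNode for LostNode {
    fn id(&self) -> &str { "query-node-2" }
    fn execute(&self, _sql: &str) -> Result<String, ExecError> {
        Err(ExecError::NodeLost(self.id().to_string()))
    }
}

impl QueryNode for HealthyNode {
    fn id(&self) -> &str { "query-node-1" }
    fn execute(&self, _sql: &str) -> Result<String, ExecError> {
        Ok("42 rows".to_string())
    }
}

fn main() {
    let nodes: Vec<Box<dyn QueryNode>> = vec![Box::new(LostNode), Box::new(HealthyNode)];
    // The client only ever sees the successful result, not the lost pod.
    assert_eq!(execute_with_failover(&nodes, "SELECT count(*) FROM t").unwrap(), "42 rows");
}
```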
How to Reproduce?
Are you willing to submit PR?