databendlabs / databend

Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com

bug: Query Error when one query pod is shutdown in cluster #16481

Open inviscid opened 1 month ago

inviscid commented 1 month ago


Version

v1.2.635-nightly

What's Wrong?

This is probably a combination of a bug and a feature request. The problem is that losing a query pod/machine while it is running a query causes an error to be raised to the SQL client. This is particularly prevalent when a cluster is auto-scaled to handle load and then begins scaling back in once the load subsides. As pods are shut down, any client running a query on one of them receives a broken-connection error.

Since one of the important reasons to use a cluster rather than larger machines is resilience to the loss of compute resources, the query cluster should be able to handle the loss of a member by routing the query to another available member, transparently to the external client. I think the expectation is that the previous transaction would be rolled back and a new transaction started.

How to Reproduce?

  1. Scale the query cluster to more than one member
  2. Begin running queries so that both members are answering query requests
  3. Terminate one of the query members without a graceful shutdown
  4. Observe the errors returned to SQL client for queries executing on the terminated member
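The failure mode in the steps above can be illustrated with a small self-contained simulation (no live cluster needed). The `Pod` class and round-robin dispatch below are hypothetical stand-ins for Databend query nodes behind a load balancer, not actual Databend internals:

```python
# Simulation of steps 3-4: two "query pods" share work round-robin;
# one is hard-killed mid-run and its in-flight queries surface
# broken-connection errors to the client.
class Pod:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def run_query(self, sql):
        if not self.alive:
            raise ConnectionError(f"Lost connection to {self.name} during query")
        return f"ok:{sql}"

pods = [Pod("query-0"), Pod("query-1")]
errors, results = [], []
for i in range(10):
    pod = pods[i % 2]          # round-robin, like a load balancer
    if i == 4:
        pods[1].alive = False  # step 3: terminate one member ungracefully
    try:
        results.append(pod.run_query(f"SELECT {i}"))
    except ConnectionError as e:
        errors.append(str(e))  # step 4: the client sees the error directly

print(len(results), len(errors))
```

Every query routed to the dead pod fails; nothing re-routes the work to the surviving member, which is exactly the behavior reported here.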

Are you willing to submit PR?

dosubot[bot] commented 1 month ago

When a query pod is terminated unexpectedly, the SQL client receives specific error messages such as:

  1. {"error":{"code":400,"message":"query id QID canceled"}}
  2. { "code": 1043, "message": "canceled by client" }
  3. { "code": 2006, "message": "MySQL server has gone away" }
  4. { "code": 2013, "message": "Lost connection to MySQL server during query" } [1][2].

The query cluster handles the loss of a query pod without causing errors to the SQL client by using a heartbeat mechanism and dynamically discovering and managing nodes. The ClusterDiscovery struct and its methods play a crucial role in this process. Here is how it works:

  1. Heartbeat Mechanism: The ClusterHeartbeat struct sends periodic heartbeats to ensure that the node is still active. If a node fails to send a heartbeat, it can be considered lost and removed from the cluster.

  2. Dynamic Node Discovery: The ClusterDiscovery struct has methods like discover and drop_invalid_nodes to dynamically discover active nodes and remove invalid ones, ensuring the cluster is always aware of the current state of its nodes.

  3. Graceful Shutdown: The unregister_to_metastore method ensures that a node is properly unregistered from the metastore during shutdown, preventing stale entries [3].
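The heartbeat-plus-discovery scheme in the three points above can be sketched as a TTL-based liveness check. All names here (`Registry`, `heartbeat`, `discover`) are illustrative, not Databend's actual `ClusterDiscovery`/`ClusterHeartbeat` API:

```python
TTL = 2.0  # seconds a node may go without a heartbeat before being dropped

class Registry:
    """Toy metastore: maps node name -> last heartbeat timestamp."""
    def __init__(self):
        self.nodes = {}

    def heartbeat(self, node, now):
        self.nodes[node] = now  # periodic ping, ClusterHeartbeat-style

    def discover(self, now):
        # drop_invalid_nodes: any node whose lease expired is considered lost
        self.nodes = {n: t for n, t in self.nodes.items() if now - t < TTL}
        return sorted(self.nodes)

reg = Registry()
reg.heartbeat("query-0", now=0.0)
reg.heartbeat("query-1", now=0.0)
reg.heartbeat("query-0", now=1.5)  # query-1 stops heartbeating (killed)
print(reg.discover(now=2.5))       # only query-0 survives the TTL check
```

Note that this only keeps the membership view consistent; as the next paragraph says, it does nothing for queries that were already in flight on the lost node.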

However, there is no specific mechanism mentioned for rolling back and retrying transactions when a node is lost. This might be an area for improvement or a feature request [3].

Additionally, there are configuration options available to control the behavior of the cluster during auto-scaling events, such as auto_compaction_imperfect_blocks_threshold and auto_compaction_segments_limit [4][5].


wubx commented 1 month ago

This behavior is expected. For now, we kindly request that the application handle the retry itself.
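One way for an application to handle the retry, as suggested here, is to re-run the whole statement when a connection-loss error is seen. A sketch under stated assumptions: `QueryError` and `flaky_run` are hypothetical stand-ins for a real driver, and the retryable code set comes from the errors listed earlier in this thread:

```python
# Application-side retry wrapper: re-run the statement on connection loss.
RETRYABLE = {1043, 2006, 2013}

class QueryError(Exception):
    def __init__(self, code, msg):
        super().__init__(msg)
        self.code = code

def with_retry(run, sql, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return run(sql)
        except QueryError as e:
            if e.code not in RETRYABLE or attempt == max_attempts:
                raise  # non-retryable error, or out of attempts

# Demo: fail once with 2013 (pod lost), then succeed on the retry.
calls = {"n": 0}
def flaky_run(sql):
    calls["n"] += 1
    if calls["n"] == 1:
        raise QueryError(2013, "Lost connection to MySQL server during query")
    return "ok"

print(with_retry(flaky_run, "SELECT 1"))  # retried transparently -> "ok"
```

This is only safe as-is for idempotent statements; for multi-statement transactions the application would also need the rollback-and-restart behavior discussed above.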

rad-pat commented 1 month ago

Is it possible to handle this at the driver level, i.e. in bendsql?