big-andy-coates opened 4 years ago
Notes from incident investigation to give some background and colour:
`ksql.streams.producer.max.block.ms=9223372036854775807`. This means the metadata request never times out. Instead, the query is stuck in the `PENDING_SHUTDOWN` state. This explains how the query in this incident got stuck in `PENDING_SHUTDOWN`.
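That magic number is exactly `Long.MAX_VALUE`, which is easy to verify:

```java
public class MaxBlockValue {
    public static void main(String[] args) {
        // The configured value is java.lang.Long.MAX_VALUE, i.e. no effective timeout.
        System.out.println(Long.MAX_VALUE); // prints 9223372036854775807
    }
}
```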
Once in the stuck `PENDING_SHUTDOWN` state, any call to `KafkaStreams.close`, with or without a timeout, will block indefinitely (Kafka Streams bug, with fix: https://github.com/apache/kafka/pull/7814). Thanks to @bbejeck for confirming this!
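As an aside, the only way to bound a close call that can itself block forever is to run it on another thread and abandon it on timeout. A minimal sketch of that pattern, where `blockingClose` is a hypothetical stand-in for the wedged `KafkaStreams.close`, not the real API:

```java
import java.util.concurrent.*;

public class BoundedClose {
    // Hypothetical stand-in for a call that may block indefinitely,
    // like KafkaStreams.close() on a query stuck in PENDING_SHUTDOWN.
    static void blockingClose() throws InterruptedException {
        Thread.sleep(Long.MAX_VALUE);
    }

    // Run the close on a daemon thread and give up after the timeout,
    // so the calling thread (e.g. a CommandRunner-style thread) is never wedged.
    static boolean closeWithTimeout(long timeoutMs) {
        ExecutorService ex = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // don't keep the JVM alive if close never returns
            return t;
        });
        try {
            ex.submit(() -> { blockingClose(); return null; })
              .get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;  // close completed in time
        } catch (TimeoutException e) {
            return false; // close is stuck; report and move on
        } catch (InterruptedException | ExecutionException e) {
            return false;
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(closeWithTimeout(100)); // prints false: the close never returns
    }
}
```

The cost of this workaround is that the stuck thread is leaked rather than cleaned up, which is why the proper fix belongs in Kafka Streams itself.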
If the user submitted a `TERMINATE` call through the CLI or RESTful API, this may well cause the node to become unresponsive, as the main `CommandRunner` thread would block indefinitely closing the query.
If the user tries to shut down KSQL (i.e. on-prem), then KSQL will call `KafkaStreams.close` on each running query. If any queries are stuck in `PENDING_SHUTDOWN`, the process will never exit. We've seen this reported by several users (though I can't find the associated GitHub issues; I know some were reported on Stack Overflow).
Restarting a server with persistent queries stuck in this state will not fix the issue. Those same queries will be restarted and start spewing warnings again. Once the alert time limit is hit it will raise a new alert.
There have been other incidents where persistent queries have got stuck in this state. I don't know if anyone checked whether the server had been bounced when investigating; they could be the same issue. Alternatively, deleting a sink or internal topic from under a running query would likely also put the query into the same state: the metadata would eventually time out, the refresh would fail, and the query would block indefinitely on the refresh. Though I've not confirmed this (task to confirm: https://github.com/confluentinc/ksql/issues/4267).
There is a similar issue for transient queries. Transient queries aren’t recovered on a restart and don’t have a sink topic. However, it’s possible the cause is intermediate topics being deleted from under the query, which may result in the same issue. (again, https://github.com/confluentinc/ksql/issues/4267 is the task to confirm this).
@bbejeck made a recent change to KSQL to call `KafkaStreams.close` with a configurable timeout duration. However, this is not enough: the close call will still block indefinitely, due to the `Long.MAX_VALUE` timeout on the metadata fetch.
An obvious fix might appear to be setting `ksql.streams.producer.max.block.ms` to something other than `Long.MAX_VALUE`. However, I'm assuming it is set to this value to allow KSQL queries to ride out transient connectivity issues with Kafka (@vinothchandar may be able to confirm this, as I think he added this setting). If the setting were lowered, any connectivity issue that hit the timeout would see queries transition to `ERROR`, and I don't believe we currently have any automated or manual way to recover them, except to bounce the node. If we were to introduce such an automated recovery mechanism, then we could reduce `ksql.streams.producer.max.block.ms` to a more normal value.
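For illustration, such an override would live in the server properties. The value below is purely a placeholder to show the shape of the change, not a recommendation:

```properties
# Illustrative only: bounds how long the embedded producer blocks waiting
# for metadata. KSQL currently defaults this to Long.MAX_VALUE.
ksql.streams.producer.max.block.ms=60000
```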
**Possible fix 1:** add automated recovery of queries that have failed due to connectivity issues, and reduce `ksql.streams.producer.max.block.ms`.
We could avoid restarting, on recovery, any query whose source or sink topics no longer exist. However, that would only solve this specific issue, and would stop a query from recovering by recreating its sink topic when auto-create is enabled (which the user may want). Hence I don't think this is a good solution.
If the same issue occurs when topics are deleted from under a running query, which seems likely but is unconfirmed, then any fix should also cover this case. Such a fix could also avoid the same issue with transient queries. This may take the form of some periodic, or event-driven (there is a state listener), check to detect any queries in `PENDING_SHUTDOWN` which also have associated topics (source, internal or sink) that have been deleted. Such queries should be closed, or forced into an `ERROR` state, to avoid them spamming the logs. Of course, closing the queries will only work once the KS fix is in.
Detecting queries that transition to `PENDING_SHUTDOWN` and have missing topics would also cover the restart case. The downside is that we'd still start the query, which may recreate internal topics that the user has previously deleted. However, as the query is still active, I don't really see this as much of an issue: the user can stop the query / drop the view if they want to avoid this.
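The check described above might look roughly like this. All names here are illustrative stand-ins, not the real KSQL or Streams API; the sketch only shows the detection logic:

```java
import java.util.*;

public class StuckQueryCheck {
    // Hypothetical model: a query's Streams state plus the topics it uses.
    record Query(String id, String state, Set<String> topics) {}

    // Periodic (or state-listener-driven) check: flag queries that are in
    // PENDING_SHUTDOWN and reference a topic that no longer exists.
    static List<String> findStuck(List<Query> queries, Set<String> existingTopics) {
        List<String> stuck = new ArrayList<>();
        for (Query q : queries) {
            if (q.state().equals("PENDING_SHUTDOWN")
                    && !existingTopics.containsAll(q.topics())) {
                stuck.add(q.id()); // candidate to close / force into ERROR
            }
        }
        return stuck;
    }

    public static void main(String[] args) {
        Set<String> topics = Set.of("src", "sink_a");
        List<Query> queries = List.of(
            new Query("q1", "RUNNING", Set.of("src", "sink_a")),
            new Query("q2", "PENDING_SHUTDOWN", Set.of("src", "sink_b")) // sink_b deleted
        );
        System.out.println(findStuck(queries, topics)); // prints [q2]
    }
}
```

In a real implementation the topic list would come from the query's physical plan and the existing topics from an admin-client metadata call.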
**Possible fix 2:** detect queries stuck in `PENDING_SHUTDOWN` which also have missing topics, and close them. (Requires KS fix: https://github.com/apache/kafka/pull/7814)
First things first, we should:

- Confirm whether deleting topics from under a running query results in the same issue. Tracked by: https://github.com/confluentinc/ksql/issues/4267
- Add some log lines to indicate when recovery starts and stops, as this would have helped in the investigation. Tracked by: https://github.com/confluentinc/ksql/issues/4269
- Have KS handle missing topics gracefully: https://issues.apache.org/jira/browse/KAFKA-9416
- Have the Kafka `Producer.close` handle missing topics: https://issues.apache.org/jira/browse/KAFKA-9398
Thanks for the detailed analysis, @big-andy-coates! It's very helpful!

> On a restart there is no check that the source or sink topics exist, (KSQL bug), so the query is started.

Is there an issue for this bug? Without fixing that issue, there is no recourse for a user who deletes a sink or internal topic from under a running query, right? Even on a restart the query will get back into a stuck state. It should be fairly simple to either abort the restart process or mark these queries `ERROR` as soon as we detect the sink/internal topics have been deleted during startup.

> Possible fix 2: detect queries stuck in PENDING_SHUTDOWN, which also have missing topics and close them. (Requires KS fix: apache/kafka#7814)

I don't think we should add handling in KSQL to detect/resolve this. It's not a KSQL-specific problem, and makes assumptions about the internals of Streams (e.g. what the internal topics are, what the possible error conditions are). Streams should be able to detect that the underlying topic is inaccessible and gracefully transition the query to an error state.
> Thanks for the detailed analysis, @big-andy-coates! It's very helpful!
>
> > On a restart there is no check that the source or sink topics exist, (KSQL bug), so the query is started.
>
> Is there an issue for this bug? Without fixing that issue, there is no recourse for a user who deletes a sink or internal topic from under a running query, right? Even on a restart the query will get back into a stuck state. It should be fairly simple to either abort the restart process or mark these queries `ERROR` as soon as we detect the sink/internal topics have been deleted during startup.

As covered above, I don't think we should add code to KSQL to specifically handle this on start-up. Any such fix would only fix half the problem. Better to fix the root cause, using one of the two listed solutions, or another solution.

> Possible fix 2: detect queries stuck in PENDING_SHUTDOWN, which also have missing topics and close them. (Requires KS fix: apache/kafka#7814)
>
> I don't think we should add handling in KSQL to detect/resolve this. It's not a KSQL-specific problem, and makes assumptions about the internals of streams (e.g. what the internal topics are, what the possible error conditions are). Streams should be able to detect that the underlying topic is inaccessible and gracefully transition the query to an error state.

Agreed, I was thinking of it more as a temporary fix. How do you feel about option 1, @rodesai?
Option 1 makes sense to me in general for handling failures from streams/clients.
We know from recent investigation that on recovery/restart, any query whose sink topic has been deleted will get stuck in the `PENDING_SHUTDOWN` state and will block indefinitely in the `KafkaStreams.close` call, e.g. when the user tries to terminate the query or shut the node down gracefully. We need a solution to this!