big-andy-coates opened 4 years ago
Notes from incident investigation to give some background and colour:
`ksql.streams.producer.max.block.ms=9223372036854775807`. This means the metadata request never times out. Instead, the query is stuck in the `PENDING_SHUTDOWN` state. This explains how the query in this incident got stuck in `PENDING_SHUTDOWN`.
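That magic number is exactly `Long.MAX_VALUE`, which is easy to verify:

```java
public class MaxBlockValue {
    public static void main(String[] args) {
        // The configured value is java.lang.Long.MAX_VALUE, i.e. no effective timeout.
        System.out.println(Long.MAX_VALUE); // prints 9223372036854775807
    }
}
```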
Once in the stuck `PENDING_SHUTDOWN` state, any call to `KafkaStreams.close`, with or without a timeout, will block indefinitely (Kafka Streams bug, with fix: https://github.com/apache/kafka/pull/7814). Thanks to @bbejeck for confirming this!
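As an aside, the only way to bound a close call that can itself block forever is to run it on another thread and abandon it on timeout. A minimal sketch of that pattern, where `blockingClose` is a hypothetical stand-in for the wedged `KafkaStreams.close`, not the real API:

```java
import java.util.concurrent.*;

public class BoundedClose {
    // Hypothetical stand-in for a call that may block indefinitely,
    // like KafkaStreams.close() on a query stuck in PENDING_SHUTDOWN.
    static void blockingClose() throws InterruptedException {
        Thread.sleep(Long.MAX_VALUE);
    }

    // Run the close on a daemon thread and give up after the timeout,
    // so the calling thread (e.g. a CommandRunner-style thread) is never wedged.
    static boolean closeWithTimeout(long timeoutMs) {
        ExecutorService ex = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // don't keep the JVM alive if close never returns
            return t;
        });
        try {
            ex.submit(() -> { blockingClose(); return null; })
              .get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;  // close completed in time
        } catch (TimeoutException e) {
            return false; // close is stuck; report and move on
        } catch (InterruptedException | ExecutionException e) {
            return false;
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(closeWithTimeout(100)); // prints false: the close never returns
    }
}
```

The cost of this workaround is that the stuck thread is leaked rather than cleaned up, which is why the proper fix belongs in Kafka Streams itself.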
If the user submitted a `TERMINATE` call through the CLI or RESTful API, this may well cause the node to become unresponsive, as the main `CommandRunner` thread would block indefinitely closing the query.
If the user tries to shut down KSQL (i.e. on-prem), then KSQL will call `KafkaStreams.close` on each running query. If any queries are stuck in `PENDING_SHUTDOWN`, the process will never exit. We've seen this reported by several users (though I can't find the associated GitHub issues; I know some were reported on Stack Overflow).
Restarting a server with persistent queries stuck in this state will not fix the issue. Those same queries will be restarted and start spewing warnings again. Once the alert time limit is hit it will raise a new alert.
There have been other incidents where persistent queries have got stuck in this state. I don't know if anyone checked whether the server had been bounced when investigating; they could be the same issue. Alternatively, deleting a sink or internal topic from under a running query would likely also put the query into the same state: the metadata would eventually time out, the refresh would fail, and the query would block indefinitely on the refresh. Though I've not confirmed this (task to confirm: https://github.com/confluentinc/ksql/issues/4267).
There is a similar issue for transient queries. Transient queries aren’t recovered on a restart and don’t have a sink topic. However, it’s possible the cause is intermediate topics being deleted from under the query, which may result in the same issue. (again, https://github.com/confluentinc/ksql/issues/4267 is the task to confirm this).
@bbejeck made a recent change to KSQL to call `KafkaStreams.close` with a configurable timeout duration. However, this is not enough: the close call will still block indefinitely, due to the `Long.MAX_VALUE` timeout on the metadata fetch.
An obvious fix might appear to be setting `ksql.streams.producer.max.block.ms` to something other than `Long.MAX_VALUE`. However, I'm assuming it is set to this value to allow KSQL queries to ride out transient connectivity issues with Kafka (@vinothchandar may be able to confirm this, as I think he added this setting). If the setting were lowered, any connectivity issue that hit the timeout would see queries transition to `ERROR`, and I don't believe we currently have any automated or manual way to recover them, except to bounce the node. If we were to introduce such an automated recovery mechanism, then we could reduce `ksql.streams.producer.max.block.ms` to a more normal value.
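For illustration, such an override would live in the server properties. The value below is purely a placeholder to show the shape of the change, not a recommendation:

```properties
# Illustrative only: bounds how long the embedded producer blocks waiting
# for metadata. KSQL currently defaults this to Long.MAX_VALUE.
ksql.streams.producer.max.block.ms=60000
```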
**Possible fix 1:** add automated recovery of queries that have failed due to connectivity issues, and reduce `ksql.streams.producer.max.block.ms`.
We could avoid restarting, on recovery, any query whose source or sink topics no longer exist. However, that would only solve this specific issue, and would stop a query from recovering by recreating its sink topic when auto-create is enabled (which the user may want). Hence I don't think this is a good solution.
If the same issue occurs when topics are deleted from under a running query, which seems likely but is unconfirmed, then any fix should also cover this case. Such a fix could also avoid the same issue with transient queries. This may take the form of some periodic, or event-driven (there is a state listener), check to detect any queries in `PENDING_SHUTDOWN` which also have associated topics (source, internal or sink) that have been deleted. Such queries should be closed, or forced into an `ERROR` state, to avoid them spamming the logs. Of course, closing the queries will only work once the KS fix is in.
Detecting queries that transition to `PENDING_SHUTDOWN` and have missing topics would also cover the restart case. The downside is that we'd still start the query, which may recreate internal topics that the user has previously deleted. However, as the query is still active, I don't really see this as much of an issue: the user can stop the query / drop the view if they want to avoid this.
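The check described above might look roughly like this. All names here are illustrative stand-ins, not the real KSQL or Streams API; the sketch only shows the detection logic:

```java
import java.util.*;

public class StuckQueryCheck {
    // Hypothetical model: a query's Streams state plus the topics it uses.
    record Query(String id, String state, Set<String> topics) {}

    // Periodic (or state-listener-driven) check: flag queries that are in
    // PENDING_SHUTDOWN and reference a topic that no longer exists.
    static List<String> findStuck(List<Query> queries, Set<String> existingTopics) {
        List<String> stuck = new ArrayList<>();
        for (Query q : queries) {
            if (q.state().equals("PENDING_SHUTDOWN")
                    && !existingTopics.containsAll(q.topics())) {
                stuck.add(q.id()); // candidate to close / force into ERROR
            }
        }
        return stuck;
    }

    public static void main(String[] args) {
        Set<String> topics = Set.of("src", "sink_a");
        List<Query> queries = List.of(
            new Query("q1", "RUNNING", Set.of("src", "sink_a")),
            new Query("q2", "PENDING_SHUTDOWN", Set.of("src", "sink_b")) // sink_b deleted
        );
        System.out.println(findStuck(queries, topics)); // prints [q2]
    }
}
```

In a real implementation the topic list would come from the query's physical plan and the existing topics from an admin-client metadata call.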
**Possible fix 2:** detect queries stuck in `PENDING_SHUTDOWN` which also have missing topics, and close them. (Requires KS fix: https://github.com/apache/kafka/pull/7814)
First things first, we should:

- Confirm whether deleting topics from under a running query results in the same issue. Tracked by: https://github.com/confluentinc/ksql/issues/4267
- Add some log lines to indicate when recovery starts and stops, as this would have helped in the investigation. Tracked by: https://github.com/confluentinc/ksql/issues/4269
- Have KS handle missing topics gracefully: https://issues.apache.org/jira/browse/KAFKA-9416
- Have the Kafka `Producer.close` handle missing topics: https://issues.apache.org/jira/browse/KAFKA-9398
Thanks for the detailed analysis, @big-andy-coates! It's very helpful!

> On a restart there is no check that the source or sink topics exist, (KSQL bug), so the query is started.

Is there an issue for this bug? Without fixing that issue, there is no recourse for a user who deletes a sink or internal topic from under a running query, right? Even on a restart the query will get back into a stuck state. It should be fairly simple to either abort the restart process or mark these queries `ERROR` as soon as we detect the sink/internal topics have been deleted during startup.

> Possible fix 2: detect queries stuck in PENDING_SHUTDOWN, which also have missing topics and close them. (Requires KS fix: apache/kafka#7814)

I don't think we should add handling in KSQL to detect/resolve this. It's not a KSQL-specific problem, and makes assumptions about the internals of Streams (e.g. what the internal topics are, what the possible error conditions are). Streams should be able to detect that the underlying topic is inaccessible and gracefully transition the query to an error state.
> Thanks for the detailed analysis, @big-andy-coates! It's very helpful!
>
> > On a restart there is no check that the source or sink topics exist, (KSQL bug), so the query is started.
>
> Is there an issue for this bug? Without fixing that issue, there is no recourse for a user who deletes a sink or internal topic from under a running query, right? Even on a restart the query will get back into a stuck state. It should be fairly simple to either abort the restart process or mark these queries `ERROR` as soon as we detect the sink/internal topics have been deleted during startup.

As covered above, I don't think we should add code to KSQL to specifically handle this on start-up. Any such fix would only fix half the problem. Better to fix the root cause, using one of the two listed solutions, or another solution.

> Possible fix 2: detect queries stuck in PENDING_SHUTDOWN, which also have missing topics and close them. (Requires KS fix: apache/kafka#7814)
>
> I don't think we should add handling in KSQL to detect/resolve this. It's not a KSQL-specific problem, and makes assumptions about the internals of streams (e.g. what the internal topics are, what the possible error conditions are). Streams should be able to detect that the underlying topic is inaccessible and gracefully transition the query to an error state.

Agreed, I was thinking of it more as a temporary fix. How do you feel about option 1, @rodesai?
Option 1 makes sense to me in general for handling failures from streams/clients.
We know from recent investigation that on recovery/restart, any query whose sink topic has been deleted will get stuck in the `PENDING_SHUTDOWN` state and will block indefinitely in the `KafkaStreams.close` call, e.g. when the user tries to terminate the query or shut the node down gracefully. We need a solution to this!