confluentinc / ksql

The database purpose-built for stream processing applications.
https://ksqldb.io
Other
77 stars 1.04k forks source link

TERMINATE query hangs and doesn't work #5107

Open fedexist opened 4 years ago

fedexist commented 4 years ago

Describe the bug We're experiencing a bug where some of the queries we have deployed (all stateless, all converting json to avro) are experiencing really slow consuming and fail to be terminated when the command is issued.

To Reproduce Steps to reproduce the behavior, include:

  1. The version of KSQL: CLI v5.2.4, Server v5.2.4
  2. Cannot share the data, but it's a simple json structured as a CDC event, with 20-30 fields depending on the source table
  3. All streams are structured as such:
    CREATE STREAM json_stream (fields .....) WITH (VALUE_FORMAT='json', KAFKA_TOPIC='source_topic_json')
    CREATE STREAM avro_stream WITH (VALUE_FORMAT='AVRO', KAFKA_TOPIC='target_topic_avro') AS SELECT [fields + some casts from string to timestamp] FROM json_stream

    Please note that we have around 100 streams structured like this and this behaviour is evident only in a subset of these.

[root@kafka-ksql-srv-1 ~]# cat /etc/ksql/ksql-server.properties
bootstrap.servers=brokers
ksql.schema.registry.url=schemaregistry
listeners=http://kafka-ksql-srv-1:8088
ksql.streams.producer.retries=2147483647
ksql.streams.producer.confluent.batch.expiry.ms=9223372036854775807
ksql.streams.producer.request.timeout.ms=300000
ksql.streams.producer.max.block.ms=9223372036854775807
ksql.sink.replicas=2
ksql.streams.replication.factor=2
ksql.streams.state.dir=/mnt/data/ksql
ksql.streams.num.standby.replicas=1
ksql.streams.num.stream.threads=3

Expected behavior I'd expect the query to terminate, deleting the corresponding consumer group and allowing me to drop the stream.

Actual behaviour Trying to terminate these queries causes the CLI to hang, effectively failing the termination. Opening another terminal and/or restarting the server, we can see that the query is not actually terminated and still linked to the stream, which cannot then be dropped. The corresponding consumer group does not get deleted from the brokers. Creating another stream (with different naming, same topology), consuming from the same topic gives the same result of slow consuming and impossibility to terminate the newly created queries.

KSQL parallelism is set to 3 and from the ksql-streams.log we can that 2 of the threads get shutdown, leaving one in pending shutdown:

[2020-04-17 20:14:03,950] INFO stream-client [_confluent-ksql-default_query_CSAS_PROD_GTEL_AFA_GOL_ANAGRAFE_CARTE_CREDITO_INTERNAL2_AVRO_243-bb3a9aa9-a240-44c9-a68c-ad47f26e6dfd] State transition from RUNNING to PENDING_SHUTDOWN (org.apache.kafka.streams.KafkaStreams:263)
[2020-04-17 20:14:03,955] INFO stream-thread [_confluent-ksql-default_query_CSAS_PROD_GTEL_AFA_GOL_ANAGRAFE_CARTE_CREDITO_INTERNAL2_AVRO_243-bb3a9aa9-a240-44c9-a68c-ad47f26e6dfd-StreamThread-448] Informed to shut down (org.apache.kafka.streams.processor.internals.StreamThread:1211)
[2020-04-17 20:14:03,958] INFO stream-thread [_confluent-ksql-default_query_CSAS_PROD_GTEL_AFA_GOL_ANAGRAFE_CARTE_CREDITO_INTERNAL2_AVRO_243-bb3a9aa9-a240-44c9-a68c-ad47f26e6dfd-StreamThread-448] State transition from RUNNING to PENDING_SHUTDOWN (org.apache.kafka.streams.processor.i
nternals.StreamThread:220)
[2020-04-17 20:14:03,959] INFO stream-thread [_confluent-ksql-default_query_CSAS_PROD_GTEL_AFA_GOL_ANAGRAFE_CARTE_CREDITO_INTERNAL2_AVRO_243-bb3a9aa9-a240-44c9-a68c-ad47f26e6dfd-StreamThread-449] Informed to shut down (org.apache.kafka.streams.processor.internals.StreamThread:1211)
[2020-04-17 20:14:03,959] INFO stream-thread [_confluent-ksql-default_query_CSAS_PROD_GTEL_AFA_GOL_ANAGRAFE_CARTE_CREDITO_INTERNAL2_AVRO_243-bb3a9aa9-a240-44c9-a68c-ad47f26e6dfd-StreamThread-449] State transition from RUNNING to PENDING_SHUTDOWN (org.apache.kafka.streams.processor.i
nternals.StreamThread:220)
[2020-04-17 20:14:03,959] INFO stream-thread [_confluent-ksql-default_query_CSAS_PROD_GTEL_AFA_GOL_ANAGRAFE_CARTE_CREDITO_INTERNAL2_AVRO_243-bb3a9aa9-a240-44c9-a68c-ad47f26e6dfd-StreamThread-450] Informed to shut down (org.apache.kafka.streams.processor.internals.StreamThread:1211)
[2020-04-17 20:14:03,959] INFO stream-thread [_confluent-ksql-default_query_CSAS_PROD_GTEL_AFA_GOL_ANAGRAFE_CARTE_CREDITO_INTERNAL2_AVRO_243-bb3a9aa9-a240-44c9-a68c-ad47f26e6dfd-StreamThread-450] State transition from RUNNING to PENDING_SHUTDOWN (org.apache.kafka.streams.processor.i
nternals.StreamThread:220)
[2020-04-17 20:14:04,027] INFO stream-thread [_confluent-ksql-default_query_CSAS_PROD_GTEL_AFA_GOL_ANAGRAFE_CARTE_CREDITO_INTERNAL2_AVRO_243-bb3a9aa9-a240-44c9-a68c-ad47f26e6dfd-StreamThread-449] Shutting down (org.apache.kafka.streams.processor.internals.StreamThread:1225)
[2020-04-17 20:14:04,035] INFO stream-thread [_confluent-ksql-default_query_CSAS_PROD_GTEL_AFA_GOL_ANAGRAFE_CARTE_CREDITO_INTERNAL2_AVRO_243-bb3a9aa9-a240-44c9-a68c-ad47f26e6dfd-StreamThread-448] Shutting down (org.apache.kafka.streams.processor.internals.StreamThread:1225)
[2020-04-17 20:14:04,037] INFO stream-thread [_confluent-ksql-default_query_CSAS_PROD_GTEL_AFA_GOL_ANAGRAFE_CARTE_CREDITO_INTERNAL2_AVRO_243-bb3a9aa9-a240-44c9-a68c-ad47f26e6dfd-StreamThread-449] State transition from PENDING_SHUTDOWN to DEAD (org.apache.kafka.streams.processor.inte
rnals.StreamThread:220)
[2020-04-17 20:14:04,037] INFO stream-thread [_confluent-ksql-default_query_CSAS_PROD_GTEL_AFA_GOL_ANAGRAFE_CARTE_CREDITO_INTERNAL2_AVRO_243-bb3a9aa9-a240-44c9-a68c-ad47f26e6dfd-StreamThread-449] Shutdown complete (org.apache.kafka.streams.processor.internals.StreamThread:1245)
[2020-04-17 20:14:04,050] INFO stream-thread [_confluent-ksql-default_query_CSAS_PROD_GTEL_AFA_GOL_ANAGRAFE_CARTE_CREDITO_INTERNAL2_AVRO_243-bb3a9aa9-a240-44c9-a68c-ad47f26e6dfd-StreamThread-448] State transition from PENDING_SHUTDOWN to DEAD (org.apache.kafka.streams.processor.inte
rnals.StreamThread:220)
[2020-04-17 20:14:04,050] INFO stream-thread [_confluent-ksql-default_query_CSAS_PROD_GTEL_AFA_GOL_ANAGRAFE_CARTE_CREDITO_INTERNAL2_AVRO_243-bb3a9aa9-a240-44c9-a68c-ad47f26e6dfd-StreamThread-448] Shutdown complete (org.apache.kafka.streams.processor.internals.StreamThread:1245)

ksql.log just shows the correct POST of the TERMINATE query, just after the attempted DROP STREAM (failed because of the query still active):

[2020-04-17 22:31:41,620] INFO Build logical plan for DROP STREAM prod_gtel_afa_gol_anagrafe_carte_credito_internal2_avro;. (io.confluent.ksql.QueryEngine:83)
[2020-04-17 22:31:41,647] INFO 10.200.81.10 - - [17/Apr/2020:22:31:41 +0000] "POST /ksql HTTP/1.1" 400 510  34 (io.confluent.rest-utils.requests:60)

Additional context I recently fixed an issue on this server where I was getting "Too many open files" exception, because the host was not configured correctly. I wonder if this issue could be related to some other host misconfiguration.

If there's any more info I could provide, please tell me.

Thank you

jamesmacfie commented 4 years ago

I also came across this. In our case we are running this inside Kubernetes and have designed it so that if the KSQL server pod gets restarted then we have persisted the current state so there's no data loss.

Because of that safety net I deleted the KSQL server pod and Kubernetes recreated it again. The recreated pod did not have this issue.

In conclusion, I don't know what the issue was but turning it on and of again seemed to work 😄