confluentinc / ksql

The database purpose-built for stream processing applications.
https://ksqldb.io

KSQL not reattaching to consumer groups after restart #2087

Open mbrancato opened 5 years ago

mbrancato commented 5 years ago

I ran into an issue where KSQL (5.0.0) stopped receiving messages from a topic, or slowed severely. I tried increasing the partitions and adding more KSQL instances, but that didn't help. At some point, messages basically stopped (I'm looking into Control Center for better visibility).

What I did find was that when restarting KSQL, it did not seem to reattach to the consumer groups properly. A consumer group would transition to RUNNING in the log output, but when monitoring that consumer group with kafka-consumer-groups.sh there was no increase in the current offset for any partition. I let this run for a long time and there was no movement in the current offset.

I ran the kafka-console-consumer.sh against one of the existing topics and immediately received events, so I think Kafka is working fine.

Is there any known workaround or solution for this?
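The "no movement in the current offset" check above can be automated by comparing two snapshots of committed offsets, e.g. parsed from repeated runs of kafka-consumer-groups.sh --describe. A minimal sketch (the function name and snapshot format are hypothetical, not from KSQL or Kafka):

```python
# Sketch: flag consumer-group partitions whose committed offset did not
# advance between two snapshots. Snapshots are dicts mapping
# (topic, partition) -> current offset; names here are illustrative only.

def stalled_partitions(earlier, later):
    """Return partitions whose committed offset failed to advance."""
    stalled = []
    for tp, old_offset in earlier.items():
        new_offset = later.get(tp, old_offset)
        if new_offset <= old_offset:  # no progress since the first snapshot
            stalled.append(tp)
    return sorted(stalled)

# Example: partition 0 of "events" is stuck, partition 1 is moving.
snapshot_1 = {("events", 0): 1500, ("events", 1): 2300}
snapshot_2 = {("events", 0): 1500, ("events", 1): 2450}
print(stalled_partitions(snapshot_1, snapshot_2))  # [('events', 0)]
```

Running this periodically against describe output would surface stuck partitions without manually eyeballing the offsets.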

mbrancato commented 5 years ago

Something I noticed is that every now and then I get a log from KSQL like the following:

[2018-10-24 01:06:06,367] WARN stream-thread [_confluent-ksql-default_query_InsertQuery_42-2fdba422-f172-45f0-966a-1fee31bfb44d-StreamThread-172] Detected task 0_0 that got migrated to another thread. This implies that this thread missed a rebalance and dropped out of the consumer group. Will try to rejoin the consumer group. Below is the detailed description of the task:
>TaskId: 0_0
>>      ProcessorTopology:
>               KSTREAM-SOURCE-0000000000:
>                       topics:         [events]
>                       children:       [KSTREAM-MAPVALUES-0000000001]
>               KSTREAM-MAPVALUES-0000000001:
>                       children:       [KSTREAM-TRANSFORMVALUES-0000000002]
>               KSTREAM-TRANSFORMVALUES-0000000002:
>                       children:       [KSTREAM-FILTER-0000000003]
>               KSTREAM-FILTER-0000000003:
>                       children:       [KSTREAM-MAPVALUES-0000000004]
>               KSTREAM-MAPVALUES-0000000004:
>                       children:       [KSTREAM-MAPVALUES-0000000005]
>               KSTREAM-MAPVALUES-0000000005:
>                       children:       [KSTREAM-SINK-0000000006]
>               KSTREAM-SINK-0000000006:
>                       topic:          StaticTopicNameExtractor(ALERTS)
>Partitions [events-0]
 (org.apache.kafka.streams.processor.internals.StreamThread:773)
[2018-10-24 01:06:06,404] INFO stream-thread [_confluent-ksql-default_query_InsertQuery_173-c2066571-ae54-4f4e-8195-a6bf13b41f08-StreamThread-693] partition assignment took 1717868 ms.
        current active tasks: [0_0, 0_4]
        current standby tasks: []
        previous active tasks: []
 (org.apache.kafka.streams.processor.internals.StreamThread:280)
mbrancato commented 5 years ago

After some lengthy monitoring and log tailing, here is what I think happens:

  1. The KSQL cluster starts up and begins attaching "stream threads" to consumer groups.
  2. It starts consuming from some partitions of the topic via the consumer group.
  3. Eventually, one or more stream threads fail.
  4. Kafka starts to rebalance the consumer group, which takes some time.
  5. IF stream threads > 0
  6. GOTO 2
  7. ELSE
  8. Do nothing forever.
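The steps above can be sketched as a small simulation (an assumption about the observed behavior, not a model of KSQL's actual internals):

```python
# Sketch of the suspected failure loop: each round, the cluster consumes,
# a stream thread dies (step 3), and a rebalance follows (step 4). Once no
# threads remain, consumption halts permanently (step 8).

def run_cluster(stream_threads, max_rounds=100):
    """Simulate rounds of consume -> thread failure -> rebalance."""
    rounds = 0
    while stream_threads > 0 and rounds < max_rounds:
        rounds += 1          # step 2: consume via the consumer group
        stream_threads -= 1  # step 3: a stream thread fails
        # step 4: rebalance happens here; in practice it can take minutes
    return "consumption stopped" if stream_threads == 0 else "still consuming"

print(run_cluster(3))  # threads are exhausted -> "consumption stopped"
```

The point of the sketch is that nothing in the loop ever replaces a dead thread, so the terminal state is reached no matter how many threads you start with.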

This occurs even with a single KSQL instance, but it does not churn as frequently. The biggest problem is that I need multiple KSQL instances to keep up with my event stream. This is probably a good use case for allowing the forced removal of consumer groups with the new consumer; I honestly think that would be a workaround.

slcraciun commented 5 years ago

Hey @mbrancato, I have encountered the same issue. Did you manage to find a workaround for it?

slcraciun commented 5 years ago

Those errors stopped appearing on my side when I removed the following two env variables from my deployment:

- name: KSQL_PRODUCER_INTERCEPTOR_CLASSES
  value: io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor
- name: KSQL_CONSUMER_INTERCEPTOR_CLASSES
  value: io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor
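For non-Docker deployments, the equivalent change would be removing these entries from ksql-server.properties (an assumption based on the Confluent Docker images' convention of mapping KSQL_* env variables to lowercase dotted property names; verify against your own config):

producer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor
consumer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor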
rodesai commented 4 years ago

@mbrancato were you able to find a fix for this issue? Can you share your KSQL configuration?

mbrancato commented 4 years ago

I was not. This was an operational issue over a year ago, and I'm guessing we rebuilt everything to get up and running again. I can close this if needed.