njhartwell opened this issue 7 years ago
We are seeing this sometimes as well. I think the wake-up timeout should be much higher. It is currently 3 seconds, but the default heartbeat interval in Kafka is also 3 seconds. The heartbeat frequency is a factor in how long a rebalance takes, so it makes sense to me that the wake-up timeout should be significantly higher, perhaps close to the default session timeout of 10 seconds.
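For anyone who wants to experiment with this, here is a minimal sketch of what raising the timeout could look like with the 0.x ConsumerSettings API (broker address and group id are placeholders; withWakeupTimeout backs the akka.kafka.consumer.wakeup-timeout setting, and the session/heartbeat properties are just the Kafka client defaults made explicit):

import scala.concurrent.duration._
import akka.actor.ActorSystem
import akka.kafka.ConsumerSettings
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer

object WakeupTimeoutSketch extends App {
  implicit val system: ActorSystem = ActorSystem("example")

  val consumerSettings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")                            // placeholder broker
    .withGroupId("example-group")                                      // placeholder group
    .withProperty(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "10000")   // Kafka default: 10 s
    .withProperty(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000") // Kafka default: 3 s
    .withWakeupTimeout(10.seconds)                                     // raised towards the session timeout
}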
I'm facing a similar issue in my project:
09:11:43.738 [test.kafka.default-dispatcher-22] DEBUG o.a.k.c.c.i.ConsumerCoordinator - Group test committed offset 17872622155 for partition test-0
09:11:43.753 [test.kafka.default-dispatcher-14] INFO o.a.k.clients.producer.KafkaProducer - Closing the Kafka producer with timeoutMillis = 60000 ms.
09:11:43.755 [kafka-producer-network-thread | producer-1] DEBUG o.a.k.c.producer.internals.Sender - Beginning shutdown of Kafka producer I/O thread, sending remaining records.
09:11:43.759 [test.actor.default-dispatcher-48] DEBUG akka.kafka.internal.ProducerStage - Stage completed
09:11:43.762 [kafka-producer-network-thread | producer-1] DEBUG o.a.k.c.producer.internals.Sender - Shutdown of Kafka producer I/O thread has completed.
09:11:43.762 [test.kafka.default-dispatcher-14] DEBUG o.a.k.clients.producer.KafkaProducer - The Kafka producer has closed.
09:11:43.763 [test.actor.default-dispatcher-48] DEBUG akka.kafka.internal.ProducerStage - Producer closed
This happens randomly after a couple of hours (or even days) of running. The cause appears to be this:
Exception in thread "main" java.lang.IllegalStateException: No current assignment for partition test-0
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:264)
at org.apache.kafka.clients.consumer.internals.SubscriptionState.committed(SubscriptionState.java:269)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:739)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:716)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:784)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:765)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:186)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:149)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:116)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:493)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:322)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:253)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.pollNoWakeup(ConsumerNetworkClient.java:263)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread.run(AbstractCoordinator.java:907)
(com.typesafe.akka:akka-stream-kafka_2.12:0.15 with kafka brokers 0.10)
+1
I experience exactly the same problem quite randomly after wakeup:
07/13/2017 23:37:43.090|INFO ||||org.apache.kafka.common.config.AbstractConfig.logAll(AbstractConfig.java:180)|ProducerConfig values:
acks = 1
batch.size = 16384
block.on.buffer.full = false
bootstrap.servers = [kafka01-stage1.messagehub.services.us-south.bluemix.net:9093, kafka02-stage1.messagehub.services.us-south.bluemix.net:9093, kafka03-stage1.messagehub.services.us-south.bluemix.net:9093, kafka04-stage1.messagehub.services.us-south.bluemix.net:9093, kafka05-stage1.messagehub.services.us-south.bluemix.net:9093]
buffer.memory = 33554432
client.id =
compression.type = none
connections.max.idle.ms = 540000
interceptor.classes = null
key.serializer = class org.apache.kafka.common.serialization.StringSerializer
linger.ms = 0
max.block.ms = 60000
max.in.flight.requests.per.connection = 5
max.request.size = 1048576
metadata.fetch.timeout.ms = 60000
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.sample.window.ms = 30000
partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
receive.buffer.bytes = 32768
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retries = 0
retry.backoff.ms = 100
sasl.jaas.config = [hidden]
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.mechanism = PLAIN
security.protocol = SASL_SSL
send.buffer.bytes = 131072
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2]
ssl.endpoint.identification.algorithm = HTTPS
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLSv1.2
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
timeout.ms = 30000
value.serializer = class org.apache.kafka.common.serialization.StringSerializer
07/13/2017 23:37:43.094|INFO ||||org.apache.kafka.common.utils.AppInfoParser$AppInfo.<init>(AppInfoParser.java:83)|Kafka version : 0.10.2.1
07/13/2017 23:37:43.098|INFO ||||org.apache.kafka.common.utils.AppInfoParser$AppInfo.<init>(AppInfoParser.java:84)|Kafka commitId : e89bffd6b2eff799
07/13/2017 23:37:43.099|INFO ||||org.apache.kafka.common.utils.AppInfoParser$AppInfo.<init>(AppInfoParser.java:83)|Kafka version : 0.10.2.1
07/13/2017 23:37:43.099|INFO ||||org.apache.kafka.common.utils.AppInfoParser$AppInfo.<init>(AppInfoParser.java:84)|Kafka commitId : e89bffd6b2eff799
07/13/2017 23:37:44.534|INFO ||||org.apache.kafka.clients.consumer.internals.AbstractCoordinator$GroupCoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:586)|Discovered coordinator kafka05-stage1.messagehub.services.us-south.bluemix.net:9093 (id: 2147483643 rack: null) for group test-UsageMetricsSpec-group-1.
07/13/2017 23:37:44.539|INFO ||||org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinPrepare(ConsumerCoordinator.java:397)|Revoking previously assigned partitions [] for group test-UsageMetricsSpec-group-1
07/13/2017 23:37:44.539|INFO ||||org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendJoinGroupRequest(AbstractCoordinator.java:420)|(Re-)joining group test-UsageMetricsSpec-group-1
[WARN] [07/13/2017 23:37:46.191] [ml-event-client-units-akka.kafka.default-dispatcher-20] [akka://ml-event-client-units/system/kafka-consumer-2] Consumer interrupted with WakeupException after timeout. Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds
07/13/2017 23:37:46.640|INFO ||||org.apache.kafka.clients.consumer.internals.AbstractCoordinator$1.onSuccess(AbstractCoordinator.java:388)|Successfully joined group test-UsageMetricsSpec-group-1 with generation 5
After:
Successfully joined group test-UsageMetricsSpec-group-1 with generation 5
the consumer gets stuck forever.
In my case increasing the wakeup timeout didn't help. Consumers get stuck rarely, but they still get stuck. It is a big issue because the Kafka brokers think the partitions are assigned to the consumer. The only way to "unclog" the consumer is to trigger a group rebalance. Is the issue in the way reactive-kafka uses the wakeup mechanism, or in the vanilla Kafka consumer implementation?
I am also facing this issue; I've seen it three times in the past month. Rebalances happen frequently in my application, but at times I see 0 partitions assigned (I subscribed to 1 partition).
Does it take time for the client to learn that the partition has already been revoked? Should we poll again if this arises?
I am pretty confident that the issue is that the poll initiated by the KafkaConsumerActor is grabbing the metadata and joining the group, but the WakeupException interrupts this process after the group is joined and before the callbacks (ConsumerRebalanceListener) are called that tell the SingleSourceLogic which topics are subscribed to. The SingleSourceLogic then makes requests to the KafkaConsumerActor for the topic partitions it knows about, but they either aren't the right ones or are empty because the callback was never called. To be clear, the KafkaConsumer has been assigned partitions in this case, but because the SingleSourceLogic doesn't know about them, they are never consumed from.
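To make that sequence concrete, here is a rough sketch of the poll-with-wakeup pattern against the plain Kafka client (this is not the actual KafkaConsumerActor code; broker, group, and topic names are made up):

import java.util.{Collections, Properties}
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.errors.WakeupException
import org.apache.kafka.common.serialization.StringDeserializer

object WakeupRaceSketch extends App {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group")
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("example-topic"))

  // Emulate the wakeup-timeout: interrupt a long-running poll after 3 seconds.
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.schedule(new Runnable { def run(): Unit = consumer.wakeup() }, 3, TimeUnit.SECONDS)

  try {
    // This poll may join the group and receive partitions from the coordinator.
    consumer.poll(10000L)
  } catch {
    case _: WakeupException =>
      // If the wakeup lands after the join but before the rebalance callbacks reach the
      // caller, the broker-side assignment exists even though the caller never observed
      // onPartitionsAssigned:
      println(s"assignment after wakeup: ${consumer.assignment()}")
  } finally {
    scheduler.shutdown()
    consumer.close()
  }
}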
I'm not fully sure how the code should be changed to deal with this problem, but the workaround we are implementing right now is to set the wakeup timeout a little longer (6 seconds or so), set max-wakeups to 1, and restart the stream whenever that limit is exceeded.
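A sketch of that workaround using Akka Streams' RestartSource (topic, group, broker, and backoff values are placeholders, and withMaxWakeups is assumed to be the settings-API counterpart of the max-wakeups config key):

import scala.concurrent.duration._
import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{RestartSource, Sink}
import org.apache.kafka.common.serialization.StringDeserializer

object RestartOnMaxWakeups extends App {
  implicit val system: ActorSystem = ActorSystem("example")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092") // placeholder broker
    .withGroupId("example-group")           // placeholder group
    .withWakeupTimeout(6.seconds)           // a little longer than the 3 s default
    .withMaxWakeups(1)                      // fail the stage on the first exceeded wakeup

  // When max-wakeups is exceeded the stage fails, and RestartSource brings it back up
  // with a fresh consumer, so a stuck assignment does not linger.
  RestartSource
    .withBackoff(minBackoff = 3.seconds, maxBackoff = 30.seconds, randomFactor = 0.2) { () =>
      Consumer.plainSource(settings, Subscriptions.topics("example-topic"))
    }
    .runWith(Sink.foreach(record => println(record.value)))
}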
We're facing the same issue.
We have a service which sometimes stops receiving data right after a WakeupException. The issue doesn't happen on every WakeupException, but when it happens it's always after a WakeupException.
The explanation from @jpeel fits pretty well with our observations.
We'll try to implement the proposed workaround, but a more structured way of solving the issue (without restarting the stream) would definitely be appreciated.
Has there been any activity on this issue?
Ping @patriknw @kciesielski. Given you're the last to have worked on the code, do you have any insights?
Defaults:
wakeup-timeout = 3s
max-wakeups = 10
So you mean that increasing the wakeup-timeout to, let's say, 9s would be different from trying 3 times with 3s?
The reason it's bad to have a long timeout is that it blocks a thread for that duration, which can easily result in starvation and unresponsiveness of other streams running on the same dispatcher. But sure, if that solves the problem please go ahead and create a PR.
Note that the reason for retrying at all, compared to just failing the stream immediately and relying only on stream restarts, is that the stream can then hopefully survive short glitches without losing its state.
@patriknw I think @jpeel's point was that sometimes, when the timeout occurs, it leaves the Kafka consumer in a state where no messages are received (even though messages are available) and the stream becomes stuck. The proposed workaround of raising the timeout and restarting the stream circumvents the problem, but it is NOT a solution.
I think the retry mechanism is the right way to go; as you said, it prevents losing the stream state. Given @jpeel's thesis on why the stream gets stuck, do you think it is plausible? Do you have any idea how to avoid the issue while keeping the retry mechanism?
I'm willing to try and put together a PR, but I don't really know the internals so any guidance would be welcome.
Ok, I read his comment again. Is there a way we can reset things completely after a wakeup interrupt, so that it is treated as a new consumer on the next attempt? Or can we get the assigned partitions with a method call after the wakeup, since we might have missed the callback? Other ideas?
@patriknw @jpeel I've tried to put together an implementation of the first strategy you proposed: reset things completely after a wakeup. commit
The basic idea is to simply unsubscribe and re-subscribe right away to all the topics. This way the assignment callbacks will be called and the state of the Single/SubSourceLogic should be consistent. Note that I had to introduce another callback just to clean the state of the Single/SubSourceLogic to keep things consistent.
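For reference, the core of that reset expressed against the plain consumer API (a sketch only; in the actual change this lives inside the KafkaConsumerActor, and rebalanceListener stands for whatever listener was registered originally):

import java.util.{Collection => JCollection}
import org.apache.kafka.clients.consumer.{ConsumerRebalanceListener, KafkaConsumer}

object ResubscribeOnWakeup {
  // Strategy 1: drop the (possibly inconsistent) subscription and subscribe again, so that
  // onPartitionsRevoked/onPartitionsAssigned fire and the source logic can rebuild its
  // state from scratch on the next poll.
  def resetSubscription[K, V](consumer: KafkaConsumer[K, V],
                              topics: JCollection[String],
                              rebalanceListener: ConsumerRebalanceListener): Unit = {
    consumer.unsubscribe()                        // clears the current assignment
    consumer.subscribe(topics, rebalanceListener) // next poll re-joins the group and
                                                  // triggers the callbacks again
  }
}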
While I was working on this, it occurred to me that another strategy would be to reconcile the consumer's actual assignments with the Source's state after a wakeup (see the next comments).
This strategy would still require some bookkeeping on the KafkaConsumerActor side but would free the Single/SubSourceLogic from any change. It requires that the internal state of the KafkaConsumer is not corrupted after the wakeup, which I think is a reasonable assumption to make.
I'll try to work on this approach in the next few days, in the meantime could someone take a look at the current proposal and give me some feedback?
@patriknw @jpeel I've implemented the alternative strategy of reconciling the TopicPartition assignments on WakeupException. commit
As I stated in the previous comment, this strategy tries to reconcile the new KafkaConsumer assignments with the Source's by manually calling the RebalanceListener callbacks for all revoked/added TopicPartition assignments.
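In essence the reconciliation diffs what the client reports as assigned against what the sources were last told, and replays the listener callbacks for the difference. A sketch, where knownAssignment stands for the KafkaConsumerActor's bookkeeping:

import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{ConsumerRebalanceListener, KafkaConsumer}
import org.apache.kafka.common.TopicPartition

object ReconcileOnWakeup {
  // Strategy 2: after a WakeupException, compare the consumer's actual assignment with the
  // partitions the sources already know about, and invoke the rebalance listener manually
  // for anything that was missed.
  def reconcile[K, V](consumer: KafkaConsumer[K, V],
                      knownAssignment: Set[TopicPartition],
                      listener: ConsumerRebalanceListener): Unit = {
    val actual  = consumer.assignment().asScala.toSet
    val revoked = knownAssignment -- actual
    val added   = actual -- knownAssignment
    if (revoked.nonEmpty) listener.onPartitionsRevoked(revoked.asJava)
    if (added.nonEmpty) listener.onPartitionsAssigned(added.asJava)
  }
}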
Can someone take a look at both approaches and give some feedback? For both solutions I've tested that everything works in the normal case, but I'm not sure how to manually cause the issue (stopping the Kafka broker won't work, as the consumer detects that the broker is gone and won't follow the code path leading to the wakeup). Any ideas on how to test this in a controlled environment?
If we compare a normal scenario where the partitions get assigned:
INFO [2017-10-22 09:16:48,855] [ConsumerCoordinator:256] [] [] - Setting newly assigned partitions [coffee_manager_events_1.0-2, coffee_manager_events_1.0-1, coffee_manager_events_1.0-0, coffee_manager_events_1.0-4, coffee_manager_events_1.0-3] for group coffee-manager
INFO [2017-10-22 09:16:48,855] [AbstractCoordinator:388] [] [] - Successfully joined group coffee-manager with generation 9
INFO [2017-10-22 09:16:48,852] [AbstractCoordinator:420] [] [] - (Re-)joining group coffee-manager
INFO [2017-10-22 09:16:47,504] [AbstractCoordinator:420] [] [] - (Re-)joining group coffee-manager
INFO [2017-10-22 09:16:47,504] [ConsumerCoordinator:397] [] [] - Revoking previously assigned partitions [] for group coffee-manager
with a bad scenario where it happens:
INFO [2017-10-22 09:17:53,401] [AbstractCoordinator:388] [] [] - Successfully joined group coffee-manager with generation 11
WARN [2017-10-22 09:17:53,401] [KafkaConsumerActor:78] [coffeemanager-akka.kafka.default-dispatcher-62041] [akka.tcp://coffeemanager@contoso.com2551/system/kafka-consumer-9] - Consumer interrupted with WakeupException after timeout. Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds
INFO [2017-10-22 09:17:53,377] [AbstractCoordinator:388] [] [] - Successfully joined group coffee-manager with generation 11
INFO [2017-10-22 09:17:52,959] [AbstractCoordinator:420] [] [] - (Re-)joining group coffee-manager
INFO [2017-10-22 09:17:50,433] [AbstractCoordinator:420] [] [] - (Re-)joining group coffee-manager
INFO [2017-10-22 09:17:50,433] [ConsumerCoordinator:397] [] [] - Revoking previously assigned partitions [] for group coffee-manager
It for sure points to some flaky consumers, but I'm wondering if reactive-kafka can be more resilient to this. One way to self-check would be to use the driver's consumer-coordinator-metrics and read the assigned-partitions count, which should go up within a certain period after subscribing; if it doesn't, perhaps let it escalate to the supervisor?
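Something along those lines could read the client's assigned-partitions metric directly (a sketch assuming a Kafka client new enough to expose Metric.metricValue(), i.e. 1.0+; the escalation policy itself is left out):

import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object AssignmentSelfCheck {
  // Number of partitions the consumer coordinator currently believes are assigned, taken
  // from the driver's consumer-coordinator-metrics group. If this stays at 0 for too long
  // after subscribing, the stream could be failed/escalated so a supervisor restarts it.
  def assignedPartitionCount[K, V](consumer: KafkaConsumer[K, V]): Double =
    consumer.metrics().asScala.collectFirst {
      case (name, metric)
          if name.group() == "consumer-coordinator-metrics" && name.name() == "assigned-partitions" =>
        metric.metricValue() match {
          case d: java.lang.Double => d.doubleValue()
          case _                   => 0.0
        }
    }.getOrElse(0.0)
}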
@schrepfler wouldn't the strategy I proposed in the last message fix the problem?
Without knowing too many Kafka internals, I think WakeupExceptions are not such rare events in practice; is it acceptable to reconcile every time? I've noticed some callbacks being created, and you mention this Single/SubSourceLogic: is that the best/only technique we have at the driver level? Is there any risk of getting into some bad state, a race condition, or a memory leak?
@nivox The second alternative looks attractive since it contains the problem inside the KafkaConsumerActor. I don't know if there are any drawbacks to always doing that for WakeupExceptions.
@nivox I like the second solution. I rebased it and added some more logging around wakeup exceptions. Can you review https://github.com/akka/reactive-kafka/pull/395 ?
@chbatey: Thanks for taking the issue forward; I was a little unsure about how to proceed due to the lack of feedback.
I've taken a look at the pull request and it seems good to me.
Any update on this? We are facing the same issue; increasing the wakeup timeout to 6s and setting the max to 1 seems to resolve it, but it doesn't feel really solid to me.
Am I correct to assume that PR https://github.com/akka/reactive-kafka/pull/395 did not fully resolve the issue?
We'd like to understand if people still experience this with 0.20 and 0.21.
I had this issue on 0.16 and was fortunate to catch the 0.22 on the release day. I haven't seen the issue since.
Thank you for this update. Anyone else?
I'm using 0.22 and I'm having the same issue.
I'll try changing the settings.
@ennru, we just encountered this issue using v0.22. We stopped consuming a subset of topic partitions after the error: KafkaConsumer poll has exceeded wake up timeout (3 seconds). Waking up consumer to avoid thread starvation.
We were using the default configurations.
We have encountered the wake-up timeout error 60 times in the past 30 days, but only once did it stop consuming.
Anecdotally, we used to encounter this issue more often on older versions of alpakka-kafka, and we would usually lose all topic partitions, not just a subset. As a workaround, we added an idleTimeout flow element to terminate and restart the stream if it stopped consuming from all topic partitions, but of course that didn't work when only a subset was lost.
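For completeness, that idleTimeout workaround looks roughly like this (topic, group, broker, and the 5-minute threshold are made up); as noted, it only fires when the stream stops receiving records from all partitions:

import scala.concurrent.duration._
import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{RestartSource, Sink}
import org.apache.kafka.common.serialization.StringDeserializer

object IdleTimeoutWorkaround extends App {
  implicit val system: ActorSystem = ActorSystem("example")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092") // placeholder broker
    .withGroupId("example-group")           // placeholder group

  // idleTimeout fails the stream if no record arrives for 5 minutes; RestartSource then
  // rebuilds it with a fresh consumer. This only triggers when *all* assigned partitions
  // go quiet, which is why it missed the partial-loss case described above.
  RestartSource
    .withBackoff(minBackoff = 3.seconds, maxBackoff = 30.seconds, randomFactor = 0.2) { () =>
      Consumer
        .plainSource(settings, Subscriptions.topics("example-topic"))
        .idleTimeout(5.minutes)
    }
    .runWith(Sink.foreach(record => println(record.value)))
}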
We are looking at #614 in v1.0 to ultimately resolve this issue, but we are interested if any workarounds have been effective.
Thank you for reporting that here. So some state in the actor or Kafka client seems to be lost upon wake-up in certain cases. I'm not aware of a workaround for this. We've improved the error-handling around polling in 1.0-M1 a bit more, but you're right #614 would make the wakeup-dance go away completely.
@ennru any idea when version 1.0 will be production ready? I see that for now it's just an RC version.
It is considered production ready. We do not expect any significant changes and might call it 1.0 before the end of March 2019.
I have the same problem. After investigation, I found that there were two consumers. I hope my findings can help you. Command used:
bin/kafka-consumer-groups.sh --bootstrap-server [your kafka server] --group [your group name] --describe
Thank you @ValueYouth, we haven't found any evidence that this is caused by Alpakka Kafka, so your hint might help others to identify the problem.
FWIW: https://issues.apache.org/jira/browse/KAFKA-9917 --- I've seen this as well several times after broker restart. In my app, consumer source was inside RestartSource, so in fact both restarted.
I am using Confluent Kafka in .NET Core. I need to check whether the assignment count is greater than zero while consuming, but I end up in a continuous loop without any assignments.
OS: Windows 64-bit, Kafka: 2.8.1, .NET: 6.0, library: Confluent.Kafka. I have an event-log-master topic created with 5 partitions on my Kafka broker.
Here's the sample .NET Core code I'm using to reproduce it:
using System;
using System.Threading;
using Confluent.Kafka;

ConsumerConfig conf = new ConsumerConfig();
conf.BootstrapServers = "localhost:9092";
conf.EnableAutoCommit = false;
conf.SessionTimeoutMs = 30000;
conf.TopicMetadataRefreshIntervalMs = 10000;
conf.AutoOffsetReset = AutoOffsetReset.Latest;
conf.MaxPollIntervalMs = 300 * 60 * 1000;
conf.MaxPartitionFetchBytes = 10000;
conf.GroupId = "test-group-poc-1";
conf.AllowAutoCreateTopics = true;
conf.PartitionAssignmentStrategy = PartitionAssignmentStrategy.RoundRobin;

var pattern = "event-log-master";
long count = 0;

CancellationTokenSource cts = new CancellationTokenSource();
Console.CancelKeyPress += (_, e) => {
    e.Cancel = true; // prevent the process from terminating.
    cts.Cancel();
};

using (var consumer = new ConsumerBuilder<string, string>(conf).Build())
{
    consumer.Subscribe(pattern);
    try
    {
        while (!cts.Token.IsCancellationRequested)
        {
            if (consumer.Assignment.Count > 0)
            {
                // Only consume once the broker has assigned partitions to this consumer.
                var record = consumer.Consume(cts.Token);
                Console.WriteLine($"Consumed {record?.Message}");
                if (record != null)
                {
                    consumer.Commit(record);
                }
            }
            else
            {
                // No assignment yet: re-subscribe and check again after a short wait.
                ConsumeResult<string, string> msg = null;
                // var msg = consumer.Consume(0);
                consumer.Subscribe(pattern);
                Thread.Sleep(1000);
                Console.WriteLine($"ZeroAssignments: {consumer.Name} | msg: {msg?.Message}");
            }
        }
    }
    catch (OperationCanceledException)
    {
        // Ctrl-C was pressed.
    }
    finally
    {
        consumer.Close();
    }
}
I found a workaround which somehow gives the assignments: I call consumer.Consume(0) with a 0 timeout when there are no assignments, and after the loop runs 2 or 3 times we get the assignments. This behavior seems weird.
What's the right fix? I was also trying to re-subscribe if I don't find any assignments; I know it doesn't make much sense, but I was just trying things.
We are using reactive-kafka (com.typesafe.akka:akka-stream-kafka_2.11:0.14) with Kafka 0.10 brokers and are frequently seeing consumers come up (usually immediately following the termination of another consumer in the group) and end up with no partitions assigned. Logs look like this:
Relevant code looks like this:
My apologies for the lack of a runnable test case, etc. (one will follow soon if need be); I just wanted to get this out there in case there is a known issue behind it or something obviously wrong with how we're using the library. Thanks!