IBM / sarama

Sarama is a Go library for Apache Kafka.

circuit breaker is open loops forever during re-balance #239

Closed · joestein closed this issue 9 years ago

joestein commented 9 years ago

I set up a test where I am producing 380,000 messages per second from 1 machine to 3 Kafka brokers.

When I shut down (cleanly) one of the brokers, this happens:

Per Second 387774 &{0xc2096cf240 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc20922c080 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc208a31b00 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc209305800 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc20815fc80 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc20847c440 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc209bf0780 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc2082cda40 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc209a4a6c0 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc208a43d00 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc2096cf280 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc20922c0c0 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc208a31b40 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc208c7e780 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc20815fcc0 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc20847c480 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc209bf07c0 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc209436a00 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc209a4a700 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc209520440 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc208b8fac0 kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.} &{0xc208b8fb00 circuit breaker is open} &{0xc208b8fb40 circuit breaker is open} &{0xc208454580 circuit breaker is open} &{0xc2084545c0 circuit breaker is open}

..... circuit breaker is open goes on forever and does not recover.
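For context, the `&{0xc... kafka server: ...}` values above look like the default formatting of *sarama.ProducerError values read off the async producer's Errors() channel. Below is a minimal sketch of a benchmark-style producer that would print output like this, written against the current github.com/IBM/sarama API (at the time of this issue the client lived at github.com/Shopify/sarama and some names differed); the broker addresses and topic are placeholders:

```go
package main

import (
	"fmt"
	"time"

	"github.com/IBM/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Return.Errors = true // deliver failed messages on Errors()

	// Placeholder broker addresses; substitute the real 3-broker cluster.
	brokers := []string{"broker1:9092", "broker2:9092", "broker3:9092"}
	producer, err := sarama.NewAsyncProducer(brokers, cfg)
	if err != nil {
		panic(err)
	}
	defer producer.Close()

	// Drain and print delivery errors; fmt's default formatting of a
	// *sarama.ProducerError is what produces the "&{0xc... kafka server: ...}"
	// values in the output above.
	go func() {
		for perr := range producer.Errors() {
			fmt.Print(perr, " ")
		}
	}()

	// Produce as fast as possible for a minute, as in the benchmark described.
	deadline := time.Now().Add(time.Minute)
	for time.Now().Before(deadline) {
		producer.Input() <- &sarama.ProducerMessage{
			Topic: "bench", // hypothetical topic name
			Value: sarama.StringEncoder("payload"),
		}
	}
}
```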

Also please see https://issues.apache.org/jira/browse/KAFKA-1825. Not sure where the issue is, but something (either on the Sarama side, the Kafka side, or both; not finger pointing) is not gelling.

eapache commented 9 years ago

> ..... circuit breaker is open goes on forever and does not recover.

After three consecutive failures to refresh the metadata and find a leader, the circuit breaker opens. When the circuit is open it will make one attempt every ten seconds until it succeeds.
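Sarama uses the breaker from github.com/eapache/go-resiliency for this. Here is a minimal sketch of a breaker configured as described above (three consecutive errors to open, one retry attempt per ten-second window while open); the refreshLeader function is a hypothetical stand-in for the real metadata-refresh code, not Sarama's actual producer logic:

```go
package main

import (
	"errors"
	"log"
	"time"

	"github.com/eapache/go-resiliency/breaker"
)

// refreshLeader is a hypothetical stand-in for the metadata refresh / leader
// lookup that Sarama performs; it is not the real producer code.
func refreshLeader() error {
	return errors.New("kafka server: in the middle of a leadership election")
}

func main() {
	// Trip after 3 consecutive errors, close again after 1 success, and while
	// open allow a single new attempt once 10 seconds have passed.
	b := breaker.New(3, 1, 10*time.Second)

	for {
		switch err := b.Run(refreshLeader); err {
		case nil:
			return // leader found, carry on producing
		case breaker.ErrBreakerOpen:
			// The breaker is open: Run returned immediately without even
			// calling refreshLeader. This is the "circuit breaker is open" error.
			log.Println(err)
		default:
			log.Println("refresh failed:", err)
		}
		time.Sleep(250 * time.Millisecond)
	}
}
```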

> Also please see https://issues.apache.org/jira/browse/KAFKA-1825. Not sure where the issue is, but something (either on the Sarama side, the Kafka side, or both; not finger pointing) is not gelling.

We have seen odd issues before where sometimes Kafka's view of the metadata gets out-of-date compared to what's in zookeeper. This looks like that, as far as I can tell.

eapache commented 9 years ago

If something on Sarama's end is still behaving unexpectedly, please provide more information per https://github.com/Shopify/sarama/blob/master/CONTRIBUTING.md#reporting-issues

joestein commented 9 years ago

> When the circuit is open it will make one attempt every ten seconds until it succeeds.

Hmmm, I had 8 producer goroutines running and it didn't seem like it was waiting 10 seconds; it was just flying by the screen. However, I was still sending messages...

> We have seen odd issues before where sometimes Kafka's view of the metadata gets out-of-date compared to what's in zookeeper. This looks like that, as far as I can tell.

Agreed, but this doesn't happen with other clients. I am hoping we can track down what is causing it, make Kafka safe against it, and maybe rework whatever is making Sarama behave differently too. I have seen this same type of issue in production at scale between Kafka and Sarama, and I am just trying to help make both more resilient together.

eapache commented 9 years ago

> Hmmm, I had 8 producer goroutines running and it didn't seem like it was waiting 10 seconds; it was just flying by the screen. However, I was still sending messages...

It waits 10 seconds before trying again; messages that hit it in the meantime will fail immediately with ErrBreakerOpen (https://godoc.org/github.com/eapache/go-resiliency/breaker) without trying at all, in order to keep messages flowing to the other partitions, which are hopefully still up. If you are producing as fast as possible (as in a benchmark situation) then you will burn CPU for 10 seconds generating messages that fail immediately.
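A small sketch against github.com/eapache/go-resiliency/breaker that demonstrates this fail-fast behaviour: once the breaker is open, Run returns ErrBreakerOpen immediately without calling the work function at all, so a tight loop sees a flood of these errors even though the breaker itself only retries once per ten-second window. The failing work function is just a placeholder:

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/eapache/go-resiliency/breaker"
)

func main() {
	b := breaker.New(3, 1, 10*time.Second)

	// Three consecutive failures trip the breaker into the open state.
	failing := func() error { return errors.New("no leader for this partition") }
	for i := 0; i < 3; i++ {
		_ = b.Run(failing)
	}

	// While the breaker is open, Run never invokes the work function; it
	// returns breaker.ErrBreakerOpen right away. In a tight produce loop this
	// is why "circuit breaker is open" errors scroll past far faster than once
	// every 10 seconds: only the retry is rate-limited, not the failures.
	start := time.Now()
	err := b.Run(func() error {
		fmt.Println("never reached while the breaker is open")
		return nil
	})
	fmt.Printf("%v (returned after %v)\n", err, time.Since(start))
}
```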

The circuit breaker pattern is borrowed from Hystrix (https://github.com/Netflix/Hystrix) and Release It! (https://pragprog.com/book/mnee/release-it).

eapache commented 9 years ago

> Agreed, but this doesn't happen with other clients

So if you start with the console producer in your described setup, the issue doesn't happen? Even though once it does happen, switching to the console producer doesn't help?

Is the console producer sending anywhere near the same amount of data? Perhaps it's just something that happens under very high volume...

joestein commented 9 years ago

> switching to the console producer doesn't help?

No, once it is broken the only way to fix it is to shut off every broker and then turn them all back on again.

> Perhaps it's just something that happens under very high volume...

I don't think so; one Sarama user I know who uses the older producer doesn't have this issue. Also, I can reproduce this by just running "go test" a few times on https://github.com/stealthly/go_kafka_client/tree/wipAsyncSaramaProducer :( (see https://gist.github.com/joestein/8f6e74ea9da66a4ac924), which I noticed just now as I was going in to try to fix the branch I have for 03ad601663634fd75eb357fee6782653f5a9a5ed because tests weren't passing.

This doesn't happen on our master (which uses e9965215dc7a797a30ac9dce5ab1104c8c458bfc), or at least I haven't been able to reproduce it yet on our master against that commit.

eapache commented 9 years ago

> No, once it is broken the only way to fix it is to shut off every broker and then turn them all back on again.

This part is an upstream Kafka bug regardless of why it's getting into a bad state in the first place (and in principle it's an upstream bug that whatever Sarama is sending puts it into a bad state, though presumably that's also a bug in Sarama).

> This doesn't happen on our master (uses e996521)

There have only been three non-trivial changes to Sarama since that commit. If you definitely can't reproduce with it, a bisect would obviously be useful: I suggest trying a4d7eaf303a72a2e581f423ff57e075d9658039f, 324253e5ed71a1fe8dc53f277b6bba04a1b59f2f, and 198933323b86b22acd63a0977d3d33cac7f2ba37 as good base points; just those three SHAs should tell me which conceptual change (if any) introduced the problem.

eapache commented 9 years ago

I'm honestly not sure what else Sarama can do in this situation, or why you would see this bug with the revs you are running but not a few revs prior. The circuit breaker seems to be doing its job properly (in that when part of the cluster gets into a bad state it fails fast so that any remaining partitions don't get backed up).

I'm also not sure what Sarama could be sending to put the cluster in a bad state in the first place; we do rolling restarts under high load all the time and have never seen anything like this.

The gist you posted does not tell me much more than "the broker told us there was a leader election"; presumably we waited a bit and retried and then gave up, which means that perhaps the leader election just happened slowly? There is one case where we generate that error message ourselves, but I can't imagine that's what you're hitting here. I will add a better message around that case anyway, but I don't expect it to be relevant.

If you have any further information (sarama logs, or a simpler test case that can be run against 0.8.2 final, or...) please let me know; otherwise I'm out of ideas.
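For anyone gathering the logs requested here: sarama exposes a package-level Logger that discards output by default, and pointing it at stderr is the usual way to capture client-side logs. A minimal sketch, using the current import path (the issue-era path was github.com/Shopify/sarama):

```go
package main

import (
	"log"
	"os"

	"github.com/IBM/sarama"
)

func main() {
	// sarama.Logger discards output by default; point it at stderr to capture
	// the client-side logs (metadata refreshes, leader changes, retries, ...)
	// requested above.
	sarama.Logger = log.New(os.Stderr, "[sarama] ", log.LstdFlags)

	// ... build the AsyncProducer and run the test as usual from here.
}
```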