allegro / hermes

Fast and reliable message broker built on top of Kafka.
http://hermes.allegro.tech

Subscription paused using hermes manager endpoints #766

Closed hikrrish closed 7 years ago

hikrrish commented 7 years ago

@adamdubiel

We often use hermes-manager endpoints to deactivate and activate (pause) subscriptions. Yesterday we noticed that around 1900 messages were lost and never received. I have the Hermes message ids printed in the producer and consumer logs, and I could not find the matching message ids in the consumer logs for the paused interval.

On checking the Kafka nodes I don't see any lag. Do you recall any defect with subscription activation/deactivation using hermes-manager endpoints?
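
The producer/consumer log comparison described above can be sketched as a set difference over message ids. This is only an illustration: the `message_id=<id>` log marker and the function names are assumptions for the example, not the actual Hermes log format.

```python
import re

# Hypothetical log format: each relevant line contains "message_id=<id>".
ID_PATTERN = re.compile(r"message_id=([0-9a-fA-F-]+)")


def extract_ids(log_lines):
    """Collect every message id found in the given log lines."""
    ids = set()
    for line in log_lines:
        match = ID_PATTERN.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def missing_ids(producer_lines, consumer_lines):
    """Return ids that were published but never seen on the consumer side."""
    return extract_ids(producer_lines) - extract_ids(consumer_lines)
```

Restricting both inputs to the paused interval before comparing keeps the result focused on the suspected window.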

adamdubiel commented 7 years ago

Which Hermes version are you running? I don't recall any problems with suspending subscriptions, especially not ones that could cause message loss.

hikrrish commented 7 years ago

We are using version 0.11.0 of Hermes frontend and consumers, and 0.1.0-SNAPSHOT of Hermes manager.

hikrrish commented 7 years ago

I did some activation/deactivation testing today. When messages are published after the subscription is suspended and it is activated later (while no messages are flowing), the system seems to work fine.

We see all sorts of erroneous results if the subscription is deactivated and activated while messages are flowing at 500 msg/sec. I can see messages being delivered even after the topic is suspended. Maybe it takes a few milliseconds for the cluster to learn that the topic is suspended, which I believe is the trouble; message delivery stops after 1-2 seconds, which confirms this understanding.

Once the topic is activated, we see a few duplicate messages on the receiver side.

adamdubiel commented 7 years ago

Why is the scenario you are describing erroneous? Are the messages missing, or were they simply delivered in the period between clicking suspend and the subscription actually being suspended?

Clicking suspend is asynchronous by design, so it might take up to 30 seconds for all Consumer nodes to stop reading from Kafka and sending to the subscriber.

Duplicates might happen because some messages may already have been read from Kafka and be on their way to the subscriber when the subscription is suspended, so information about their successful delivery can't be committed to Kafka.
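
To make the duplicate scenario concrete, here is a toy model of at-least-once delivery. It is not Hermes or Kafka code: the `run_consumer` function, the list used as a log, and the batch sizes are all assumptions chosen for illustration. A batch is delivered but its offset commit is lost because the suspend arrives first, so the same messages are delivered again after reactivation.

```python
def run_consumer(log, committed_offset, batch_size, commit):
    """Deliver up to batch_size messages starting at committed_offset.

    Returns (delivered, new_committed_offset). When commit is False the
    offset is not advanced, modelling a suspend that arrives after the
    messages were sent but before the commit was recorded.
    """
    end = min(committed_offset + batch_size, len(log))
    delivered = log[committed_offset:end]
    return delivered, (end if commit else committed_offset)


log = ["m0", "m1", "m2", "m3", "m4"]
received = []

# Normal batch: delivered and committed.
batch, offset = run_consumer(log, 0, 2, commit=True)
received += batch

# Suspend arrives mid-flight: delivered, but the commit is lost.
batch, offset = run_consumer(log, offset, 2, commit=False)
received += batch

# After reactivation, consumption resumes from the last committed offset.
batch, offset = run_consumer(log, offset, 3, commit=True)
received += batch

# Messages the receiver saw more than once.
duplicates = sorted({m for m in received if received.count(m) > 1})
```

In this model every message still reaches the receiver at least once; only the uncommitted batch is repeated.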

hikrrish commented 7 years ago

Yes, the messages are being delivered in the period between clicking suspend and the actual suspension. We were expecting transactional behavior from the consumer, where messages wouldn't get duplicated by this action. Thanks for the explanation.

gamefundas commented 7 years ago

@adamdubiel wouldn't this be solved if the delivered messages were committed? The pause could become effective as soon as the last in-flight message is delivered.

adamdubiel commented 7 years ago

This would effectively mean that the period between clicking "Suspend" and actually suspending could be as long as inflightTTL, which we set to 1 hour by default (waiting for the last in-flight message to get delivered). Suspending means shutting down the consumer gracefully, which should minimize the number of messages that get duplicated. Since Hermes (and Kafka) offer at-least-once guarantees, I don't know if this is worth optimizing, since duplicates can still occur in other places in the pipeline.

gamefundas commented 7 years ago

We should be able to handle this on the receiver end. A few of our services aren't idempotent, which is why we noticed this and were concerned.
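
One way to handle duplicates on the receiver end is to remember recently seen message ids and skip repeats. The sketch below assumes each delivery carries the Hermes message id; the `DeduplicatingReceiver` class, its parameters, and the in-memory store are hypothetical, and how the id reaches the receiver is left out.

```python
from collections import OrderedDict


class DeduplicatingReceiver:
    """Wraps a handler so each message id is processed at most once.

    Ids are kept in memory in arrival order, and the oldest id is evicted
    once max_remembered is exceeded.
    """

    def __init__(self, handler, max_remembered=10000):
        self._handler = handler
        self._seen = OrderedDict()
        self._max = max_remembered

    def receive(self, message_id, payload):
        """Process payload once per message_id; return True if processed."""
        if message_id in self._seen:
            return False  # duplicate delivery, skip it
        self._handler(payload)
        # Record the id only after the handler succeeds, so a failed
        # delivery can still be retried.
        self._seen[message_id] = True
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # evict the oldest id
        return True
```

This only covers a single process. A service running several instances would need a shared store for the seen ids, and a duplicate that arrives after its id has been evicted would be processed again.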

adamdubiel commented 7 years ago

Okay, does it mean I can close this issue? :)

gamefundas commented 7 years ago

Yes, let's close this one. Thanks.

adamdubiel commented 7 years ago

Great :)