Closed: hikrrish closed this issue 7 years ago
Which Hermes version are you running? I don't recall any problems with suspending subscriptions, especially not ones that could cause message loss.
We are using version 0.11.0 of the Hermes frontend and consumers and 0.1.0-SNAPSHOT of Hermes manager.
I did some activation/deactivation testing today. When messages are published after the subscription is suspended and it is later activated (messages are not flowing during the switch), the system seems to work fine.
We see all sorts of erroneous results if the subscription is deactivated and activated while messages are flowing at 500 msg/sec. I can see messages being delivered even after the topic is suspended (maybe it takes a few milliseconds for the cluster to learn that the topic is suspended, which I believe is the trouble). After 1-2 seconds message delivery stops, which confirms this understanding.
Once the topic is activated, we see a few duplicate messages on the receiver side.
Why is the scenario you are describing erroneous? Are messages missing, or were they simply delivered during the period between clicking suspend and the subscription actually being suspended?
Clicking suspend is asynchronous by design, so it might take up to 30 seconds for all Consumer nodes to stop reading from Kafka and sending to the subscriber.
Duplicates might happen because some messages may already have been read from Kafka and be on their way to the subscriber when the subscription is suspended, so the information about their successful delivery can't be committed to Kafka.
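To illustrate the explanation above, here is a toy model (not Hermes code) of a consumer that delivers messages from a log and commits its offset only periodically. The function `run_consumer` and the parameters `stop_after` and `commit_every` are made up for this sketch; the real commit behavior in Hermes and Kafka is more involved.

```python
def run_consumer(log, committed_offset, deliveries, stop_after, commit_every):
    """Deliver messages starting at committed_offset; return the new committed offset.

    Hypothetical model: the offset is committed only after every
    `commit_every` deliveries, and the consumer is suspended after
    `stop_after` deliveries, possibly between two commits.
    """
    position = committed_offset
    delivered = 0
    while position < len(log) and delivered < stop_after:
        deliveries.append(log[position])  # message reaches the subscriber
        position += 1
        delivered += 1
        if delivered % commit_every == 0:
            committed_offset = position  # delivery is recorded
    return committed_offset


log = [f"msg-{i}" for i in range(10)]
deliveries = []

# Suspend after 7 deliveries; only the first 5 were committed.
committed = run_consumer(log, 0, deliveries, stop_after=7, commit_every=5)

# Activate again: reading resumes from the last committed offset.
committed = run_consumer(log, committed, deliveries, stop_after=len(log), commit_every=5)

duplicates = sorted({m for m in deliveries if deliveries.count(m) > 1})
print(duplicates)
```

In this model nothing is lost, but the messages delivered after the last commit and before the suspension are delivered a second time after activation.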
Yes, the messages are being delivered in the period between clicking suspend and the actual suspension. We were kind of expecting transactional behavior from the consumer, where messages won't get duplicated by this action. Thanks for the explanation.
@adamdubiel wouldn't this be solved if the delivered messages were committed? The pause could become effective soon after the last in-flight message is delivered.
This would effectively mean that the time period between clicking "Suspend" and actually suspending could be as long as inflightTTL, which we set to 1 hour by default (waiting for the last in-flight message to get delivered).
Suspending means shutting down the consumer gracefully, which should minimize the number of messages that will be duplicated. Since Hermes (and Kafka) offer at-least-once guarantees, I don't know if it is worth optimizing, since duplicates can still occur in other places in the pipeline.
We should be able to handle this on the receiver end. A few of our services aren't idempotent, which is where we noticed this and became concerned.
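One way to handle duplicates on the receiver end is to remember recently seen message ids and skip repeats. The sketch below is only an illustration: `DeduplicatingReceiver`, `max_remembered`, and the way the id reaches `receive` are assumptions, and it keeps state in memory, so a service running several instances would need a shared store instead.

```python
from collections import OrderedDict


class DeduplicatingReceiver:
    """Wraps a non-idempotent handler so each message id is processed once.

    Hypothetical helper: message_id is assumed to be the id that Hermes
    attaches to each message, extracted by the caller.
    """

    def __init__(self, handler, max_remembered=100_000):
        self._handler = handler
        self._seen = OrderedDict()  # insertion-ordered, oldest id first
        self._max_remembered = max_remembered

    def receive(self, message_id, body):
        """Process body unless message_id was seen before; return True if processed."""
        if message_id in self._seen:
            return False  # duplicate delivery, skip it
        self._handler(body)
        self._seen[message_id] = True
        if len(self._seen) > self._max_remembered:
            self._seen.popitem(last=False)  # forget the oldest id
        return True


processed = []
receiver = DeduplicatingReceiver(processed.append)
results = [
    receiver.receive("id-1", "a"),
    receiver.receive("id-2", "b"),
    receiver.receive("id-1", "a"),  # redelivered after activation
]
print(results, processed)
```

The id is recorded only after the handler returns, so a handler that raises will be retried on redelivery rather than silently skipped.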
Okay, does it mean I can close this issue? :)
Yes, let's close this one. Thanks.
Great :)
@adamdubiel
We often use hermes-manager endpoints to deactivate/activate (pause) a subscription. Yesterday we noticed that around 1900 messages were lost and never received. I have the Hermes message ids printed in the producer logs and the consumer logs, and I could not find matching message ids in the consumer logs for the paused interval.
On checking the Kafka nodes I don't see any lag. Do you recall any defect with subscription activation/deactivation using hermes-manager endpoints?
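For anyone wanting to repeat the comparison of producer and consumer logs described above, a small sketch follows. The log line format and the `messageId=` pattern are hypothetical; adjust the regular expression to whatever your services actually print.

```python
import re

# Hypothetical log format: each line contains "messageId=<id>".
ID_PATTERN = re.compile(r"messageId=([0-9a-fA-F-]+)")


def extract_ids(lines):
    """Collect every message id found in the given log lines."""
    ids = set()
    for line in lines:
        match = ID_PATTERN.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def missing_ids(producer_lines, consumer_lines):
    """Return ids that were published but never appear in the consumer log."""
    return sorted(extract_ids(producer_lines) - extract_ids(consumer_lines))


producer_log = [
    "10:00:01 published messageId=a1",
    "10:00:02 published messageId=b2",
    "10:00:03 published messageId=c3",
]
consumer_log = [
    "10:00:05 received messageId=a1",
    "10:00:09 received messageId=c3",
]
missing = missing_ids(producer_log, consumer_log)
print(missing)
```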