dashbitco / broadway_kafka

A Broadway connector for Kafka

[Docs] Documentation about handling failure #91

Closed yordis closed 2 years ago

yordis commented 2 years ago

Hey there, Broadway's documentation says the following: https://hexdocs.pm/broadway/Broadway.html#module-acknowledgements-and-failures

If there are no batchers, the acknowledgement will be done by processors.

As well as

Note however, that Broadway does not provide any sort of retries out of the box. This is left completely as a responsibility of the producer

So I am wondering how this producer deals with the failure.

Is there an opportunity to document the topic?

slashmili commented 2 years ago

@yordis just wondering, what kind of failure do you have in mind?

  1. Failure because you tagged a message as failed by using Broadway.Message.failed(message, "reason")
  2. Failure because there was an exception while handle_message was running
  3. Failure because the BEAM crashed

I suppose you mean all the cases but I think if we break it down, we can produce a better doc!

So my understanding:

  1. If you tag a message as failed, this lib still sends an ack to Kafka for that message. This is because of how messages are stored in Kafka: reprocessing a message is not as simple as in, say, RabbitMQ.
  2. If there is an exception, an ack is again sent to Kafka.
  3. If the BEAM crashes, there is no opportunity to send an ack, so the message will be read again.

One thing to keep in mind about acking is that the ack is sent after a batch of messages is processed. Each GenStage consumer gets 5 tasks (the default value), and after all 5 are processed (with or without error) the last offset is committed to the Kafka broker.
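To make the consequence of that behaviour concrete, here is a minimal sketch in plain Elixir. It is only a simplified model of the description above, not BroadwayKafka's actual implementation: the `AckModel` module, its function names, and the map shape are all hypothetical. It shows that a failed message in the middle of a batch does not hold back the committed offset.

```elixir
defmodule AckModel do
  # Hypothetical, simplified model of the acking described above.
  # Each message is a plain map; :status is :ok or {:failed, reason}.

  # Offset that would be committed once the whole batch has been processed.
  def committed_offset([]), do: nil

  def committed_offset(messages) do
    # Failed messages are acked too, so the status is ignored here
    # and the last (highest) offset in the batch is the one committed.
    messages
    |> Enum.map(& &1.offset)
    |> Enum.max()
  end

  # Offsets that end up acknowledged even though processing failed.
  def acked_despite_failure(messages) do
    for %{offset: offset, status: {:failed, _reason}} <- messages, do: offset
  end
end

batch = [
  %{offset: 10, status: :ok},
  %{offset: 11, status: {:failed, "reason"}},
  %{offset: 12, status: :ok},
  %{offset: 13, status: :ok},
  %{offset: 14, status: {:failed, "timeout"}}
]

IO.inspect(AckModel.committed_offset(batch), label: "committed offset")
IO.inspect(AckModel.acked_despite_failure(batch), label: "acked despite failure")
```

In this model, offsets 11 and 14 will not be delivered again after the commit, so anything that must not be lost has to be stored somewhere else before the ack happens.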

yordis commented 2 years ago

@slashmili right, I was referring to the 3 cases you describe.

Ideally, the documentation could showcase how to do Dead-Letter-Queues or a strategy to re-consume the message when the processor fails, for cases where you have to guarantee at-least-once successful processing because it is critical that the message gets processed.

slashmili commented 2 years ago

If we can validate my understanding, we can write it down in the Handling failed messages section.

the documentation could showcase how to do Dead-Letter-Queues or a strategy to re-consume the message

This is a complicated subject. While we can show that in handle_failed you could put the message in a DLQ (or another topic), showing how to re-consume depends heavily on the application and the nature of the failure. It also requires the app to keep track of the number of retries and to stop retrying after X times...
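As a rough illustration of the retry bookkeeping mentioned above, here is a sketch of the decision logic only, in plain Elixir with no dependencies. The `FailureRouting` module, the `x-retry-count` header name, and the limit of 3 are assumptions made for the example; none of them come from Broadway or BroadwayKafka. In a real pipeline this would presumably be called from Broadway's `handle_failed/2` callback with the Kafka headers of each failed message, and the actual publishing to a retry or DLQ topic would be done with whatever Kafka client the application uses.

```elixir
defmodule FailureRouting do
  # Hypothetical header name and retry limit, chosen for this example only.
  @retry_header "x-retry-count"
  @max_retries 3

  # Reads the retry counter from Kafka-style headers,
  # i.e. a list of {key, value} tuples with binary values.
  def retry_count(headers) do
    case List.keyfind(headers, @retry_header, 0) do
      {_key, value} -> String.to_integer(value)
      nil -> 0
    end
  end

  # Decides where a failed message should go next:
  #   {:retry, headers}       -> republish to a retry topic with the bumped counter
  #   {:dead_letter, headers} -> republish to the DLQ topic and stop retrying
  def route(headers, max_retries \\ @max_retries) do
    count = retry_count(headers)

    if count < max_retries do
      bumped = {@retry_header, Integer.to_string(count + 1)}
      {:retry, List.keystore(headers, @retry_header, 0, bumped)}
    else
      {:dead_letter, headers}
    end
  end
end

# First failure: no counter yet, so the message is retried with count 1.
IO.inspect(FailureRouting.route([]))

# Limit reached: the message goes to the dead-letter topic.
IO.inspect(FailureRouting.route([{"x-retry-count", "3"}]))
```

Keeping the counter in the message headers means the pipeline itself stays stateless, at the cost of having to republish the message on every retry.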

I think the first step is to create a PR to add a section to explain the failure scenarios and consequences.

yordis commented 2 years ago

@slashmili yeah, I hear you. I think there are endless use cases for how we could do it, but there are also a few basic approaches we could showcase since they are common enough, like DLQs after X failures.

Ideally, what is the basic example of an at-least-once processing guarantee without losing any message?

josevalim commented 2 years ago

This is related to the discussion in #30. As you can see, there are many different approaches people may take. Documenting different strategies sounds like a good call. :+1: Closing in favor of #30.