Closed danmarcab closed 1 year ago
Nothing obvious comes to mind, unfortunately.
Thank you for having a look @josevalim, really appreciated!
After debugging for another while we have found one of the possible causes of this (I say one because we've also seen offsets accumulating without this error message): the batcher dies due to an unknown timer.
```
GenServer MyService.Broadway.Broadway.Batcher_ignore terminating
** (RuntimeError) unknown timer #Reference<0.1181658914.3223060481.95258>
    (broadway 1.0.3) lib/broadway/topology/batcher_stage.ex:207: Broadway.Topology.BatcherStage.cancel_batch_timeout/1
    (broadway 1.0.3) lib/broadway/topology/batcher_stage.ex:148: Broadway.Topology.BatcherStage.deliver_batch/6
    (broadway 1.0.3) lib/broadway/topology/batcher_stage.ex:118: Broadway.Topology.BatcherStage.handle_events_per_batch_key/3
    (broadway 1.0.3) lib/broadway/topology/batcher_stage.ex:64: anonymous fn/2 in Broadway.Topology.BatcherStage.handle_events/3
    (telemetry 1.1.0) /build/deps/telemetry/src/telemetry.erl:320: :telemetry.span/3
    (broadway 1.0.3) lib/broadway/topology/batcher_stage.ex:54: Broadway.Topology.BatcherStage.handle_events/3
    (gen_stage 1.1.2) lib/gen_stage.ex:2471: GenStage.consumer_dispatch/6
    (gen_stage 1.1.2) lib/gen_stage.ex:2660: GenStage.take_pc_events/3
```
It seems that Broadway already accounts for the case where the timeout message has been received when the cancel timer returns false (https://github.com/dashbitco/broadway/blob/main/lib/broadway/topology/batcher_stage.ex#L201-L213), but there seems to be an edge case?
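For reference, the usual OTP pattern for cancelling a timer that may have already fired (which the linked code follows) looks roughly like the sketch below. `Process.cancel_timer/1` returns `false` when the timer already expired, at which point the `:timeout` message may already be sitting in the mailbox and has to be flushed. This is only an illustrative sketch, assuming the timer was started with `:erlang.start_timer/3` (which delivers `{:timeout, ref, payload}` messages):

```elixir
# Illustrative sketch of the cancel-or-flush pattern, not Broadway's actual code.
def cancel_timeout(timer_ref) do
  case Process.cancel_timer(timer_ref) do
    false ->
      # Timer already fired: its message may be in our mailbox, so flush it.
      receive do
        {:timeout, ^timer_ref, _payload} -> :ok
      after
        # If the message is not here, it was either never delivered
        # or already consumed -- this is the edge case in question.
        0 -> :ok
      end

    _milliseconds_left ->
      # Timer cancelled before firing; no message will arrive.
      :ok
  end
end
```

The `after 0` branch is exactly where an assumption about guaranteed delivery could go wrong.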
My guess is that crashing the batcher is fine for other producers where acking is not sequential, but for BroadwayKafka, where offsets must be acked in order, it seems to mess things up.
I am looking at the code and I cannot see a code path that would make the error message above happen. Every time we cancel the timer, we delete the batch, which means it is impossible to recover the timer again.
This has been fixed in Broadway. There was an assumption that the timer message would be delivered automatically but that was not always the case.
This is great news! Thank you very much for all your work @josevalim ❤️
Hi all! 👋
First of all, thanks for the great libraries!
We are running into a strange issue where every now and then (roughly once a day) we start seeing offsets accumulating in the ack state of a producer (for one or more partitions).
From our debugging we understand the ack state values are of shape:
Looking into the affected producers we can see the `pending` and `seen` lists keep growing indefinitely until the VM is OOM killed. The producer and processors seem to keep fetching and processing messages, as evidenced by the growth of the `seen` list. The issue seems to be that a small number of messages were never acked, so they remain at the front of the `pending` list.

We've been digging through the source code of broadway/broadway_kafka and cannot find any point where messages/acks could get lost without a trace (we are not seeing any error logs or crash reports).

As a very hacky workaround we are considering periodically checking the offset lag and, if the lag is too big, manually sending an `ack` message to the producer for the missing `offset_ids` at the front of `pending`. This "could" work for us since we mostly care about being up to date with the topic and can tolerate missing a few messages, but it's far from ideal.
In case it's useful, our broadway pipeline is very simple, being roughly:
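A minimal sketch of what such a pipeline looks like is below. The hosts, topic, group id, batch sizes, and concurrency are placeholders; only the `interested_in?` check and the no-op `:ignore` batcher come from our actual setup:

```elixir
defmodule MyService.Broadway do
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        module:
          {BroadwayKafka.Producer,
           hosts: [localhost: 9092],
           group_id: "my_group",
           topics: ["my_topic"]}
      ],
      processors: [default: [concurrency: 10]],
      batchers: [
        default: [batch_size: 100, batch_timeout: 1_000],
        ignore: [batch_size: 100, batch_timeout: 1_000]
      ]
    )
  end

  @impl true
  def handle_message(_processor, message, _context) do
    if interested_in?(message.data) do
      Message.put_batcher(message, :default)
    else
      # Messages we don't care about go to a batcher that does nothing.
      Message.put_batcher(message, :ignore)
    end
  end

  @impl true
  def handle_batch(:default, messages, _info, _context) do
    # Real processing happens here.
    messages
  end

  def handle_batch(:ignore, messages, _info, _context), do: messages

  defp interested_in?(_data), do: false
end
```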
The error happens even when the `interested_in?` function returns false, and therefore no processing is done at all, just forwarding to the `ignore` batcher, which does nothing.

Is there anything obvious we are missing?