dashbitco / broadway_kafka

A Broadway connector for Kafka

drain_after_revoke failed due to killed process #117

Closed yordis closed 1 year ago

yordis commented 1 year ago

I am receiving the following error in Sentry:

Sentry.CrashError: ** (exit) exited in: GenServer.call(#PID<0.5095.0>, :drain_after_revoke, :infinity)
    ** (EXIT) killed
  File "lib/gen_server.ex", line 1030, in GenServer.call/3
  File "lib/broadway_kafka/producer.ex", line 525, in anonymous fn/2 in BroadwayKafka.Producer.assignments_revoked/1
  File "/opt/app/deps/telemetry/src/telemetry.erl", line 320, in :telemetry.span/3
  File "/opt/app/deps/brod/src/brod_group_coordinator.erl", line 502, in :brod_group_coordinator.stabilize/3
  File "/opt/app/deps/brod/src/brod_group_coordinator.erl", line 416, in :brod_group_coordinator.handle_info/2
  File "gen_server.erl", line 695, in :gen_server.try_dispatch/4
  File "gen_server.erl", line 771, in :gen_server.handle_msg/6
  File "proc_lib.erl", line 226, in :proc_lib.init_p_do_apply/3

Coming from https://github.com/dashbitco/broadway_kafka/blob/271464fdcbe1e06bef75572319cf9ef9e5f01c41/lib/broadway_kafka/producer.ex#L525

I am wondering whether we should catch the exit and return :ok here.
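To make the idea concrete, here is a minimal sketch of catching the exit around a `GenServer.call/3`. `SafeCall` and `Stuck` are hypothetical modules written only for this illustration; they are not part of broadway_kafka, and this is not how the library itself handles the call.

```elixir
defmodule SafeCall do
  # Hypothetical helper: turns an exit raised by GenServer.call/3
  # into an error tuple instead of crashing the caller.
  def call(server, request, timeout \\ 5_000) do
    {:ok, GenServer.call(server, request, timeout)}
  catch
    :exit, reason -> {:error, reason}
  end
end

defmodule Stuck do
  # Hypothetical stand-in for a producer that never finishes draining.
  use GenServer

  def init(arg), do: {:ok, arg}

  # Never replies, so the caller blocks until this process dies.
  def handle_call(:drain_after_revoke, _from, state), do: {:noreply, state}
end

# Started without a link so that killing it does not take the caller down.
{:ok, pid} = GenServer.start(Stuck, nil)

# Simulate the shutdown: kill the server while the call is in flight.
spawn(fn ->
  Process.sleep(100)
  Process.exit(pid, :kill)
end)

result = SafeCall.call(pid, :drain_after_revoke, :infinity)
IO.inspect(result, label: "call result")
```

Note that catching the exit only suppresses the crash report in the caller. The called process was still killed, so any work it had not finished is not recovered by this.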

Thoughts?

slashmili commented 1 year ago

When a new consumer joins the consumer group, Kafka asks all the consumers to stop what they are doing and join the new generation (hence the drain_after_revoke call).

At the same time, your Erlang node is trying to stop all the processes, since the deployment triggers that.

~I think what is happening here is that your Broadway consumers are not finishing the job on time and the BEAM is killing them forcefully.~ ~Edit1: What I wrote here doesn't make sense, since Broadway consumers are independent of the producer process.~ Edit2: What I said originally makes sense: the producer waits for all the handover jobs to finish before returning from handle_call.

I'd suggest measuring the consumption time for your messages using telemetry. If it is low (~20-30 milliseconds), it could be that the dispatcher is overloaded.
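As a rough sketch of such a measurement, the handler below logs the duration reported by Broadway's per-message processor event. It assumes the `[:broadway, :processor, :message, :stop]` event with a `duration` measurement in native time units, as described in the Broadway telemetry documentation, and it assumes the `:telemetry` dependency is available (it ships with Broadway). `ProcessingTimeLogger` and the handler id are hypothetical names.

```elixir
defmodule ProcessingTimeLogger do
  require Logger

  # Assumed event name, taken from Broadway's telemetry documentation.
  @event [:broadway, :processor, :message, :stop]

  # Call once at application start, before the pipeline processes messages.
  def attach do
    :telemetry.attach("log-processing-time", @event, &__MODULE__.handle_event/4, nil)
  end

  # The duration arrives in native time units, so convert it before logging.
  def handle_event(@event, %{duration: duration}, metadata, _config) do
    ms = System.convert_time_unit(duration, :native, :millisecond)
    Logger.info("processed message in #{ms}ms (#{inspect(metadata[:name])})")
  end
end
```

This only covers the time spent in the processor stage for each message; batchers emit their own events and would need a separate handler.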

yordis commented 1 year ago

Maybe related to?

josevalim commented 1 year ago

We have pushed several improvements here, including a newly published version. Please let us know if the error persists!