bbalser / elsa

Apache License 2.0
77 stars 10 forks source link

Termination due to timeout during high throughput #71

Open m0rt3nlund opened 4 years ago

m0rt3nlund commented 4 years ago

Hi!

I posted this question in the repo for off_broadway_kafka also, but i think this is the most correct place for it:

I have a problem when there is high throughput. I get this error message and I cannot figure out why it times out. Any suggestions on how I can continue my investigation. Any other info I can provide?

2020-03-05 - 21:36:40.296: [error] GenServer #PID<0.2926.0> terminating ** (stop) exited in: GenServer.call({:via, Elsa.Registry, {:elsa_registry_per_partition, Elsa.Group.Manager}}, {:process_assignments, "nonode@nohost/<0.2926.0>-3c16cb43-aacf-4924-90ac-1ed0bab18fcc", 872, [{:brod_received_assignment, "test_plant_values", 0, 10688}, {:brod_received_assignment, "test_plant_values", 1, 11088}, {:brod_received_assignment, "test_plant_values", 2, 10966}, {:brod_received_assignment, "test_plant_values", 3, 10627}, {:brod_received_assignment, "test_plant_values", 4, 10973}, {:brod_received_assignment, "test_plant_values", 5, 11055}, {:brod_received_assignment, "test_plant_values", 6, 10864}, {:brod_received_assignment, "test_plant_values", 7, 11070}, {:brod_received_assignment, "test_plant_values", 8, 10855}, {:brod_received_assignment, "test_plant_values", 9, 10835}, {:brod_received_assignment, "test_plant_values", 10, 10985}, {:brod_received_assignment, "test_plant_values", 11, 10932}, {:brod_received_assignment, "test_plant_values", 12, 10957}, {:brod_received_assignment, "test_plant_values", 13, 11001}, {:brod_received_assignment, "test_plant_values", 14, 11019}, {:brod_received_assignment, "test_plant_values", 15, 10988}, {:brod_received_assignment, "test_plant_values", 16, 10948}, {:brod_received_assignment, "test_plant_values", 17, 10784}, {:brod_received_assignment, "test_plant_values", 18, 10868}, {:brod_received_assignment, "test_plant_values", 19, 10780}, {:brod_received_assignment, "test_plant_values", 20, 10922}, {:brod_received_assignment, "test_plant_values", 21, 10972}, {:brod_received_assignment, "test_plant_values", 22, 10885}, {:brod_received_assignment, "test_plant_values", 23, 10986}, {:brod_received_assignment, "test_plant_values", 24, 11006}, {:brod_received_assignment, "test_plant_values", 25, 10999}, {:brod_received_assignment, "test_plant_values", 26, 10952}, {:brod_received_assignment, "test_plant_values", 27, 10909}, {:brod_received_assignment, "test_plant_values", 28, 10850}, {:brod_received_assignment, "test_plant_values", 29, 11028}, {:brod_received_assignment, "test_plant_values", 30, 11081}, {:brod_received_assignment, "test_plant_values", 31, 10812}, {:brod_received_assignment, "test_plant_values", 32, 10819}, {:brod_received_assignment, "test_plant_values", 33, 10848}, {:brod_received_assignment, "test_plant_values", 34, 11004}, {:brod_received_assignment, "test_plant_values", 35, 10969}, {:brod_received_assignment, "test_plant_values", 36, 10847}, {:brod_received_assignment, "test_plant_values", 37, 10973}, {:brod_received_assignment, "test_plant_values", 38, 10873}, {:brod_received_assignment, "test_plant_values", 39, 10965}, {:brod_received_assignment, "test_plant_values", 40, 10954}, {:brod_received_assignment, "test_plant_values", 41, 11085}, {:brod_received_assignment, "test_plant_values", 42, ...}, {:brod_received_assignment, "test_plant_values", ...}, {:brod_received_assignment, ...}, {...}, ...]}, 5000) ** (EXIT) time out (elixir) lib/gen_server.ex:1009: GenServer.call/3 (brod) src/brod_group_coordinator.erl:624: :brod_group_coordinator.sync_group/1 (brod) src/brod_group_coordinator.erl:530: :brod_group_coordinator.do_stabilize/3 (brod) src/brod_group_coordinator.erl:531: :brod_group_coordinator.do_stabilize/3 (brod) src/brod_group_coordinator.erl:343: :brod_group_coordinator.handle_info/2 (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4 (stdlib) gen_server.erl:711: :gen_server.handle_msg/6 (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3 Last message: {:lo_cmd_stabilize, 0, :undefined} State: {:state, :per_partition, "test_datasystem", "", :undefined, 0, ["test_plant_values"], :undefined, :undefined, [], false, {:via, Elsa.Registry, {:elsa_registry_per_partition, Elsa.Group.Manager}}, Elsa.Group.Manager, [], :undefined, :roundrobin_v2, 30, 5, 5, 1, :undefined, :commit_to_kafka_v2, 5, :roundrobin_v2}

m0rt3nlund commented 4 years ago

Currently running version 0.11.1

m0rt3nlund commented 4 years ago

Upgraded to latest, 0.12.1 and now the problem changed. Appears to be a problem with off_broadway_kafka now.

2020-03-06 - 10:32:18.285: [error] ** (exit) exited in: GenServer.call(#PID<0.7699.0>, {:ack, %{generation_id: 896, partition: 144, pid: #PID<0.7699.0>, topic: "test_plant_values"}, [14227, 14228, 14229, 14230, 14231, 14232, 14233, 14234, 14235, 14236, 14237, 14238, 14239, 14240, 14241, 14242, 14243, 14244, 14245, 14246]}, 5000)
    ** (EXIT) time out
    (elixir) lib/gen_server.ex:1009: GenServer.call/3
    (elixir) lib/enum.ex:789: anonymous fn/3 in Enum.each/2
    (stdlib) maps.erl:232: :maps.fold_1/3
    (elixir) lib/enum.ex:1964: Enum.each/2
    (broadway) lib/broadway/consumer.ex:64: Broadway.Consumer.handle_events/3
    (gen_stage) lib/gen_stage.ex:2395: GenStage.consumer_dispatch/6
    (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:711: :gen_server.handle_msg/6
jeffgrunewald commented 4 years ago

Sorry for the delay on getting back @m0rt3nlund but we're taking a look at and trying to knock out all of the open issues with Elsa at the moment. Is this still and issue and if it is, A) what is the current version of Elsa you're using (there have been a few reasonably large changes since early Mar) and B) if yes and you've validated it on a newer version, could you give us some more detail about you environment and app configuration for context?