FCM Connection Crash - Githubissues

msmykowski commented 3 years ago

Environment

Elixir & Erlang/OTP versions (elixir --version): 1.11.2
Operating system: Alpine
Kadabra: 0.4.4
Pigeon: 1.6

Current behavior

We send a high volume of alerts in the magnitude of millions in single bursts. We are seeing very frequent FCM connection crashes resulting in dropped messages.

Process #PID<0.17002.146> terminating
** (exit) exited in: GenServer.call(nil, {:request, [%Kadabra.Request{body: "{__REQUEST BODY__}, 5000)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.11.2) lib/gen_server.ex:1017: GenServer.call/3
    (pigeon 1.6.0) lib/pigeon/connection.ex:145: Pigeon.Connection.send_push/3
    (elixir 1.11.2) lib/enum.ex:2181: Enum."-reduce/3-lists^foldl/2-0-"/3
    (pigeon 1.6.0) lib/pigeon/connection.ex:97: Pigeon.Connection.handle_events/3
    (gen_stage 1.0.0) lib/gen_stage.ex:2395: GenStage.consumer_dispatch/6
    (stdlib 3.13.2) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13.2) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13.2) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

We are starting our connection in a bit of a dated way within a Genserver.

{:ok, conn} = Pigeon.FCM.start_connection(key: Config.get([:fcm, :fcm_key]))

Expected behavior

We would expect the connection to be reestablished before attempting to send the request.

hpopp commented 3 years ago

So, this is interesting. http/2 connections definitely can't handle >1 million requests over their lifetime, but the GenStage implementation should properly queue while it sets up a new connection. It's been a while since I tested it, and my production workloads generally don't hit that limit very often.

I'll dig a little bit and see what's up. If you're sending that kind of volume though, I'd recommend starting multiple workers and round-robining between them. I may consider adding a pool_size option in v2 to make this easier, but most workloads typically run just fine on one connection.

msmykowski commented 3 years ago

@hpopp If it helps we identified the race condition happening here -- https://github.com/codedge-llc/pigeon/blob/master/lib/pigeon/connection.ex#L93. Since Genstage.cancel is async, at a high enough throughput this guarantees messages will be processed with a nil socket.

Our quick patch was to reopen a new socket when the old one is closed rather than setting the socket to nil and shutting down the consumer process. It has fixed the issue posted above, but I believe there is still a condition where push notifications can be pushed into a kadabra connection just prior to it being closed, resulting in on_response not being called.

codedge-llc / pigeon

FCM Connection Crash #186

Environment

Current behavior

Expected behavior