codedge-llc / pigeon

iOS and Android push notifications for Elixir
https://hex.pm/packages/pigeon
MIT License
645 stars 145 forks source link

Pigeon APNS crashes the whole app on a connectivity timeout #203

Closed nicnilov closed 11 months ago

nicnilov commented 3 years ago

Environment

  def start(_type, _args) do
    children = [
      YourApp.APNS
    ]
    opts = [strategy: :one_for_one, name: YourApp.Supervisor]
    Supervisor.start_link(children, opts)
  end

Current behavior

Here is a log. What happens is Pigeon is trying to reestablish the apns connection unsuccessfully and errors out, bringing the whole application down with it.

13:04:36.222 [info]  Sent 200 in 8ms
13:13:09.260 [error]  GenServer #PID<0.989.0> terminating
** (CaseClauseError) no case clause matching: {:stop, :timeout}
    (pigeon 2.0.0-rc.0) lib/pigeon/dispatcher_worker.ex:33: Pigeon.DispatcherWorker.handle_info/2
    (stdlib 3.15.1) gen_server.erl:695: :gen_server.try_dispatch/4
    (stdlib 3.15.1) gen_server.erl:771: :gen_server.handle_msg/6
    (stdlib 3.15.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {:closed, #PID<0.990.0>}
State: %{adapter: Pigeon.APNS, state: %Pigeon.APNS{config: %Pigeon.APNS.Config{cert: <<48, 130, 6, 68, 48, 130, 5, 44, 160, 3, 2, 1, 2, 2, 8, 29, 143, 123, 34, 49, 202, 38, 88, 48, 13, 6, 9, 42, 134, 72, 134, 247, 13, 1, 1, 11, 5, 0, 48, 129, 150, 49, 11, 48, 9, 6, ...>>, key: {:RSAPrivateKey, <<48, 130, 4, 165, 2, 1, 0, 2, 130, 1, 1, 0, 196, 16, 166, 157, 130, 84, 49, 137, 149, 83, 102, 96, 132, 189, 237, 35, 131, 32, 45, 70, 228, 73, 197, 20, 99, 107, 83, 227, 86, 93, 3, ...>>}, ping_period: 600000, port: 443, uri: "api.development.push.apple.com"}, queue: %{}, socket: #PID<0.990.0>, stream_id: 3}}
13:13:09.320 [info]  Application myapp exited: shutdown

This happens locally when the application is running while the apns is not reachable because of the connectivity conditions. This is not a concern for local, however could happen on production. In a scenario when the application cannot reach the apns for about 10 minutes, it will be shut down altogether, which is not ideal.

In more detail, the following seems to be happening. Please note this is cursory and speculative, so may contain mistakes.

The adapter detects the disconnection and attempts to reconnect, L:206:

https://github.com/codedge-llc/pigeon/blob/5638e192bff639b5bb675388f661fb307a41be27/lib/pigeon/apns.ex#L205-L214

connect_socket/2 tries several times and eventually returns {:error, :timeout}:

https://github.com/codedge-llc/pigeon/blob/5638e192bff639b5bb675388f661fb307a41be27/lib/pigeon/apns.ex#L226

This value gets transformed to {:stop, :timeout} at L:212:

https://github.com/codedge-llc/pigeon/blob/5638e192bff639b5bb675388f661fb307a41be27/lib/pigeon/apns.ex#L211-L212

This bubbles up to the DispatcherWorker where the logged error happens because no clause is present that would handle a 2-element {:stop, ...} tuple:

https://github.com/codedge-llc/pigeon/blob/5638e192bff639b5bb675388f661fb307a41be27/lib/pigeon/dispatcher_worker.ex#L32-L40

This is where it gets out of hand. Since the worker has terminated, it is getting restarted by the :one_for_one strategy of the Dispatcher:

https://github.com/codedge-llc/pigeon/blob/5638e192bff639b5bb675388f661fb307a41be27/lib/pigeon/dispatcher.ex#L114-L126

This eventually invokes the init/1 function of the APNS adapter. This function also tries to establish the connection at L:171:

https://github.com/codedge-llc/pigeon/blob/5638e192bff639b5bb675388f661fb307a41be27/lib/pigeon/apns.ex#L165-L179

This also fails by a timeout, but this time, since it's a process start function that returns an error, the whole supervisor gets shut down, and with it, the whole app. This is a documented behavior:

If the start function of any of the child processes fails or returns an error tuple or an erroneous value, the supervisor first terminates with reason :shutdown all the child processes that have already been started, and then terminates itself and returns {:error, {:shutdown, reason}}.

Expected behavior

A graceful handling of connectivity errors. Maybe a configurable option to reconnect indefinitely. Maybe a change of the internal state, such that a call to APNS.push would return an error when not currently connected.

hpopp commented 3 years ago

Thanks for the heads up on this. I'll work out a better solution and put out an rc.1

Funny enough, early on in pigeon's development it used to reconnect indefinitely instead of crashing the worker. It worked great except on occasion when APNS servers would be down, which would lead to a sustained 100% CPU spike for the duration of the outage 😅

morgz commented 3 years ago

thanks for documenting this @nicnilov - I've also run into this but in my case, it seems to stop mix phx.server in it's tracks if I don't have an internet connection. Found this while trying to develop on a train!

** (Mix) Could not start application conversational: Conversational.Application.start(:normal, []) returned an error: shutdown: failed to start child: Conversational.APNS
    ** (EXIT) shutdown: failed to start child: 1
        ** (EXIT) an exception was raised:
            ** (CaseClauseError) no case clause matching: {:stop, :timeout}
                (pigeon 2.0.0-rc.0) lib/pigeon/dispatcher_worker.ex:12: Pigeon.DispatcherWorker.init/1
                (stdlib 3.16.1) gen_server.erl:423: :gen_server.init_it/2
                (stdlib 3.16.1) gen_server.erl:390: :gen_server.init_it/6
                (stdlib 3.16.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
nicnilov commented 2 years ago

@hpopp Just a heads up, this did bring two production instances down. As this issue did not receive attention for about 6 months already, I got to ask, should a fix be expected, realistically, in a near future?

mjrusso commented 1 year ago

In conjunction with #246 (thanks @mdlkxzmcp), the following code seems to prevent the various connectivity issues from taking down the whole app:

# Pigeon dispatchers will crash on startup if there is a connectivity issue
# with their push provider. The implication here is that the application won't
# start if (for example) APNS is down when the app attempts to boot.
#
# Additionally, Pigeon dispatchers will crash after startup if connectivity
# with the push provider is interrupted. Unless handled properly, these crashes
# can bring down the entire application.
#
# These behaviours are problematic for application stability. These supervisors
# work around these issues.
#
# See https://github.com/codedge-llc/pigeon/issues/203 for more details and
# discussion.
defmodule Notifications.Pigeon do
  require Logger

  defmodule ParentSupervisor do
    use Supervisor

    def start_link(init_arg) do
      Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
    end

    @impl true
    def init(_init_arg) do
      # This implementation enables the referenced dynamic supervisor to start
      # its children immediately after the dynamic supervisor boots. It
      # accomplishes this by starting a `DynamicSupervisor` with a `Task` under
      # a `rest_for_one` supervisor. For more details on this approach, see:
      #
      # - https://elixirforum.com/t/understanding-dynamicsupervisor-no-initial-children/14938/2
      # - https://github.com/slashdotdash/til/blob/master/elixir/dynamic-supervisor-start-children.md

      children = [
        Notifications.Pigeon.ChildSupervisor,
        {Task, &Notifications.Pigeon.ChildSupervisor.start_children/0}
      ]

      opts = [strategy: :rest_for_one]

      Supervisor.init(children, opts)
    end
  end

  defmodule ChildSupervisor do
    # This implementation handles the case where a child fails to start:
    # instead of crashing immediately, periodically retry to start the
    # child (and don't give up until the child successfully starts).

    use DynamicSupervisor

    def start_link(init_arg) do
      DynamicSupervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
    end

    defp start_child(child) do
      case DynamicSupervisor.start_child(__MODULE__, child) do
        {:ok, _} ->
          :ok

        {:error, _} ->
          Logger.error("Unable to start child #{child}; will retry")
          Process.sleep(1_000)
          start_child(child)
      end
    end

    def start_children do
      children = [
        Notifications.Push.APNS.Prod,
        Notifications.Push.APNS.Dev
      ]

      for child <- children do
        spawn(fn -> start_child(child) end)
      end
    end

    @impl true
    def init(_init_arg) do
      # Set `:max_restarts` to a high value, to prevent the supervisor from
      # crashing if the push provider goes down at any point after the child
      # has started. (Pigeon will crash on a connectivity issue with a
      # provider; it's important that it doesn't take down the rest of the
      # application while it's trying to start back up.)
      opts = [
        strategy: :one_for_one,
        max_restarts: 1_000_000_000,
        max_seconds: 1
      ]

      DynamicSupervisor.init(opts)
    end
  end
end

In my preliminary testing this seems robust. (Please let me know if there's a better way... I'm new around these parts 😄 ).

For posterity:

Processes that talk to the outside world should have no expectations that the downstream service is available when they start. Otherwise you’ll end with a system that can’t recover from transient failures and probably can’t even boot reliably.

(Source)

nicnilov commented 4 months ago

The @mdlkxzmcp's commit https://github.com/codedge-llc/pigeon/commit/d1305ecd3969fd683c020eb1eab04117b4bac936 fixes part of it, but shouldn't the case block at L38 https://github.com/codedge-llc/pigeon/blob/05fe8554db182c87b64775ffe767a357312c41ab/lib/pigeon/dispatcher_worker.ex#L37-L45 also handle the 2-element tuples, e.g.:

      {:stop, reason} ->
        {:stop, reason, %{adapter: adapter, state: state}}

Otherwise the exception is not about a (e.g.) timeout, but about a missing case clause:

** (CaseClauseError) no case clause matching: {:stop, :timeout}
    (pigeon 2.0.0-rc.2) lib/pigeon/dispatcher_worker.ex:38: Pigeon.DispatcherWorker.handle_info/2
    (stdlib 5.2) gen_server.erl:1095: :gen_server.try_handle_info/3
    (stdlib 5.2) gen_server.erl:1183: :gen_server.handle_msg/6
    (stdlib 5.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3

This still works with the code provided by @mjrusso, but reports an incorrect exception reason.

mdlkxzmcp commented 4 months ago

@nicnilov just made a PR to fix that :)