Closed nicnilov closed 11 months ago
Thanks for the heads up on this. I'll work out a better solution and put out an rc.1
Funny enough, early on in pigeon's development it used to reconnect indefinitely instead of crashing the worker. It worked great except on occasion when APNS servers would be down, which would lead to a sustained 100% CPU spike for the duration of the outage 😅
thanks for documenting this @nicnilov - I've also run into this but in my case, it seems to stop mix phx.server
in it's tracks if I don't have an internet connection. Found this while trying to develop on a train!
** (Mix) Could not start application conversational: Conversational.Application.start(:normal, []) returned an error: shutdown: failed to start child: Conversational.APNS
** (EXIT) shutdown: failed to start child: 1
** (EXIT) an exception was raised:
** (CaseClauseError) no case clause matching: {:stop, :timeout}
(pigeon 2.0.0-rc.0) lib/pigeon/dispatcher_worker.ex:12: Pigeon.DispatcherWorker.init/1
(stdlib 3.16.1) gen_server.erl:423: :gen_server.init_it/2
(stdlib 3.16.1) gen_server.erl:390: :gen_server.init_it/6
(stdlib 3.16.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
@hpopp Just a heads up, this did bring two production instances down. As this issue did not receive attention for about 6 months already, I got to ask, should a fix be expected, realistically, in a near future?
In conjunction with #246 (thanks @mdlkxzmcp), the following code seems to prevent the various connectivity issues from taking down the whole app:
# Pigeon dispatchers will crash on startup if there is a connectivity issue
# with their push provider. The implication here is that the application won't
# start if (for example) APNS is down when the app attempts to boot.
#
# Additionally, Pigeon dispatchers will crash after startup if connectivity
# with the push provider is interrupted. Unless handled properly, these crashes
# can bring down the entire application.
#
# These behaviours are problematic for application stability. These supervisors
# work around these issues.
#
# See https://github.com/codedge-llc/pigeon/issues/203 for more details and
# discussion.
defmodule Notifications.Pigeon do
require Logger
defmodule ParentSupervisor do
use Supervisor
def start_link(init_arg) do
Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
end
@impl true
def init(_init_arg) do
# This implementation enables the referenced dynamic supervisor to start
# its children immediately after the dynamic supervisor boots. It
# accomplishes this by starting a `DynamicSupervisor` with a `Task` under
# a `rest_for_one` supervisor. For more details on this approach, see:
#
# - https://elixirforum.com/t/understanding-dynamicsupervisor-no-initial-children/14938/2
# - https://github.com/slashdotdash/til/blob/master/elixir/dynamic-supervisor-start-children.md
children = [
Notifications.Pigeon.ChildSupervisor,
{Task, &Notifications.Pigeon.ChildSupervisor.start_children/0}
]
opts = [strategy: :rest_for_one]
Supervisor.init(children, opts)
end
end
defmodule ChildSupervisor do
# This implementation handles the case where a child fails to start:
# instead of crashing immediately, periodically retry to start the
# child (and don't give up until the child successfully starts).
use DynamicSupervisor
def start_link(init_arg) do
DynamicSupervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
end
defp start_child(child) do
case DynamicSupervisor.start_child(__MODULE__, child) do
{:ok, _} ->
:ok
{:error, _} ->
Logger.error("Unable to start child #{child}; will retry")
Process.sleep(1_000)
start_child(child)
end
end
def start_children do
children = [
Notifications.Push.APNS.Prod,
Notifications.Push.APNS.Dev
]
for child <- children do
spawn(fn -> start_child(child) end)
end
end
@impl true
def init(_init_arg) do
# Set `:max_restarts` to a high value, to prevent the supervisor from
# crashing if the push provider goes down at any point after the child
# has started. (Pigeon will crash on a connectivity issue with a
# provider; it's important that it doesn't take down the rest of the
# application while it's trying to start back up.)
opts = [
strategy: :one_for_one,
max_restarts: 1_000_000_000,
max_seconds: 1
]
DynamicSupervisor.init(opts)
end
end
end
In my preliminary testing this seems robust. (Please let me know if there's a better way... I'm new around these parts 😄 ).
For posterity:
Processes that talk to the outside world should have no expectations that the downstream service is available when they start. Otherwise you’ll end with a system that can’t recover from transient failures and probably can’t even boot reliably.
(Source)
The @mdlkxzmcp's commit https://github.com/codedge-llc/pigeon/commit/d1305ecd3969fd683c020eb1eab04117b4bac936 fixes part of it, but shouldn't the case block at L38 https://github.com/codedge-llc/pigeon/blob/05fe8554db182c87b64775ffe767a357312c41ab/lib/pigeon/dispatcher_worker.ex#L37-L45 also handle the 2-element tuples, e.g.:
{:stop, reason} ->
{:stop, reason, %{adapter: adapter, state: state}}
Otherwise the exception is not about a (e.g.) timeout, but about a missing case clause:
** (CaseClauseError) no case clause matching: {:stop, :timeout}
(pigeon 2.0.0-rc.2) lib/pigeon/dispatcher_worker.ex:38: Pigeon.DispatcherWorker.handle_info/2
(stdlib 5.2) gen_server.erl:1095: :gen_server.try_handle_info/3
(stdlib 5.2) gen_server.erl:1183: :gen_server.handle_msg/6
(stdlib 5.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
This still works with the code provided by @mjrusso, but reports an incorrect exception reason.
@nicnilov just made a PR to fix that :)
Environment
elixir --version
): Elixir 1.12.2 (compiled with Erlang/OTP 24)2.0.0-rc.0
Current behavior
Here is a log. What happens is Pigeon is trying to reestablish the apns connection unsuccessfully and errors out, bringing the whole application down with it.
This happens locally when the application is running while the apns is not reachable because of the connectivity conditions. This is not a concern for local, however could happen on production. In a scenario when the application cannot reach the apns for about 10 minutes, it will be shut down altogether, which is not ideal.
In more detail, the following seems to be happening. Please note this is cursory and speculative, so may contain mistakes.
The adapter detects the disconnection and attempts to reconnect, L:206:
https://github.com/codedge-llc/pigeon/blob/5638e192bff639b5bb675388f661fb307a41be27/lib/pigeon/apns.ex#L205-L214
connect_socket/2
tries several times and eventually returns{:error, :timeout}
:https://github.com/codedge-llc/pigeon/blob/5638e192bff639b5bb675388f661fb307a41be27/lib/pigeon/apns.ex#L226
This value gets transformed to
{:stop, :timeout}
at L:212:https://github.com/codedge-llc/pigeon/blob/5638e192bff639b5bb675388f661fb307a41be27/lib/pigeon/apns.ex#L211-L212
This bubbles up to the
DispatcherWorker
where the logged error happens because no clause is present that would handle a 2-element{:stop, ...}
tuple:https://github.com/codedge-llc/pigeon/blob/5638e192bff639b5bb675388f661fb307a41be27/lib/pigeon/dispatcher_worker.ex#L32-L40
This is where it gets out of hand. Since the worker has terminated, it is getting restarted by the
:one_for_one
strategy of theDispatcher
:https://github.com/codedge-llc/pigeon/blob/5638e192bff639b5bb675388f661fb307a41be27/lib/pigeon/dispatcher.ex#L114-L126
This eventually invokes the
init/1
function of theAPNS
adapter. This function also tries to establish the connection at L:171:https://github.com/codedge-llc/pigeon/blob/5638e192bff639b5bb675388f661fb307a41be27/lib/pigeon/apns.ex#L165-L179
This also fails by a timeout, but this time, since it's a process start function that returns an error, the whole supervisor gets shut down, and with it, the whole app. This is a documented behavior:
Expected behavior
A graceful handling of connectivity errors. Maybe a configurable option to reconnect indefinitely. Maybe a change of the internal state, such that a call to
APNS.push
would return an error when not currently connected.