Open maennchen opened 6 years ago
I am having the same issue.
This may be related to https://github.com/phoenixframework/phoenix/issues/3179
Chris' reply to this is helpful but not the full story, since Absinthe.Phoenix right now is written to expect you to do use Absinthe.Phoenix.Endpoint
in your endpoint module. The code there uses the Endpoint to lookup various pieces of useful information. If we're going to decouple this from the Endpoint we'll need a new module.
In the meantime I highly suggest implementing "readiness" checks alongside your heath checks, where you don't route traffic to your node until it's fully booted.
@benwilson512 are there any recent recommendations on this?
Since I upgraded to Phoenix 1.5, I see the many messages like this in my logs when my server shuts down:
GenServer #PID<0.4431.0> terminating
** (RuntimeError) Pubsub not configured!
Subscriptions require a configured pubsub module.
(absinthe 1.5.1) lib/absinthe/phase/subscription/subscribe_self.ex:109: Absinthe.Phase.Subscription.SubscribeSelf.ensure_pubsub!/1
(absinthe 1.5.1) lib/absinthe/phase/subscription/subscribe_self.ex:20: Absinthe.Phase.Subscription.SubscribeSelf.do_subscription/3
(absinthe 1.5.1) lib/absinthe/pipeline.ex:368: Absinthe.Pipeline.run_phase/3
(absinthe_phoenix 2.0.0) lib/absinthe/phoenix/channel.ex:123: Absinthe.Phoenix.Channel.run/4
(absinthe_phoenix 2.0.0) lib/absinthe/phoenix/channel.ex:91: Absinthe.Phoenix.Channel.run_doc/4
(absinthe_phoenix 2.0.0) lib/absinthe/phoenix/channel.ex:57: Absinthe.Phoenix.Channel.handle_in/3
(phoenix 1.5.3) lib/phoenix/channel/server.ex:316: Phoenix.Channel.Server.handle_info/2
(stdlib 3.12.1) gen_server.erl:637: :gen_server.try_dispatch/4
The error seems to happen for my already-connected clients. Not new clients connecting while the app is being shut down.
The client is connected to the socket, and the socket itself is fine. The highlighted line seems to be when Absinthe.Subscription
is shut down. The client, still connected to the same socket, then tries to rejoin over and over again, but Absinthe.Subscription
isn't running so it gets rejected every time.
Only when Phoenix.Endpoint closes the WebSocket connection will the client try to reconnect to a new socket and there they get to a healthy server.
Any guidance would be much appreciated. Thanks for Absinthe! :)
@sebastianseilund What order do you have things in your supervision tree? Have you considered using something like https://github.com/derekkraan/ranch_connection_drainer ? Ideally you don't keep routing traffic to a node that is shutting down.
Order:
{Phoenix.PubSub, pubsub_config()},
UserflowWeb.Endpoint,
{Absinthe.Subscription, UserflowWeb.Endpoint},
If I put Absinthe.Subscription
before UserflowWeb.Endpoint
it won't start.
I'm not routing any new traffic to my nodes once shutdown starts. As described above, the error happens in the sockets that are already connected (via WebSocket). This is the order of things:
preStop
sleep 5
command)Pubsub not configured!
I haven't looked at https://github.com/derekkraan/ranch_connection_drainer before. Would you put that after Absinthe.Subscription
?
We are seeing the exact same error that @sebastianseilund posted above, for very similar reasons.
@benwilson512 I looked into https://github.com/derekkraan/ranch_connection_drainer. It waits for connections to end before letting Endpoint shut itself down. So I don't think it'll fix the issue here since the WebSocket connections stay open.
Fair. I think the best option here would be to figure out a way to have the supervision tree look like this:
[
Pubsub,
Absinthe.Subscription,
Endpoint
]
Then the endpoint would shutdown before the pubsub. Let me see what I can do.
I think that would be amazing 🙏
@benwilson512 Is there any movement on this? I am seeing ~150 of these errors daily on my production.
@lstrzebinczyk the best solution here is to use a load balancer that routes traffic away from your node on deploy. Even if I split out the pubsub from the endpoint, it will simply result in different issues for your clients. If you aren't routing traffic away from your node on deploy then when they are disconnected they may try to reconnect to the same node which will now fail.
I will try to find time to make the proposed improvements here this week, but it is merely going to help with error messages, it isn't going to solve the root issue. The root issue is that you are not disconnecting traffic properly from a node that is going down.
If you aren't routing traffic away from your node on deploy then when they are disconnected they may try to reconnect to the same node which will now fail.
This doesn't happen. Kubernetes does route new traffic away from pods before shutting down the Elixir container. The existing network connections are not terminated by anything in Kubernetes though.
When the client reconnects, they'll be sent to one of the new pods (or one of the old ones that haven't begun shutdown yet).
The root issue is that you are not disconnecting traffic properly from a node that is going down.
I don't think you'd want to terminate active connections outside of Elixir's control. If you have a regular slow HTTP request, you don't want it to be terminated half-way. That could leave stuff in a bad state, or at least ends up in an unnecessary 5xx error to the client. Just let it drain.
When MyApp.Endpoint
is shut down, it will gracefully terminate the WebSocket connections. Only once this has happened, does it seem safe to shut down Absinthe.Subscription
.
That's fair, there are definitely shutdown patterns that make sense for Elixir to lead. OK, I'll prioritize working on that this week.
any update on this?
Any updates?
I'm encountering this as well. I'm curious if there is an update on the horizon similar to what @benwilson512 seemed to have in mind in https://github.com/absinthe-graphql/absinthe_phoenix/issues/46#issuecomment-658350919.
Experienced this today, and fixed it by adding a supervisor to my application.ex (might have missed it when setting up). Hope this helps.
def start(_type, _args) do
import Supervisor.Spec
children = [
AppWeb.Telemetry,
{Phoenix.PubSub, name: App.PubSub},
AppWeb.Endpoint,
supervisor(Absinthe.Subscription, [AppWeb.Endpoint]) -> Added this
]
end
@alvnrapada's fix didn't seem to work on my application when I was testing by doing a rolling restart of a deployment in k8s... so I made this experimental "hack PR" https://github.com/absinthe-graphql/absinthe_phoenix/pull/89 outlined by https://github.com/absinthe-graphql/absinthe_phoenix/issues/46#issuecomment-651148721. Prior to making the change to my kubernetes deployed app indicated by this PR, I too was receiving the "pubsub not configured" error whenever I was triggering a rolling restart in kubernetes as a test.
I tried adding tests to the PR but Phoenix.ChannelTest helper pathways seemed to expect an ets backed endpoint configuration, and I wasn't quite sure the best way of tackling this at the moment (open to suggestions).
Testing notes:
1) I tested the below on a demo app (nothing with any kind of traffic other than me trying to manually trigger subscriptions)
2) I manually tested a simple query which triggered a subscription and a simple edit mutation which triggered another subscription using the /graphiql
GUI in browser.
3) Doing the rolling restart of a kubernetes deployement with the below no longer seemed to trigger the pubsub not configured
error (at least as much as I could verify by live tailing the logs of a pod).
I will try and put together a demo application esp. if people think it might help in any way with testing (I'd also be curious as to results others might get), but here is the outline of what I did:
def start(_type, _args) do
children = [
# Start the Telemetry supervisor
ExampleWeb.Telemetry,
# Start the PubSub system
# Start the Endpoint (http/https)
{Phoenix.PubSub, name: Example.PubSub},
Example.Presence,
{Absinthe.Subscription, ExampleWeb.Pubsub.Endpoint},
ExampleWeb.Endpoint
]
...# continued
defmodule ExampleWeb.UserSocket do
use Phoenix.Socket
use Absinthe.Phoenix.Socket, schema: ExampleWeb.Schema, endpoint: ExampleWeb.Pubsub.Endpoint
## Channels
# channel "room:*", ExampleWeb.RoomChannel
# Socket params are passed from the client and can
# be used to verify and authenticate a user. After
# verification, you can put default assigns into
# the socket that will be set for all channels, ie
#
# {:ok, assign(socket, :user_id, verified_user_id)}
#
# To deny connection, return `:error`.
#
# See `Phoenix.Token` documentation for examples in
# performing token verification on connect.
@impl true
def connect(_params, socket, _connect_info) do
{:ok, socket}
end
# Socket id's are topics that allow you to identify all sockets for a given user:
#
# def id(socket), do: "user_socket:#{socket.assigns.user_id}"
#
# Would allow you to broadcast a "disconnect" event and terminate
# all active sockets and channels for a given user:
#
# ExampleWeb.Endpoint.broadcast("user_socket:#{user.id}", "disconnect", %{})
#
# Returning `nil` makes this socket anonymous.
@impl true
def id(_socket), do: nil
# omitted for brevity...
end
defmodule ExampleWeb.Endpoint do
use Phoenix.Endpoint, otp_app: :example
# use Absinthe.Phoenix.Endpoint
# The session will be stored in the cookie and signed,
# this means its contents can be read but not tampered with.
# Set :encryption_salt if you would also like to encrypt it.
@session_options [
store: :cookie,
key: "_Example_key",
signing_salt: "sgWFmDOb"
]
socket("/socket", ExampleWeb.UserSocket,
websocket: [check_origin: false],
longpoll: false
)
socket("/live", Phoenix.LiveView.Socket, websocket: [connect_info: [session: @session_options]])
#...omitted for brevity
end
defmodule ExampleWeb.Pubsub.Endpoint do
use Absinthe.Phoenix.Subscription.Pubsub, otp_app: :example
end
:ok =
Absinthe.Subscription.publish(ExampleWeb.Pubsub.Endpoint, %{users: user_list},
new_users: "new_users"
)
config :example, ExampleWeb.Pubsub.Endpoint, pubsub_server: Example.PubSub
In terms of my k8 deployment.yaml config, it might also be worth noting I had a readiness probe and a terminationGracePeriod seconds of 65 like this (abbreviated yaml):
terminationGracePeriodSeconds: 65
containers:
- name: example-web
# the same name and tag must be used when building the image
image: gcr.io/project_id_2200/example_web:prod
imagePullPolicy: Always
ports:
- containerPort: 80
readinessProbe:
httpGet:
path: /lbcheck
port: 5000
initialDelaySeconds: 60
we've just started seeing this issue in the last week or so, wondering if we can help at all?
Has anyone been able to reproduce this issue in a locally running application? That might simplify testing any proposed fixes here.
It happens in our dev enviroment every time it restarts while running under watchexec
, so I can test proposed solutions quite easily. @alvnrapada's fix did not work for me.
So here's what I'm trying to do to get the supervision tree to end with the endpoint:
The supervision tree setup:
children = [
{Phoenix.PubSub, [name: MyappWeb.PubSub, adapter: Phoenix.PubSub.PG2]},
{Absinthe.Subscription, MyappWeb.Endpoint},
MyappWeb.Endpoint,
]
The Endpoint for my app:
defmodule MyappWeb.Endpoint do
# Important that this goes before `use Phoenix.Endpoint`
use MyappWeb.OverridePhoenixEndpoint, otp_app: :myapp
use Phoenix.Endpoint, otp_app: :myapp
use Absinthe.Phoenix.Endpoint
# ...
end
The overriding for Phoenix.Endpoint.pubsub_server!/0
defmodule MyappWeb.OverridePhoenixEndpoint do
defmacro __using__([otp_app: app_name]) do
quote do
defp pubsub_server! do
server = if :ets.whereis(__MODULE__) != :undefined do
config(:pubsub_server)
else
Application.get_env(unquote(app_name), __MODULE__) |> Keyword.get(:pubsub_server)
end
server || raise ArgumentError, "no :pubsub_server configured for #{inspect(__MODULE__)}"
end
end
end
end
This seems to me to be working, but also seems like it's not the greatest idea in the world. Does anyone have any better suggestions or reasons why this is dangerous?
I'm seeing this error as well in my local environment. I'm attempting to capture process exit's to gracefully shutdown. When I issue a System.stop()
in IEX to initiate a shutdown, I see a number of Absinthe errors. That may be a way to help reproduce it locally.
I was also able to reproduce sending a sigterm signal to the process running mix phx.server
locally. My steps were:
1) Start phoenix mix phx.server
2) Subscribe to any graphql subscription
3) Kill the process, pkill -f "mix phx.server"
(MacOS)
For reference, sigterm is the signal sent by AWS ECS system to shutdown task containers. "Today, ECS always sends a SIGTERM" (https://aws.amazon.com/blogs/containers/graceful-shutdowns-with-ecs/)
I was also able to repro in our codebase by sending SIGTERM to the process while there's an active subscription.
Any news here? It seems that having Subscription and Endpoint causes race conditions and your Supervision tree ends up being unpredictable
This issue still occurs, though I only get one error every few days running in prod. Anyone found a "neat" fix?
there is (a bit dirty) fix that we use now in production: https://github.com/prosapient/absinthe/commit/39e5a927cc551afa215309bcf22afcd1a4ecc8ec
note: it handles SIGTERM (perfect for k8s) - just to understand when pubsub is not configured and when there is a progress with app shutdown.
@benwilson512 Any movement on this issue? We have been implementing more subscription-based functionality in our app and as a consequence we're seeing this PubSub failure a lot more now when users are connected to the socket during a restart. @bfad 's solution technically does work - but we get a lot of warning logs like
warning: this clause for pubsub_server!/0 cannot match because a previous clause at line 2 always matches
lib/my_app_web/endpoint.ex:3
With that solution, killing the app (with :init.stop()
in IEx) results in a log message like
18:48:34.553 [info] Shutting down 1 sockets in 1 rounds of 2000ms
before closing gracefully.
Sentry catches a lot of
RuntimeError
s sayingPubsub not configured! Subscriptions require a configured pubsub module.
.I could not find out in which cases it is failing, i just see around 600 errors over a timespan of 30 days. I've never experienced the error myself and it only occurs in a low percentage of all the subscriptions.
Full Exception
Versions
Setup
*
Absinthe.Subscription
added toApplication
Absinthe.Phoenix.Socket
in Socket ImplementationAbsinthe.Plug
(viaApolloTracing.Pipeline
)