ChilliCream / graphql-platform

Welcome to the home of the Hot Chocolate GraphQL server for .NET, the Strawberry Shake GraphQL client for .NET and Banana Cake Pop the awesome Monaco based GraphQL IDE.
https://chillicream.com
MIT License
5.15k stars 736 forks source link

Postgres Subscriptions - `listen` should only happen in graphql projects #6694

Open williamdenton opened 10 months ago

williamdenton commented 10 months ago

Is there an existing issue for this?

Product

Hot Chocolate

Describe the bug

We run background jobs as well as api servers. We have recently rolled out HotChocolate Subscriptions using the Postgres Backplane.

A curious scenario that occurred over the weekend was a backend server processed an event that was published into the postgres backplane. This however triggered the subscription system to begin listening to events. The listening system is designed to run a task forever receiving events from the postgres connection notification event. This means that the task processing the event in the backend never finished as it had somehow triggered the listen command.

I'm not sure how this happened yet. The timing of this event does correlate with process startup so there may be some race condition that meant the subscription wasn't initialised fully before a message was published and so was initialised by its first use and that somehow meant it ended up owning the long running task.

In a previous issue/pr i have contributed code that allows the backend servers to publish messages into the backplane for consumption by the API/graphql servers. This means that ITopicSender is registered in DI, but ITopicReceiver is not. PostgresChannel is used by with sending and receiving code paths and executes listen on all code paths. Maybe PostgresChannel needs to be split or allow to be configured so it doesn't listen when in a "SendOnly" configuration as listening to events in a publish only scenario is pointless.

Steps to reproduce

I'm sorry but i dont have a repro yet, but thought i would open an issue to start a discussion in case this sparked an idea of the cause for @PascalSenn

Relevant log output

No response

Additional Context?

No response

Version

13.8

williamdenton commented 10 months ago

maybe the simplest approach is to split this EnsureInitialized()

https://github.com/ChilliCream/graphql-platform/blob/3f7f16001a275117117f24d4bd4891cb0a26cc9e/src/HotChocolate/Core/src/Subscriptions.Postgres/PostgresChannel.cs#L32-L40

so when SendAsync() calls it it doesn't configure the listening connection, just the channel.

https://github.com/ChilliCream/graphql-platform/blob/3f7f16001a275117117f24d4bd4891cb0a26cc9e/src/HotChocolate/Core/src/Subscriptions.Postgres/PostgresChannel.cs#L59-L66

Still not clear to me why (i assume) Initialize on the resilient connection did not complete but i can see from the telemetry we have it was receiving events. I make this assumption as thats the only way i can see that my event processor didn't finish and that telemetry shows postgres events for 12 hours (till we killed it). If it was a startup race condition maybe there is something at play in the Initialize() method of the AsyncTaskDispatcher being reentrant, but the logic looks ok with its semaphoreslim

https://github.com/ChilliCream/graphql-platform/blob/3f7f16001a275117117f24d4bd4891cb0a26cc9e/src/HotChocolate/Core/src/Subscriptions.Postgres/AsyncTaskDispatcher.cs#L32-L56 ... https://github.com/ChilliCream/graphql-platform/blob/3f7f16001a275117117f24d4bd4891cb0a26cc9e/src/HotChocolate/Core/src/Subscriptions.Postgres/AsyncTaskDispatcher.cs#L69-L81