davidyaha / graphql-redis-subscriptions

A graphql subscriptions implementation using redis and apollo's graphql-subscriptions
MIT License
1.11k stars 125 forks source link

Race condition with simultaneous subscribe + quick unsubscribe #598

Closed mlaflamm closed 4 months ago

mlaflamm commented 1 year ago

We ran into a tricky race condition when RedisPubSub is handling multiple simultaneous subscribe calls on the same channel. We have a client that sometimes initiates multiple subscribe and quickly unsubscribes to them to keep only the last subscription active (thank you react hook!). This led us to discover the race condition I am reporting here.

In the problematic scenario, no messages are propagated to the remaining subscription and an error There is no subscription of id "*"is raised when the client finally unsubscribes from that remaining subscription. That error sometimes crashes our server. I suspect the crash is related to error handling (or lack thereof) in graphql-subscriptions or gqphql-ws modules as it happens only when the websocket is closed abruptly by the client (e.g. closing the browser).

After intensive debugging sessions, I figured out the exact flow causing the race condition. Here is my attempt to summarize it.

I'll submit a PR with a fix.

jstaro commented 1 year ago

We're seeing this sometimes as well. I've not been successful in reproducing it, so thanks for doing all the hard work.

Meemaw commented 1 year ago

Also seeing this. Its kinda bad that its impossible to fix this from the app perspective, as it happens deeply in the library which results in an uncaught exception and thus a server restart.

jalagar commented 1 year ago

Following here 🙏

pantajoe commented 1 year ago

@davidyaha Would be great if the PR by @mlaflamm could get merged into a new release 🙏🏽

trixobird commented 1 year ago

Could we have an update on this issue? @mlaflamm did you manage to craft a PR? If not do you want to share your thoughts on how would you fix it to see if I can pick it up?

jstaro commented 1 year ago

@trixobird There is an open PR (linked to in this issue). FWIW, we've been running this PR in production and it solved our issue.

chkp-talron commented 1 year ago

we also encountered this error in production. unlclear if this is the same use case, but would love to see this PR merged.

BrainEno commented 10 months ago

having the same problem

davidyaha commented 4 months ago

Fixed with #599