absinthe-graphql / absinthe

The GraphQL toolkit for Elixir
http://absinthe-graphql.org

Mutations are not triggered properly in clusters whose nodes have different numbers of CPUs, when the default :pool_size setting is used #1342

Closed hubertlepicki closed 1 month ago

hubertlepicki commented 1 month ago

I don't actually know whether this should be considered a bug in Absinthe, or something that needs to change in some other way - maybe it only needs to be documented, and maybe this issue will suffice for that purpose. I think it would be nice if this did not happen, but I'm not sure how to prevent it.

In short, I have an Elixir Phoenix application deployed on a Kubernetes GKE Autopilot cluster. I noticed that mutations were not always triggered reliably; it depended on which pod in the cluster received the web request handling the mutation.

On some pods, the mutation would execute just fine, and the subscription was triggered and propagated to JavaScript clients 100% reliably. On other pods, the mutation happened and succeeded, but no message was ever broadcast to clients indicating that a subscription had been triggered.

I identified what is different about the pods that do not work: they had a lower number of CPU cores assigned to them by the Autopilot cluster. The issue would also fix itself if all pods received a fair amount of traffic - the Kubernetes cluster would then scale them all up to the maximum number of CPUs and the issue was gone. This made it quite difficult to debug.

I think the code over here decides on a topic name, and one of the parameters taken into consideration is :pool_size:

https://github.com/absinthe-graphql/absinthe/blob/3d0823bd71c2ebb94357a5588c723e053de8c66a/lib/absinthe/subscription.ex#L214-L225

Now, the :pool_size defaults to System.schedulers() * 2:

https://github.com/absinthe-graphql/absinthe/blob/3d0823bd71c2ebb94357a5588c723e053de8c66a/lib/absinthe/subscription.ex#L60-L61

My cluster consists of pods that have 2 CPU cores and pods that have 4 CPU cores. The generated topic names for my mutation were "__absinthe__:proxy:6" vs "__absinthe__:proxy:2", depending on which pod the mutation arrived at from the load balancer.
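To make the mismatch concrete, here is a minimal sketch - not the actual Absinthe implementation; the ShardSketch module and the phash2-based hashing are just my illustration - of how a topic name can end up depending on the pool size:

```elixir
# Minimal sketch, NOT the actual Absinthe source: illustrates how a proxy
# topic can depend on :pool_size if the shard index is picked by hashing the
# published document modulo the pool size.
defmodule ShardSketch do
  def proxy_topic(shard), do: "__absinthe__:proxy:#{shard}"

  def topic_for(mutation_key, pool_size) do
    shard = :erlang.phash2(mutation_key, pool_size)
    proxy_topic(shard)
  end
end

# A 4-CPU pod (pool_size 8) and a 2-CPU pod (pool_size 4) can route the very
# same mutation to different proxy topics:
ShardSketch.topic_for({:item_updated, 42}, 8) |> IO.inspect()
ShardSketch.topic_for({:item_updated, 42}, 4) |> IO.inspect()
```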

I believe the problem is that the number of shards/proxies started on each pod in the cluster depends on the number of CPUs; the code over here starts a different number of proxies on each node, so some topics are simply not being listened on:

https://github.com/absinthe-graphql/absinthe/blob/3d0823bd71c2ebb94357a5588c723e053de8c66a/lib/absinthe/subscription/proxy_supervisor.ex#L15-L16
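Assuming each node subscribes to one proxy topic per shard index from 0 up to its own pool_size - 1, here is a rough illustration (not the actual proxy supervisor code) of why a shard picked on a bigger node can have no listener on a smaller one:

```elixir
# Rough illustration only: each node listens on shard indices
# 0..(its own pool_size - 1).
listened_on = fn pool_size -> MapSet.new(0..(pool_size - 1)) end

# 2-CPU pod: System.schedulers() * 2 == 4, so it listens on shards 0..3.
small_node = listened_on.(4)

# A 4-CPU pod (pool_size 8) may publish to shard 6 ("__absinthe__:proxy:6"),
# a topic the smaller pod never subscribes to:
MapSet.member?(small_node, 6)
#=> false
```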

The fix was specifying a fixed :pool_size, so instead of:

{Absinthe.Subscription, MyApp.Endpoint}

I did

{Absinthe.Subscription, pubsub: MyApp.Endpoint, pool_size: 8}
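For context, here is roughly what that looks like in my application's supervision tree (the surrounding children are placeholders for a typical Phoenix app; only the Absinthe.Subscription entry matters here):

```elixir
# Sketch of MyApp.Application.start/2 - illustrative, not the full module.
def start(_type, _args) do
  children = [
    MyApp.Repo,
    MyApp.Endpoint,
    # Fixed pool size so every node in the cluster uses the same topic count,
    # regardless of how many CPUs Autopilot assigns to the pod.
    {Absinthe.Subscription, pubsub: MyApp.Endpoint, pool_size: 8}
  ]

  Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
end
```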

Again, I don't know what can be done here, but maybe we should think of a mechanism to warn about this (if we can even detect it?).

benwilson512 commented 1 month ago

@hubertlepicki this is a really interesting problem. I think you are 100% right that if you have different-sized pods you'll get different topic counts, and this will just result in missed messages between nodes.

Specifying a fixed count is definitely one solution. M:N topic mapping definitely seems like a problem someone has solved before but nothing jumps immediately to mind.

hubertlepicki commented 1 month ago

@benwilson512 since I wrote the above, I have also confirmed this is happening on Gigalixir. I guess they run a similar setup on a Kubernetes cluster. One of my pods on Gigalixir reports 32 and the other 16 from System.schedulers().

Maybe we should add a note to the setup instructions that if you're doing a cloud deployment and you can't control the number of CPUs, you should specify a fixed :pool_size?
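If it helps for the docs, one possible shape for such a note is to make the value an explicit runtime setting; the env var name and default below are just an example I made up:

```elixir
# config/runtime.exs (example only): pin the pool size via an env var so
# every node uses the same value no matter how many CPUs it is given.
import Config

pool_size =
  "ABSINTHE_SUBSCRIPTION_POOL_SIZE"
  |> System.get_env("8")
  |> String.to_integer()

config :my_app, :absinthe_subscription_pool_size, pool_size
```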

hubertlepicki commented 1 month ago

@benwilson512 I have proposed the documentation changes here: https://github.com/absinthe-graphql/absinthe/pull/1343/files. Please merge them if you feel this is worth documenting.

hubertlepicki commented 1 month ago

@benwilson512 I think this can be closed now, which I'm going to do.