CookedApps opened this issue 2 years ago
+1 Am also out of things to try
The memory leak is a known issue, unfortunately (https://github.com/OptimalBits/redbird/issues/237), although the circumstances under which it happens are not clear. I suspect it's related to TLS termination at the Node/proxy level. In Arena this problem doesn't exist, I believe, because TLS termination happens at another level (HAProxy or Nginx).
The upcoming version (0.15, currently in `@preview`) introduces an alternative to the proxy: using a regular load balancer in front of all Colyseus nodes and specifying a public address for each node. You can see the preview (from https://github.com/colyseus/docs/pull/90) here: https://deploy-preview-90--colyseus-docs.netlify.app/colyseus/scalability/#alternative-2-without-the-proxy
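For orientation, the proxy-less setup from the linked preview looks roughly like this. This is a minimal sketch, assuming the option and package names (`publicAddress`, `RedisPresence`, `RedisDriver`) from the preview docs; the hostname is hypothetical, so check the linked page for the exact API:

```javascript
// Proxy-less scaling sketch (per the 0.15 preview docs linked above).
// Assumptions: option/package names follow the preview documentation;
// "node-1.example.com" is a hypothetical public hostname for this node.
const { Server } = require("@colyseus/core");
const { RedisPresence } = require("@colyseus/redis-presence");
const { RedisDriver } = require("@colyseus/redis-driver");

const gameServer = new Server({
  presence: new RedisPresence(), // shared presence across all nodes
  driver: new RedisDriver(),     // shared room listing across all nodes
  // Address clients use to connect *directly* to this node after
  // matchmaking, so no proxy has to sit in the WebSocket path:
  publicAddress: "node-1.example.com",
});

gameServer.listen(2567);
```

With this layout, the regular load balancer only handles the initial matchmaking request; each node is then reachable at its own public address.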
If your cluster is in an inconsistent state, I'd recommend checking the `roomcount` and `colyseus:nodes` contents on Redis; they should contain the same number of entries as you have Node processes.
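The check described above can be sketched as a small helper. This is a hypothetical illustration (the `checkClusterConsistency` function is not part of Colyseus); the key names come from the comment, and fetching their contents from Redis is left out so the logic is self-contained:

```javascript
// Illustrative consistency check: the entries under "roomcount" and
// "colyseus:nodes" in Redis should each match the number of Node.js
// processes you are running. Stale entries indicate an inconsistent cluster.
function checkClusterConsistency(nodesEntries, roomcountEntries, processCount) {
  const problems = [];
  if (nodesEntries.length !== processCount) {
    problems.push(
      `colyseus:nodes has ${nodesEntries.length} entries, expected ${processCount}`
    );
  }
  if (roomcountEntries.length !== processCount) {
    problems.push(
      `roomcount has ${roomcountEntries.length} entries, expected ${processCount}`
    );
  }
  return { consistent: problems.length === 0, problems };
}

// Example: 3 processes, but one stale entry left behind in colyseus:nodes.
const result = checkClusterConsistency(
  ["node-1", "node-2", "node-3", "node-stale"],
  ["node-1", "node-2", "node-3"],
  3
);
console.log(result.consistent); // false
console.log(result.problems);   // one message about colyseus:nodes
```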
Well, we terminate HTTPS at the Ingress Controller, which would situate it pretty close to how it's deployed in Arena (termination is done by a load balancer). So I doubt it has to do with TLS termination :/
Apparently a user of `http-proxy` managed to reproduce the memory leak consistently here: https://github.com/http-party/node-http-proxy/issues/1586
EDIT: not sure it's the same leak we have, but it sounds plausible.
So `http-proxy` is a dependency of the coly-proxy?
Yes, it is!
Sounds like we cannot do anything atm to mitigate this?
> If your cluster is in an inconsistent state, I'd recommend checking the `roomcount` and `colyseus:nodes` contents on Redis; they should contain the same number of entries as you have Node processes.
I am not sure if I understand this. What do you mean with "inconsistent state"? @endel
We have also run into this issue several times. Is the only solution to use the 0.15 preview that @endel mentioned, or is there another workaround?
ANY UPDATES?
@nzmax It seems to me that the proxy will no longer be fixed; Endel has not commented on this. It looks like we'll have to work with the new architecture in version 0.15. We don't know how this is supposed to work in a Kubernetes environment and are still waiting for news...
We are definitely interested in fixing this issue. We are still trying to reproduce the memory leak in a controlled environment. There are two things you can do to help:
We were seeing a consistent memory leak, with usage gradually growing over time.
we have replaced the version of `http-proxy` we use with `@refactorjs/http-proxy`. It was almost a drop-in replacement, but it seems to export slightly differently, so I had to change the imports; I got it working in just a couple of minutes.
So far, it seems promising. I will update in a week or so if it resolves the issue. It tends to take about a week before our proxies crash.
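For anyone attempting the same swap: the `http-proxy` side below is its well-known API, while the `@refactorjs/http-proxy` import shape is an assumption based on the comment that it "exports slightly differently" — check that package's README for the exact export names before relying on this:

```javascript
// Before: classic http-proxy, a CommonJS module with a
// createProxyServer() factory (requires the "http-proxy" package).
const httpProxy = require("http-proxy");
const proxy = httpProxy.createProxyServer({ target: "http://localhost:2567" });

// After: @refactorjs/http-proxy. Per the commenter it is almost a drop-in
// replacement, but the import line changes. The named export below is an
// ASSUMPTION, not verified against the package:
// const { ProxyServer } = require("@refactorjs/http-proxy");
// const proxy = new ProxyServer({ target: "http://localhost:2567" });
```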
Yesterday, we switched our live system to our new Kubernetes setup, utilizing the Colyseus Proxy together with MongoDB and Redis for load balancing. We had a public beta over the last month with about 800 players a day, and everything worked fine. But after about 20k players played for a day, we were seeing `seat reservation expired` more and more often, up to the point where nobody was able to join or create any lobby.

What we found:
- Examining the resource consumption of the Colyseus Proxy over the last 24 hours suggests a memory leak.
- Our logs repeatedly show these errors:
- Restarting the proxies fixes the problem temporarily.
Setup:
Edit: We were running 2 proxies behind a load balancer and 5 gameserver instances. This might be related to #30.
We really need help with this issue, as I am at my wit's end. Thank you in advance! :pray: