geckosio / geckos.io

🦎 Real-time client/server communication over UDP using WebRTC and Node.js http://geckos.io

Geckos 2.0 http server hangs after a while #137

Closed · bennlich closed this issue 2 years ago

bennlich commented 2 years ago

Hey, there! I recently upgraded to geckos 2.0. Everything has been working swimmingly, except for one mysterious problem. After about two days of runtime, my geckos server starts timing out all requests. It sits behind an NGINX proxy, so the first error I see is:

2021/11/09 15:44:56 [error] 15785#15785: *5965624 upstream timed out (110: Connection timed out) while reading response header from upstream, request: "POST /geckos/.wrtc/v2/connections HTTP/1.1", upstream: "http://127.0.0.1:3030/.wrtc/v2/connections"

eventually followed by a similar error:

2021/11/09 16:06:34 [error] 15785#15785: *5969357 upstream timed out (110: Connection timed out) while connecting to upstream, request: "POST /geckos/.wrtc/v2/connections HTTP/1.1", upstream: "http://127.0.0.1:3030/.wrtc/v2/connections"

I haven't found any correlated errors coming from the geckos server itself. It just suddenly starts hanging and won't reply anymore :( CPU and MEM usage seem normal.

Any idea how best to dig into this?

yandeu commented 2 years ago

Hmmm,

Could it be that a promise is hanging somewhere?

Should we try to keep track of unfulfilled promises? Is this even possible?
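
(For what it's worth, something like this is possible with Node's async_hooks; a minimal, debugging-only sketch that is not part of geckos, just an illustration of the idea:)

// Count promises that were created but never resolved, using async_hooks.
// This has noticeable overhead, so it is only meant for debugging sessions.
import async_hooks from 'async_hooks'

const pending = new Set<number>()

async_hooks.createHook({
  init(asyncId, type) {
    if (type === 'PROMISE') pending.add(asyncId)
  },
  promiseResolve(asyncId) {
    pending.delete(asyncId)
  },
  destroy(asyncId) {
    pending.delete(asyncId) // promise was garbage-collected
  }
}).enable()

// A count that only ever grows hints at dangling promises somewhere.
setInterval(() => console.log('pending promises:', pending.size), 10_000)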

bennlich commented 2 years ago

There's a very popular coding manual with something to say about this:

Screenshot from 2021-11-10 22-14-31

jk jk

I think the point of async / promises is that they don't block anything when they dangle, no? I guess I'll read through geckos and try to find a spot that could end up being blocking.

Is there anything that changed architecturally between 1.0 and 2.0 that comes to mind as a possible culprit?

yandeu commented 2 years ago

I tried to change as little as possible. But if you say it never happened in v1, the bug will be somewhere here: https://github.com/geckosio/geckos.io/commit/21a857f0eacbca2406c848214a7c58da34f568c4

bennlich commented 2 years ago

Before upgrading to 2.0, my geckos server would eventually enter a 100% CPU state after running for a while and need to be rebooted. This doesn't happen anymore, which is cool 😎

yandeu commented 2 years ago

Can you please try to use express instead of geckos' server? https://github.com/geckosio/geckos.io#express

I suspect maybe something here (https://github.com/geckosio/geckos.io/blob/master/packages/server/src/httpServer/httpServer.ts) is wrong.
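
(For reference, a minimal sketch of that express setup, roughly following the README section linked above; port 3030 just matches the nginx upstream from the first post:)

import http from 'http'
import express from 'express'
import geckos from '@geckos.io/server'

const app = express()
const server = http.createServer(app)
const io = geckos()

// attach geckos to the existing http server instead of letting it create its own
io.addServer(server)

io.onConnection(channel => {
  channel.onDisconnect(() => console.log(`${channel.id} disconnected`))
})

// listen on the http server; do not call io.listen() in this setup
server.listen(3030)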

bennlich commented 2 years ago

Can you please try to use express instead of geckos' server?

Kk. I just deployed this change. I'll let you know if it helps.

yandeu commented 2 years ago

I think I will create a middleware for expressjs (and similar frameworks) and use an already existing (and stable) http server for the "standalone mode".

bennlich commented 2 years ago

Using express did not fix the issue :/

Probably we should write some kind of high-throughput stress test, but I'm not sure when I'll get around to this. One path toward debugging would be:

1) Reproduce in isolation
2) strace that bad boy

But I think in my case it's going to be easier to just write a periodic heartbeat check and reboot.

bennlich commented 2 years ago

I wrote a heartbeat script that makes an OPTIONS request to /.wrtc/v2, but this didn't actually catch the crash. The server responded to this request even after it entered the bad state.

So I guess I was wrong -- the server process doesn't get totally blocked. The symptom is specifically that POST requests to /geckos/.wrtc/v2/connections start timing out, and the server stops receiving webrtc messages.

EDIT: FWIW the OPTIONS heartbeat check did actually seem to work most of the time. It restarted my server about once a day, which was about the frequency I was seeing the issue before.
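
(A rough sketch of such a heartbeat, not the actual script from this thread; the host, port, path, interval, and restart command are all assumptions, and as noted above a check against the POST /connections route would match the failure more closely:)

import http from 'http'
import { exec } from 'child_process'

const check = () => {
  const req = http.request(
    { host: '127.0.0.1', port: 3030, path: '/.wrtc/v2', method: 'OPTIONS', timeout: 5000 },
    res => res.resume() // any response at all counts as alive
  )
  req.on('timeout', () => {
    req.destroy()
    exec('pm2 restart geckos-server') // hypothetical pm2 process name
  })
  req.on('error', () => { /* connection errors could trigger a restart too */ })
  req.end()
}

setInterval(check, 60_000)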

yandeu commented 2 years ago

Maybe you reached the maximum number of connections?

What about the other 3 routes?

'POST': /connections/[0-9a-zA-Z]+/remote-description
'GET':  /connections/[0-9a-zA-Z]+/additional-candidates
'POST': /connections/[0-9a-zA-Z]+/close

yandeu commented 2 years ago

Please try version 2.1.2, it has a lot of changes.

marcwehbi commented 2 years ago

Hi @yandeu, I seem to be having the exact same issue :( What's funny is that it happens on my NA server but not on EU. The Node.js app doesn't give any error or anything, it just completely stops responding until I restart it. It's using https.createServer, but the exact same code has been running fine for months on my EU server. I just released the same code on a new server and it keeps hanging every hour or so, and it doesn't even crash the app, so things like pm2 cannot auto-restart it. The entire process seems to hang even though there isn't any CPU or memory spike. I upgraded to 2.1.2 as suggested, but the issue is still persisting.

bennlich commented 2 years ago

Whats funny is that it happens on my NA server but not on EU.

Is your NA server getting more traffic than EU?

marcwehbi commented 2 years ago

No, it's not related to traffic; it happens consistently, even at low-traffic times. I will try making an image of my EU server and deploying it in NA to see if that fixes it (tomorrow) :)

yandeu commented 2 years ago

I have just deployed the simple-chat-app-example (httpServer branch) to http://35.177.165.30:3000/.

Will see if, and when, it happens to me.

yandeu commented 2 years ago

@marcwehbi, do you use Google's STUN server? Could this be the issue?

Maybe try Twilio's STUN server: https://www.twilio.com/stun-turn/pricing
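
(If it helps, switching STUN servers is just a config change on the geckos side; a sketch assuming the iceServers option, with a placeholder Twilio STUN URL:)

import geckos from '@geckos.io/server'

const io = geckos({
  // replace Google's STUN server with Twilio's (URL is a placeholder, check their docs)
  iceServers: [{ urls: 'stun:global.stun.twilio.com:3478' }]
})

io.onConnection(channel => {
  // ...
})

io.listen(3030)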

marcwehbi commented 2 years ago

Yeah, I'm using the Google ones (on both EU and NA). I will use Twilio's on NA tomorrow and see if that helps as well.

marcwehbi commented 2 years ago

I've just swapped to Twilio's STUN server; I'll see in an hour or so if it freezes again. If it does, I'm just gonna destroy the droplet and restore it from the EU image to see if maybe it's some server configuration I overlooked.

marcwehbi commented 2 years ago

It still froze :( so I will try the image swapping next.

yandeu commented 2 years ago

It still froze

Already? My chat app is still running fine.


Maybe we should ask @murat-dogan and @paullouisageneau for help.

marcwehbi commented 2 years ago

Yeah.. It freezes every hour or so, sometimes less.

But the European one has actually been running 100% fine for months; it even had ~400 connected users constantly for 3 days in a row (different users in and out), and it's still running. That's why I'm really not sure why the new server isn't behaving the same. I hope the image copy will fix it, but I need to wait for DigitalOcean to allow me another droplet... Will keep you posted :)

paullouisageneau commented 2 years ago

There was a possible deadlock in usrsctp that could cause the same symptom: the usrsctp send function never returned and the event loop ended up stuck as a result. However, I shipped the fix in libdatachannel v0.15.3, which corresponds to node-datachannel v0.1.11, so I guess the fix should be in v2.1.2.

@bennlich @marcwehbi Are you familiar with gdb? To check whether it's a similar bug, you can investigate the stuck process by attaching gdb to it and obtaining the backtraces of all threads with gdb -p [pid] -ex "thread apply all bt" -ex "quit". This lets you check whether a thread is stuck trying to lock a mutex, for instance. Please post the backtraces and I'll have a look.

marcwehbi commented 2 years ago

Hi @paullouisageneau, I am not familiar with GDB, but I think I managed it fine :)

The process is currently not stuck; it has been up for 4h (the longest yet). I have attached the GDB dump just to make sure I got it right (maybe you can already see something wrong), but since it is not currently stuck, I will attach another one when it does get stuck!

gdb_output.txt

marcwehbi commented 2 years ago

Okay, well, I spoke too fast :P It crashed. I believe it happens after many connections have been made; the reason it took longer this time is that it was nighttime in the US, so not many people connected.

Here is the gdb output of a stuck process @paullouisageneau gdb_output.txt

P.S. I have already been using 2.1.2 since yesterday. Edit 2: I am using 2.1.2 on the server only and not the client, could that be why it didn't fix it?

yandeu commented 2 years ago

I am using 2.1.2 on the server only and not the client, could that be why it didnt fix it?

No, I don't think so. The client code hasn't changed in a while and only uses the browser's WebRTC API. But yes, you should keep them in sync.

marcwehbi commented 2 years ago

Some additional info, in case it helps: the issue definitely happens over time. My setup is 1 lobby server (supposed to live forever) and 16 "game servers" which each take 8 players, then reset (a full app reset) and register against the lobby again. The game servers, which reboot after every successful game, never have an issue, as they are short-lived. The issue only happens with the lobby server after a certain amount of connections have been made to it.

The application seems to freeze when a user tries to connect to the geckos.io server; somehow it becomes unresponsive at that point. (I know this because the same app runs a regular https server + a geckos server: I can reload the https server and it is responsive, then I try to connect to geckos.io and it doesn't respond, and from that point on both the https and geckos servers are unresponsive.) I was able to catch the exact moment it gets stuck a few times; I can't reproduce it at will, but I can be there when it does get stuck. When the application is stuck and I try to connect, it just loads forever, but rebooting the Node.js lobby server will accept my "pending connection", as if the request were being held outside the app, waiting for the app to become available.

Finally, as I've said, this issue is only happening on 1 of the 2 servers I have; the European lobby isn't having any issue at all (100% same code). I am tempted to keep only the EU lobby for everyone and get rid of the NA one, as I'm out of ideas for why it's happening :)

Sorry if I wasn't clear; I'm just trying to add details that may help find the root cause.

yandeu commented 2 years ago

Side note: why don't you just use WebSockets in the lobby, and geckos only for the games?

marcwehbi commented 2 years ago

Well, I just swapped the whole instance image to the EU one. If that doesn't work, then yeah, I'll either use the EU lobby as the central one or switch to WSS as you suggested (the reason I didn't is that I was already using geckos for the game networking and it worked so well I thought I might as well use it for the lobby too!)

yandeu commented 2 years ago

Maybe if a player connects to the lobby on a phone and then switches tabs to another website, the WebRTC connection stays open for too long?

I just opened chrome://webrtc-internals/ in Chrome on Android and noticed there were two open WebRTC connections to my chat app example.

Maybe your lobby has too many latent connections?

Maybe add a script to disconnect a player after 5 min of inactivity?

Just a theory.
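
(A minimal sketch of that idle-disconnect idea, not part of geckos itself; how you define "activity" depends on your protocol, so the 'ping' event below is hypothetical:)

import geckos from '@geckos.io/server'

const IDLE_LIMIT_MS = 5 * 60 * 1000
const io = geckos()

io.onConnection(channel => {
  let lastSeen = Date.now()
  const touch = () => { lastSeen = Date.now() }

  channel.onRaw(touch)      // raw messages count as activity
  channel.on('ping', touch) // hypothetical keep-alive event from the client

  const timer = setInterval(() => {
    if (Date.now() - lastSeen > IDLE_LIMIT_MS) channel.close()
  }, 30_000)

  channel.onDisconnect(() => clearInterval(timer))
})

io.listen(3030)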

marcwehbi commented 2 years ago

Thanks for the ideas :) It isn't latent connections for sure, because the lobby disconnects the player when it finds a server. (Also, the page isn't available on mobile, and I was the only person connected when it used to crash.)

That being said, restoring the image from the EU server seems to have fixed it, as the lobby has been running for 12h now.

No idea what it could've been, but I'm glad it's resolved :D

bennlich commented 2 years ago

That is peculiar. Can you think of anything different about the images? OS version?

@paullouisageneau thank you. I'll try this the next time my server freezes. It takes much longer for this to happen for me than for @marcwehbi.

marcwehbi commented 2 years ago

I really have no idea, @bennlich. Same OS (Ubuntu 20.04), same Node version (14.15), same pm2 version; everything I was using seemed to be the same.

I delayed copying the image as long as I could, to try to find the cause, as it sucks not knowing what the issue was. For the next region I roll out, I will try setting it up manually again, without an image, to see if it happens again.

Thanks nonetheless for all the ideas and support, guys, really :)

paullouisageneau commented 2 years ago

@marcwehbi Thank you for the gdb output, it looks like a regression introduced in libdatachannel v0.15.4. I refactored the transports teardown and it looks like it might make the PeerConnection deadlock on close. The timings triggering the deadlock must happen on one machine and not on the other for some reason.

yandeu commented 2 years ago

Just want to share an automated script to install the geckos example on Ubuntu 20.04 (AWS EC2).
It will not solve this issue, but maybe it helps anyway :)


Security Group (Firewall)

Protocol   Port range     Source
UDP        1024 - 65535   0.0.0.0/0
TCP        22             MY-IP/24
TCP        3000           0.0.0.0/0

Installation Script

The name of the user is ubuntu.

#!/bin/bash

# tested on ubuntu 20.04 / t3a.nano

sudo apt update && \
sudo apt upgrade -yq && \
sudo apt install cmake -yq && \

# Node.js 16.x (https://github.com/nodesource/distributions/blob/master/README.md#deb)
curl -fsSL https://deb.nodesource.com/setup_16.x | sudo -E bash - && \
sudo apt-get install -y nodejs && \
sudo npm install -g npm@8.1.4 && \

# Install pm2
sudo npm install -g pm2@latest && \

# Install gitget
sudo npm install -g gitget@latest && \

# Navigate home
cd /home/ubuntu && \

# Download Repository
sudo -u ubuntu gitget geckosio/simple-chat-app-example#httpServer && \

# Install
cd simple-chat-app-example && \
sudo -u ubuntu npm install && \

# PM2
sudo -u ubuntu pm2 start npm -- run serve && \
sudo -u ubuntu pm2 save && \
sudo env PATH="$PATH:/usr/bin" /usr/lib/node_modules/pm2/bin/pm2 startup systemd -u ubuntu --hp /home/ubuntu && \
sudo -u ubuntu pm2 save && \

# Finish
cd ../ && \
sudo -u ubuntu echo "done" | sudo tee status.txt && \
sudo shutdown -r now

marcwehbi commented 2 years ago

Hi @paullouisageneau, glad to hear the cause was found! And thank you @yandeu for the script, definitely helpful :)

bennlich commented 2 years ago

@paullouisageneau woooooo!!! Out of curiosity, what part of the trace was the clue?

paullouisageneau commented 2 years ago

This should be fixed in libdatachannel v0.15.6; this is the PR to update node-datachannel: https://github.com/murat-dogan/node-datachannel/pull/64

@paullouisageneau woooooo!!! Out of curiosity, what part of the trace was the clue?

@bennlich In the trace, a thread waits for a lock somewhere in rtc::impl::PeerConnection::closeTransports()::{lambda()#1}::operator()() while another waits for the first one to finish in rtc::impl::Processor::join(). There is no debug info, but given the scenario, it appears the lock is related to callback synchronization and is rightfully held by the second thread. The mistake was that callbacks were reset at the wrong place in closeTransports(), creating the deadlock risk.

yandeu commented 2 years ago

@paullouisageneau Thanks a lot 👍🏻😊🥳

murat-dogan commented 2 years ago

I am also releasing node-datachannel v0.1.13 with new fixes. Thanks!

https://github.com/murat-dogan/node-datachannel/releases/tag/v0.1.13

yandeu commented 2 years ago

@murat-dogan Thanks!

I also just released geckos v2.1.4

bennlich commented 2 years ago

@murat-dogan @paullouisageneau @yandeu thank you all! I just deployed geckos v2.1.4. I was seeing the deadlock about once per day before, so I'll wait a couple of days, see what happens, then close this bad boy.

paullouisageneau commented 2 years ago

@yandeu @murat-dogan Thanks!

@bennlich Excellent, keep us updated, I hope the issue is solved.

bennlich commented 2 years ago

Haven't seen any deadlocks in the last 4 days! Looks fixed!

yandeu commented 2 years ago

Nice!

Btw, I made a great Docker example yesterday that I want to share with you: https://github.com/geckosio/docker-example