cloudflare / gortr

The RPKI-to-Router server used at Cloudflare
https://rpki.cloudflare.com
BSD 3-Clause "New" or "Revised" License

gortr panics at about 1,000 connections. #65

Closed skakee closed 4 years ago

skakee commented 4 years ago

gortr panics at about 1,000 connections with the messages below. I am still testing the actual limit and whether any system resource limits need to be increased. Will post when I have more info.

Running on a single CPU VM, CentOS 8, no changes to the default max system files or other parameters.

May 15 20:19:08 rpki01-hhc gortr[25828]: panic: runtime error: invalid memory address or nil pointer dereference
May 15 20:19:08 rpki01-hhc gortr[25828]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x6b794d]
May 15 20:19:08 rpki01-hhc gortr[25828]: goroutine 13 [running]:
May 15 20:19:08 rpki01-hhc gortr[25828]: github.com/cloudflare/gortr/lib.(*Server).loopTCP(0xc0000ec300, 0xa0a9c0, 0xc000010190, 0x958e16, 0x3, 0xc00ba79f60)
May 15 20:19:08 rpki01-hhc gortr[25828]: #011/opt/cloudflare/go/src/github.com/cloudflare/gortr/lib/server.go:586 +0x28d
May 15 20:19:08 rpki01-hhc gortr[25828]: github.com/cloudflare/gortr/lib.(*Server).StartSSH(0xc0000ec300, 0x7ffe0d0e9d61, 0x5, 0xc0000815f0, 0x0, 0x0)
May 15 20:19:08 rpki01-hhc gortr[25828]: #011/opt/cloudflare/go/src/github.com/cloudflare/gortr/lib/server.go:604 +0x108
May 15 20:19:08 rpki01-hhc gortr[25828]: main.main.func5(0xc0000ec300, 0xc0000815f0)
May 15 20:19:08 rpki01-hhc gortr[25828]: #011/opt/cloudflare/go/src/github.com/cloudflare/gortr/cmd/gortr/gortr.go:700 +0x51
May 15 20:19:08 rpki01-hhc gortr[25828]: created by main.main
May 15 20:19:08 rpki01-hhc gortr[25828]: #011/opt/cloudflare/go/src/github.com/cloudflare/gortr/cmd/gortr/gortr.go:699 +0xac9
May 15 20:19:08 rpki01-hhc systemd[1]: gortr.service: Failed with result 'exit-code'.

lspgn commented 4 years ago

Thank you for the testing! What are you using for all the connections? Are they stable? TCP only or SSH?

jejenone commented 4 years ago

Hi @skakee! Thanks for the report. Out of curiosity, do you have an actual use case for more than 1,000 sessions? We'd love to hear about it.

skakee commented 4 years ago

lspgn, we are an ISP with ~3K routers deployed globally. Currently we have 440 routers with 625 SSH connections (some routers run two CPUs and make two connections). These are stable.

jejenone, yes, we will deploy RPKI on all our ~3K routers. :)

I will do more testing this week and report any findings. I quickly posted the initial error in case it's a known issue, without really doing much additional troubleshooting. It was on a Friday afternoon. :)

lspgn commented 4 years ago

@skakee thanks for the details! This is great. Definitely not a known issue, but from the trace, it could be a race condition. I'll create a benchmark tool and hope it can reproduce the problem. Does it happen often? If you can reproduce it, could you tcpdump all the TCP sessions to the GoRTR port? I will also need the version of GoRTR used and, if possible, the version of your router software.

skakee commented 4 years ago

Here's the scenario... 440 devices with 625 connections to two servers running gortr; that works fine. We then add RPKI config with those two servers to 266 additional routers. The routers start making connections, eventually they all connect, and shortly after, within minutes, the gortr process crashes. We tried increasing the open files limit without success.

I could get tcpdump but it would be quite large.

As for the version... the "-version" option does not seem to be helpful:

[root@rpki01-hhc ~]# /opt/cloudflare/go/bin/gortr -version
GoRTR

GoRTR was installed with "go get github.com/cloudflare/gortr/cmd/gortr"

lspgn commented 4 years ago

I see. Did you run go get after Mar 30, 2020? I will try to reproduce on my side.

skakee commented 4 years ago

Yes, go get is up to date as of now. Also, to verify that it's not one of the routers in the batch of 266 that is causing the issue (as opposed to the sheer number of devices/connections), I created another server (rpki3) and had those routers connect to that server. It's been solid for a few hours, so it seems it's the number of connections that is causing the issue.

skakee commented 4 years ago

Update: I was mistaken, the last go get was run on Feb 17th. I have now updated. I will not be able to run a test until tomorrow. Thank you for pointing out the version number; I would not have taken another look, believing it was up to date.

lspgn commented 4 years ago

I'll have a look, it's probably still affected. Just in case, if you want to automatically add the version based off the git repo, use make build-gortr: it will compile GoRTR https://github.com/cloudflare/gortr/blob/master/Makefile#L44-L46

lspgn commented 4 years ago

I managed to reproduce the error on the latest version. GoRTR would raise this error instead of accepting a new connection: "accept tcp [::]:8282: accept4: too many open files". Will make a fix.

lspgn commented 4 years ago

I made the fix. It should not crash after ~1000 sessions anymore but will raise an error instead. Could you run this updated version? In the root git dir, git pull && git checkout bug/acceptmax && make build-gortr should work (version should be v0.14.4-4-g60070ff). You will need to raise the open files limit with ulimit -n 5000 (verify with ulimit -a) to allow more open files. I will merge and make a release soon after.

skakee commented 4 years ago

@lspgn, compiled and tried v0.14.4-4-g60070ff. First with the open file limit set at 1024, and it worked as expected... no crash, just new connections were not being accepted and an error message was logged. Then with the upped limit, and the process accepted all the connections. We are up to 1114 connections now. Thanks!
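As an aside, the limit a process actually runs under can be checked from Go as well as with ulimit -a. The sketch below assumes a Linux or other Unix system where syscall.Getrlimit is available. openFileLimit is a hypothetical helper, and the 5000 target is simply the value suggested earlier in this thread.

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimit is a hypothetical helper that returns the soft and hard
// RLIMIT_NOFILE values for the current process.
func openFileLimit() (soft, hard uint64, err error) {
	var lim syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	return uint64(lim.Cur), uint64(lim.Max), nil
}

func main() {
	const wanted = 5000 // assumed target, taken from "ulimit -n 5000" above

	soft, hard, err := openFileLimit()
	if err != nil {
		fmt.Println("could not read RLIMIT_NOFILE:", err)
		return
	}
	fmt.Printf("open files: soft=%d hard=%d\n", soft, hard)
	if soft < wanted {
		fmt.Printf("soft limit is below %d; new connections may be refused\n", wanted)
	}
}
```

Each session consumes at least one file descriptor, so the soft limit needs some headroom above the expected number of router connections.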

mahtin commented 4 years ago

@skakee - there are a few extra notes on the setting of #openfiles on this page.