Closed skakee closed 4 years ago
Thank you for the testing! What are you using for all the connections? Are they stable? TCP only, or SSH?
Hi @skakee! Thanks for the report. Out of curiosity, do you have an actual use case for more than 1,000 sessions? We'd love to hear about it.
lspgn, we are an ISP with ~3K routers deployed globally. Currently we have 440 routers with 625 SSH connections (some routers run two CPUs and make two connections). These are stable.
jejenone, yes, we will deploy RPKI on all our ~3K routers. :)
I will do more testing this week and report any findings. I quickly posted the initial error in case it's a known issue, without really doing much additional troubleshooting. It was on a Friday afternoon. :)
@skakee thanks for the details! This is great. Definitely not a known issue, but from the trace, it could be a race condition. I'll create a benchmark tool and hope it can reproduce the problem. Does it happen often? If you can reproduce it, could you run a tcpdump of all the TCP sessions to the GoRTR port? I will also need the version of GoRTR used and, if possible, the version of your router software.
Here's the scenario... 440 devices with 625 connections, to two servers running gortr, and that works fine. We then add RPKI config with those two servers to 266 additional routers. The routers start making connections, eventually they all connect, and shortly after, within minutes, the gortr process crashes. We tried increasing the open files limit without success.
I could get tcpdump but it would be quite large.
As for the version... "-version" option does not seem to be helpful. [root@rpki01-hhc ~]# /opt/cloudflare/go/bin/gortr -version GoRTR
GoRTR was installed with "go get github.com/cloudflare/gortr/cmd/gortr"
I see, for go get, did you run it after Mar 30, 2020?
I will try to reproduce on my side.
Yes, go get is up to date as of now. Also, to verify that it's not one of the routers in the batch of 266 that is causing an issue (as opposed to the sheer number of devices/connections), I created another server (rpki3) and had those routers connect to that server. It's been solid for a few hours, so it seems it's the number of connections that is causing the issue.
Update: I was mistaken, the last go get was run on Feb 17th. I have now updated. I will not be able to run a test until tomorrow. Thank you for pointing out the version number; I would not have taken another look, believing it was up to date.
I'll have a look, it's probably still affected.
Just in case, if you want to automatically add the version based off the git repo, use make build-gortr: it will compile GoRTR (see https://github.com/cloudflare/gortr/blob/master/Makefile#L44-L46).
I managed to reproduce the error on the latest version.
GoRTR would raise this error instead of accepting a new connection: accept tcp [::]:8282: accept4: too many open files.
Will make a fix.
I made the fix, it should not crash after ~1000 sessions anymore but will raise an error.
Could you run this updated version? In the root git dir, git pull && git checkout bug/acceptmax && make build-gortr should work (version should be v0.14.4-4-g60070ff).
You will need to raise the limit with ulimit -n 5000 (verify with ulimit -a) to allow more open files.
I will merge and make a release soon after.
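As an aside, the same limit that ulimit -n controls can be inspected and raised from inside a Go process. The sketch below assumes Linux and uses the standard syscall package; raiseNoFile is a hypothetical helper, not something GoRTR provides, and the value 5000 is simply the figure suggested above.

```go
package main

import (
	"fmt"
	"syscall"
)

// raiseNoFile (hypothetical helper) tries to raise the soft RLIMIT_NOFILE to
// at least want, capped at the hard limit, and returns the resulting soft limit.
// Assumes Linux, where syscall.Rlimit fields are uint64.
func raiseNoFile(want uint64) (uint64, error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, err
	}
	if lim.Cur >= want {
		return lim.Cur, nil // already high enough, nothing to change
	}
	if want > lim.Max {
		want = lim.Max // an unprivileged process cannot exceed the hard limit
	}
	lim.Cur = want
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, err
	}
	return lim.Cur, nil
}

func main() {
	soft, err := raiseNoFile(5000)
	if err != nil {
		fmt.Println("could not adjust open-file limit:", err)
		return
	}
	fmt.Println("soft open-file limit is now", soft)
}
```

If the hard limit is below the number of sessions you expect, it still has to be raised outside the process, for example in the service manager or shell that starts it.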
@lspgn, compiled and tried v0.14.4-4-g60070ff. First with the open file limit set at 1024, and it worked as expected... no crash, just new connections were not being accepted and an error message was logged. Then with the limit raised, and the process accepted all the connections. We are up to 1114 connections now. Thanks!
gortr panics at about 1,000 connections with the messages below. Still testing the actual limit and whether there are any system resource limits that need to be increased. Will post when I have more info.
Running on a single CPU VM, CentOS 8, no changes to the default max system files or other parameters.