cloudflare / cloudflared

Cloudflare Tunnel client (formerly Argo Tunnel)
https://developers.cloudflare.com/cloudflare-one/connections/connect-apps/install-and-setup/tunnel-guide
Apache License 2.0

socket: too many open files #20

Closed: yon2004 closed this issue 2 years ago

yon2004 commented 6 years ago

I am attempting to get cloudflared proxy-dns to work reliably, but I have come across this error:

time="2018-04-16T10:44:12Z" level=error msg="failed to connect to an HTTPS backend \"https://1.1.1.1/dns-query\"" error="failed to perform an HTTPS request: Post https://1.1.1.1/dns-query: EOF"
time="2018-04-16T10:44:13Z" level=error msg="failed to connect to an HTTPS backend \"https://1.1.1.1/dns-query\"" error="failed to perform an HTTPS request: Post https://1.1.1.1/dns-query: dial tcp 1.1.1.1:443: socket: too many open files"

Looking at netstat, there is a large number of sockets stuck in the CLOSE_WAIT state.

tcp      386      0 124.188.xxx.xxx:56366    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56450    1.1.1.1:443             CLOSE_WAIT
tcp      385      0 124.188.xxx.xxx:56389    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56386    1.1.1.1:443             CLOSE_WAIT
tcp      387      0 124.188.xxx.xxx:56463    1.1.1.1:443             CLOSE_WAIT
tcp      387      0 124.188.xxx.xxx:56254    1.1.1.1:443             CLOSE_WAIT
tcp      385      0 124.188.xxx.xxx:56504    1.1.1.1:443             CLOSE_WAIT
tcp      385      0 124.188.xxx.xxx:56454    1.1.1.1:443             CLOSE_WAIT
tcp      387      0 124.188.xxx.xxx:56262    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56532    1.1.1.1:443             CLOSE_WAIT
tcp      387      0 124.188.xxx.xxx:56487    1.1.1.1:443             CLOSE_WAIT
tcp      387      0 124.188.xxx.xxx:56380    1.1.1.1:443             CLOSE_WAIT
tcp      385      0 124.188.xxx.xxx:56286    1.1.1.1:443             CLOSE_WAIT
tcp      387      0 124.188.xxx.xxx:56490    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56475    1.1.1.1:443             CLOSE_WAIT
tcp      387      0 124.188.xxx.xxx:56385    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56312    1.1.1.1:443             CLOSE_WAIT
tcp      387      0 124.188.xxx.xxx:56282    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56387    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56535    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56379    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56401    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56492    1.1.1.1:443             CLOSE_WAIT
tcp      385      0 124.188.xxx.xxx:56528    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56285    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56373    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56518    1.1.1.1:443             CLOSE_WAIT
tcp      385      0 124.188.xxx.xxx:56257    1.1.1.1:443             CLOSE_WAIT
tcp      387      0 124.188.xxx.xxx:56349    1.1.1.1:443             CLOSE_WAIT
tcp      385      0 124.188.xxx.xxx:56345    1.1.1.1:443             CLOSE_WAIT
tcp      387      0 124.188.xxx.xxx:56477    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56321    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56397    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56534    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56468    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56340    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56457    1.1.1.1:443             CLOSE_WAIT
tcp      386      0 124.188.xxx.xxx:56363    1.1.1.1:443             CLOSE_WAIT
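
A quick way to watch the leak grow is to count these sockets directly (a hedged one-liner; adjust the upstream address if a different resolver is configured):

# count connections to the DoH upstream stuck in CLOSE_WAIT
netstat -ant | grep '1.1.1.1:443' | grep -c CLOSE_WAIT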

I am running this on a Ubiquiti EdgeRouter Lite, and the ulimit values are shown below.

root@ubnt:~# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1934
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1934
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

cloudflared.log
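
As a stopgap only (it does not fix the leak, it just delays the error), the per-process open-files limit can be raised for cloudflared. Here is a sketch assuming it runs as a systemd service named cloudflared; on systems without systemd, running ulimit -n 65536 in the launch script is the equivalent:

# systemd drop-in raising the file-descriptor limit for the service
mkdir -p /etc/systemd/system/cloudflared.service.d
cat > /etc/systemd/system/cloudflared.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF
systemctl daemon-reload && systemctl restart cloudflared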

yon2004 commented 6 years ago

I have submitted a pull request that fixes this. #24

yon2004 commented 6 years ago

Just an update: this patch does not fix the underlying issue, and further investigation will be required.

nickw444 commented 6 years ago

I'm also encountering this issue running on an ERLite! I self-compiled for MIPS64.

rufoa commented 6 years ago

I'm getting the same symptoms on the ER-X (mipsle): it worked well for 3 hours, then started giving this error several times a second:

msg="failed to connect to an HTTPS backend \"https://1.1.1.1/dns-query\"" error="failed to perform an HTTPS request: Post https://1.1.1.1/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"

then when the file descriptors were exhausted:

msg="failed to connect to an HTTPS backend \"https://1.1.1.1/dns-query\"" error="failed to perform an HTTPS request: Post https://1.1.1.1/dns-query: dial tcp 1.1.1.1:443: socket: too many open files"

tv3 commented 5 years ago

Encountered this bug after adding cloudflared on my USG. It's unusable for now. Is there an ETA for the fix?

janisz commented 5 years ago

I'm hit by this bug on a CI20 (mipsle). The quick fix is to periodically restart cloudflared. Using 25a04e0.
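
A minimal sketch of that periodic restart via cron, assuming cloudflared is managed as a system service named cloudflared (adjust the interval and the restart command to the init system in use):

# /etc/cron.d/cloudflared-restart (hypothetical file): restart every 6 hours
0 */6 * * * root /usr/bin/systemctl restart cloudflared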

hacktek commented 4 years ago

This happens with Argo Tunnel as well, and I hit it after a couple of days. When I checked netstat for the proxied port while it was happening, I could see thousands of kept-alive connections even though only a couple of users were accessing the tunnel. Restarting the tunnel cleared them, so it looks like this is something I'll have to do daily, otherwise a lot of instability is introduced.

Before tunnel restart / issue happening:

# netstat -ano | grep 8123 | wc -l
1717

After tunnel restart:

# netstat -ano | grep 8123 | wc -l
7
nickvollmar commented 4 years ago

Would you be able to update to release 2019.11.0? We've introduced a couple of commits that should help with this: https://github.com/cloudflare/cloudflared/commit/3a9a0a0d751304519cb544b430d8df8bdb563887 https://github.com/cloudflare/cloudflared/commit/e14ec1a1fbba4ce65edddb7961643017e1014ab8

hacktek commented 4 years ago

That sounds good; I'm currently running 2019.10.4.

Let me give 2019.11.0 a shot.

EDIT: I tried it and connection usage appears to be much better now. Here's the same count of total connections on the local port over time, taken every 5 seconds:

# while [ 1 ]; do netstat -ano | grep 8123 | wc -l; sleep 5; done
14
13
13
27
27
25
62
61
62
62
62
86
84
83
83
72
72
71
66
35
34
35
35
13
13
17
86
88
113
115
115
116
116
116
116
115
114
123
60
57
34
33
31
34
34
35
34
34
34
26
28
24
24
23
23
21
29
27
27
fletchto99 commented 4 years ago

I'm having this issue as well, and restarting cloudflared didn't seem to fix it. I wonder if it's related to https://github.com/cloudflare/cloudflared/issues/91

Everything was working for the past few weeks, I rebooted my USG and then all of the errors started pouring in.

On a side note, are there official MIPS builds now?

tv3 commented 4 years ago

Gave up on this product. Now using dnscrypt-proxy on my USG. Works like a charm. https://dnscrypt.info https://community.ui.com/questions/DNSCrypt-Proxy-2/0d4eaa80-aff7-49dd-9dc4-7d0378afd288

hacktek commented 4 years ago

Just to report that this continues to happen on v2019.12.0.

cloudflared[7047]: time="2020-01-15T17:37:09-06:00" level=error msg="HTTP request error" error="Error proxying request to origin: dial tcp 127.0.0.1:8000: socket: too many open files"

It again seems like the service just leaves a bunch of open sockets forever until some ulimit is reached:

# while [ 1 ]; do netstat -ano | grep '8000' | wc -l; sleep 2; done
1996
1995
1995
1995
^C

After restarting cloudflared:

# while [ 1 ]; do netstat -ano | grep '8000' | wc -l; sleep 2; done
7
7
7
7
7
^C

Off to schedule a daily service restart...

immortalx commented 4 years ago

This is not a solution available to everyone, but I was able to resolve this by putting nginx as a proxy between cloudflared and my app backend. Instead of thousands of connections and the occasional open-files limit error, everything is working fine now.
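
For anyone trying the same approach, here is a minimal sketch of the nginx layer. The ports, file path, and keepalive pool size are assumptions; the idea is that cloudflared proxies to nginx on 127.0.0.1:8080, and nginx reuses a small pool of keepalive connections to the real backend instead of letting connections pile up against it.

# hypothetical drop-in config, written and reloaded from the shell
cat > /etc/nginx/conf.d/cloudflared-backend.conf <<'EOF'
upstream app_backend {
    server 127.0.0.1:8123;   # the real application
    keepalive 16;            # small pool of reusable upstream connections
}

server {
    listen 127.0.0.1:8080;   # point cloudflared's origin at this port
    location / {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # required for upstream keepalive
    }
}
EOF
nginx -t && systemctl reload nginx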