SagerNet / sing-box

The universal proxy platform
https://sing-box.sagernet.org/

Memory usage keeps slowly increase when using vless tls reality inbound #690

Closed: freakinyy closed this issue 1 year ago

freakinyy commented 1 year ago


Description of the problem

It seems there is no upper limit on memory usage. This test was done on a clean server where almost nothing but sing-box runs.

top:

```console
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 7427 root      20   0 3145804 556368   7540 S   0.3  56.7   9:18.61 sing-box
```

free:

```console
              total        used        free      shared  buff/cache   available
Mem:          957Mi       683Mi       162Mi       0.0Ki       110Mi       153Mi
Swap:         2.0Gi       686Mi       1.3Gi
```

pprof: profile001.pdf
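
For anyone trying to reproduce this, here is a minimal sketch of one way to capture such a heap profile from a Go binary. It assumes pprof is exposed over HTTP via net/http/pprof on a hypothetical 127.0.0.1:6060 listener; the issue does not say how pprof was enabled in this case.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose the profiling endpoints on a local-only port.
	log.Fatal(http.ListenAndServe("127.0.0.1:6060", nil))
}
```

A PDF like the attached one can then be produced with `go tool pprof -pdf http://127.0.0.1:6060/debug/pprof/heap > profile.pdf` (Graphviz required).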

Version of sing-box

```console
$ sing-box version
sing-box version 1.3.0

Environment: go1.20.5 linux/amd64
Tags: with_gvisor,with_quic,with_dhcp,with_wireguard,with_utls,with_reality_server,with_clash_api
Revision: e482053c8a01fe1d3f64ea4599d1896ca3c73298
CGO: enabled
```

Server and client configuration file

```console
Server:
{
  "log": { "disabled": false, "level": "warn", "output": "", "timestamp": true },
  "inbounds": [
    {
      "type": "vless", "tag": "in-vless", "listen": "::", "listen_port": 443,
      "tcp_fast_open": true, "udp_timeout": 30,
      "users": [ { "uuid": "......", "flow": "xtls-rprx-vision" } ],
      "tls": {
        "enabled": true, "server_name": "......",
        "reality": {
          "enabled": true,
          "handshake": { "server": "......", "server_port": 443 },
          "private_key": "......",
          "short_id": [ "......" ],
          "max_time_difference": "1m"
        }
      }
    }
  ],
  "outbounds": [ { "type": "direct", "tag": "out-direct" } ]
}

Client:
{
  "log": { "disabled": false, "level": "warn", "timestamp": true },
  "route": { "final": "out-vless", "default_mark": 255 },
  "inbounds": [
    { "type": "tproxy", "tag": "in-tproxy", "listen": "127.0.0.1", "listen_port": 3333, "tcp_fast_open": true, "sniff": true },
    { "type": "tproxy", "tag": "in-tproxy6", "listen": "::1", "listen_port": 3333, "tcp_fast_open": true, "sniff": true }
  ],
  "outbounds": [
    {
      "type": "vless", "tag": "out-vless", "server": "127.0.0.1", "server_port": 1181,
      "uuid": "......", "flow": "xtls-rprx-vision",
      "tls": {
        "enabled": true, "disable_sni": false, "server_name": "......", "insecure": false,
        "utls": { "enabled": true, "fingerprint": "......" },
        "reality": { "enabled": true, "public_key": "......", "short_id": "......" }
      },
      "packet_encoding": "xudp"
    },
    { "type": "direct", "tag": "out-direct" }
  ]
}
```

Server and client log file

```console
Since sing-box must run for hours to get a pprof result, I cannot paste the full info-level log here. The log level was set to warning. There were about ten thousand error log entries over roughly 8-9 hours, which fall mainly into three categories:
1. inbound/vless[in-vless]: process connection from xxxx:xx: dial tcp xxxx:xx: connect: connection refused
2. inbound/vless[in-vless]: process connection from xxxx:xx: dial tcp xxxx:xx: connect: no route to host
3. inbound/vless[in-vless]: process connection from xxxx:xx: REALITY: processed invalid connection
I don't know whether these logs are relevant.
```
Mahdi-zarei commented 1 year ago

I have the same problem. I tested various guesses and observed something odd: I changed the source code and added a gocron job that ran debug.FreeOSMemory every 10 seconds. It was clearly working, judging by the memory usage shown in htop, but memory still kept slowly increasing. I also had pprof enabled, and its samples reported around half of the memory htop was reporting. Given that the GC was returning memory to the OS, I am really confused about how and where the memory is being spent.
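
For reference, a minimal sketch of the workaround described above, assuming it is started alongside the proxy at process startup; a plain time.Ticker is used here instead of a gocron dependency, with the same effect:

```go
package main

import (
	"runtime/debug"
	"time"
)

// freeOSMemoryLoop forces a garbage collection and returns as much unused
// memory as possible to the operating system every 10 seconds.
func freeOSMemoryLoop() {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		debug.FreeOSMemory()
	}
}

func main() {
	go freeOSMemoryLoop()
	select {} // block forever; in the actual test this loop ran inside the modified sing-box process
}
```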

nekohasekai commented 1 year ago

Since the reality server directly uses code from XTLS/Reality, please check if XRay has the same problem.

Mahdi-zarei commented 1 year ago

> Since the reality server directly uses code from XTLS/Reality, please check if XRay has the same problem.

I had previously used Xray-core with the exact same inbound/outbound configuration on one of my servers, and its memory usage was pretty much constant, so I don't think the problem is in the REALITY part of the code. Another observation I once made: I completely changed all characteristics of the VLESS REALITY inbound, including the UUID, private key and short ID, but did not change the IP and port. As a result, all connections towards my server failed authentication and were closed, yet this caused a much faster memory increase; it took something like 2-3 minutes until 1 GB of RAM was used.

P.S.: my current config that has this problem has plain VLESS and SOCKS inbounds plus a VLESS + REALITY outbound (a domestic relay), and it is this config, not the one with the VLESS + REALITY inbound, that I have used with Xray. My main server, which has the VLESS + REALITY inbound, does show the same problem, but I have not used Xray with that config.

freakinyy commented 1 year ago

> Since the reality server directly uses code from XTLS/Reality, please check if XRay has the same problem.

I ran a test over the past few days. Two servers shared the same config as before; one ran sing-box and the other ran Xray. The only client was tproxy (sing-box) on a router, with the config as before. HAProxy on the client side load-balanced between the two servers with equal weights. A few hours later, sing-box's memory usage was much higher than Xray's. I then swapped the two servers' configs, i.e. sing-box ran on the server where Xray had been running and vice versa, with pprof enabled in both sing-box and Xray. About 24 hours later, sing-box's memory usage had grown to hundreds of MBs. The pprof results are here: sing-box.pdf xray.pdf. So I agree with @Mahdi-zarei about:

> I had previously used Xray-core with the exact same inbound/outbound configuration on one of my servers, and its memory usage was pretty much constant, so I don't think the problem is in the REALITY part of the code.

As for this:

> Another observation I once made: I completely changed all characteristics of the VLESS REALITY inbound, including the UUID, private key and short ID, but did not change the IP and port. As a result, all connections towards my server failed authentication and were closed, yet this caused a much faster memory increase; it took something like 2-3 minutes until 1 GB of RAM was used.

I'm not sure, but that sounds related to my error logs, i.e. errors leading to high memory usage. Maybe @Mahdi-zarei can open an issue with configs and logs.

Mahdi-zarei commented 1 year ago

I tested something today: I compiled the latest version, including the latest commit, and added pprof to it so I could observe its memory usage. I noticed that memory increases slightly more slowly than with the 1.3.0 release, so I went ahead and added a loop that runs FreeOSMemory every 10 seconds. The memory usage, though high, no longer increases indefinitely and hovers around 200-300 MB.

(Screenshot 2023-07-06 205538: memory usage of my domestic relay server)

I had a cronjob restarting sing-box every 30 seconds, and around the time marked in the screenshot I started the build described above and disabled the cronjob. As you can see, memory increases more slowly AND is freed on its own (though I suspect there are still problems; sing-box's memory consumption is still rather odd). It is also worth mentioning that without these changes, memory would sometimes reach my server's maximum (1 GB) even with a 1h cronjob, whereas now sing-box has been running for 8 hours and everything is still functional.

Here is the pprof output from when systemd was showing around 330 MB of memory used: profile002.pdf

and the config of my server:

```console
{
  "log": { "level": "warn", "timestamp": true },
  "dns": {
    "servers": [
      { "tag": "google", "address": "tls://8.8.8.8" },
      { "tag": "local", "address": "223.5.5.5", "detour": "direct" },
      { "tag": "block", "address": "rcode://success" }
    ],
    "rules": [ { "outbound": "any", "server": "local" } ],
    "strategy": "prefer_ipv4"
  },
  "inbounds": [
    { "type": "shadowsocks", "tag": "ss-in", "listen": "0.0.0.0", "listen_port": 50005, "tcp_fast_open": true, "domain_strategy": "prefer_ipv4", "method": "none", "password": "" },
    { "type": "vless", "tag": "tehranRelay", "listen": "0.0.0.0", "listen_port": 51349, "tcp_fast_open": true, "domain_strategy": "prefer_ipv4", "users": [ { "name": "nova", "uuid": "29b1457d-67cf-4789-fc6c-0c76b4cced70" } ] },
    { "type": "vless", "tag": "mellat", "listen": "0.0.0.0", "listen_port": 59493, "tcp_fast_open": true, "domain_strategy": "prefer_ipv4", "users": [ { "name": "mellat", "uuid": "2f35e1d4-cd34-4e69-e073-8e1c47fafa80" } ] },
    { "type": "vless", "tag": "javad", "listen": "0.0.0.0", "listen_port": 41300, "tcp_fast_open": true, "domain_strategy": "prefer_ipv4", "users": [ { "name": "javad", "uuid": "f97bd875-b9d6-41d8-8d21-79ae51c07e25" } ] },
    { "type": "mixed", "tag": "mixed-in", "listen": "0.0.0.0", "listen_port": 42500, "tcp_fast_open": true, "domain_strategy": "prefer_ipv4" }
  ],
  "outbounds": [
    {
      "type": "hysteria", "tag": "LondonH", "server": "", "server_port": 443,
      "up": "200 Mbps", "up_mbps": 200, "down": "200 Mbps", "down_mbps": 200,
      "obfs": "", "auth_str": "",
      "recv_window_conn": 20971520, "recv_window": 52428800,
      "tls": { "enabled": true, "server_name": "", "insecure": true, "alpn": [ "NOVA" ] },
      "reuse_addr": true, "tcp_fast_open": true
    },
    {
      "type": "vless", "tag": "LondonR", "server": "", "server_port": 443,
      "uuid": "e6aaba20-f713-4046-afd0-38c60e510932", "flow": "xtls-rprx-vision",
      "reuse_addr": true, "tcp_fast_open": true,
      "tls": {
        "enabled": true, "server_name": "",
        "utls": { "enabled": true, "fingerprint": "randomized" },
        "reality": { "enabled": true, "public_key": "", "short_id": "" }
      }
    },
    { "type": "direct", "tag": "direct" },
    { "type": "block", "tag": "block" },
    { "type": "dns", "tag": "dns-out" }
  ],
  "route": {
    "rules": [
      { "protocol": "dns", "outbound": "dns-out" },
      { "domain_suffix": [ ".ir" ], "outbound": "direct" },
      { "geoip": [ "ir" ], "outbound": "direct" },
      { "inbound": [ "ss-in" ], "outbound": "LondonH" },
      { "inbound": [ "tehranRelay", "mellat", "javad", "mixed-in" ], "outbound": "LondonR" }
    ],
    "auto_detect_interface": true
  }
}
```

The hysteria outbound is used far less than the VLESS one, and the problem was already present before I added the hysteria outbound, when I only had the VLESS one.

@nekohasekai @freakinyy is there any particular test I can run and share the results to help with this issue?

P.S.: I just recalled that my main server restarts sing-box every hour, and the drop in memory usage appears to coincide with the instance on the main server restarting. I will disable the cronjob on the main server later, observe the memory usage on both servers, and post the results here.

Mahdi-zarei commented 1 year ago

```console
root@ubuntu-g1-small1-simin-1 ~# lsof | grep ESTABLISHED | wc -l
15257
root@ubuntu-g1-small1-simin-1 ~# lsof | grep sing-box | wc -l
25568
```

At most around 20-30 people use my VPN, so I doubt this number of connections is normal. I assume there may be a bug where connections are not closed when they should be. That would also explain why restarting the upstream sing-box server releases memory on the downstream server, and why the memory systemd reports does not show up in the pprof report: if the memory is held by the connections themselves, and the kernel is in charge of the connections, pprof cannot track it even though the usage is attributed to sing-box.
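
If the leak really is connections that are never closed, one cross-check (a sketch only, assuming the pprof HTTP endpoint is reachable on the hypothetical 127.0.0.1:6060 from the earlier example) is to watch the goroutine count reported by pprof: each proxied connection is normally served by at least one goroutine, so a total far above the expected connection count, growing in step with the lsof numbers, would point at leaked connection handlers.

```go
package main

import (
	"bufio"
	"log"
	"net/http"
	"time"
)

// goroutineTotal fetches the first line of the textual goroutine profile,
// which looks like "goroutine profile: total 15234".
func goroutineTotal(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	scanner := bufio.NewScanner(resp.Body)
	if scanner.Scan() {
		return scanner.Text(), nil
	}
	return "", scanner.Err()
}

func main() {
	for range time.Tick(time.Minute) {
		line, err := goroutineTotal("http://127.0.0.1:6060/debug/pprof/goroutine?debug=1")
		if err != nil {
			log.Println("pprof query failed:", err)
			continue
		}
		log.Println(line) // a steadily growing total suggests handlers that never exit
	}
}
```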

nekohasekai commented 1 year ago

If you are reporting a reality issue, other types of protocols should not be included in the configuration, including the VLESS subprotocol vision.
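
For anyone preparing such a report, here is a minimal sketch of what a stripped-down server config could look like, derived from the server config quoted at the top of this issue with the vision flow and everything unrelated removed (the "......" placeholders are kept from the original):

```json
{
  "log": { "level": "warn", "timestamp": true },
  "inbounds": [
    {
      "type": "vless",
      "tag": "in-vless",
      "listen": "::",
      "listen_port": 443,
      "users": [ { "uuid": "......" } ],
      "tls": {
        "enabled": true,
        "server_name": "......",
        "reality": {
          "enabled": true,
          "handshake": { "server": "......", "server_port": 443 },
          "private_key": "......",
          "short_id": [ "......" ]
        }
      }
    }
  ],
  "outbounds": [ { "type": "direct", "tag": "out-direct" } ]
}
```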

freakinyy commented 1 year ago

The abnormal memory usage cannot be reproduced with version 1.3.1-beta.1 using the original config; it stays quite stable under some "stress testing". I think there is no need, and no way, to test further.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.