SagerNet / sing-box

The universal proxy platform
https://sing-box.sagernet.org/

[Android] WireGuard does not work properly after network outage and recovery #1415

Closed. lisongmin closed this issue 1 month ago

lisongmin commented 8 months ago

Operating system

Android

System version

lineageos 20

Installation type

sing-box for Android Graphical Client

If you are using a graphical client, please provide the version of the client.

1.8.4

Version

No response

Description

WireGuard works properly on startup. However, after the network is disconnected and reconnected, it can no longer complete a handshake with the server.

Reproduction

  1. Start sing-box.
  2. Check that the WireGuard network is reachable with curl http://192.168.2.1; it works.
  3. Turn off the network, then turn it back on.
  4. Wait a while and try to access http://192.168.2.1 again; it no longer works.

I captured the traffic on the server (see the tcpdump output below). The server appears to receive data from sing-box and to send data back to it.

Configuration

{
  "log": { "level": "debug" },
  "dns": {
    "servers": [
      {
        "tag": "home-dns",
        "address": "udp://192.168.6.1",
        "detour": "direct",
        "strategy": "ipv4_only"
      },
      {
        "tag": "wg-dns",
        "address": "udp://192.168.2.6",
        "detour": "go-home",
        "strategy": "ipv4_only"
      },
      {
        "tag": "default-dns",
        "strategy": "ipv4_only",
        "address": "h3://223.5.5.5/dns-query",
        "detour": "direct"
      }
    ],
    "rules": [
      {
        "domain_suffix": [".home.example.com"],
        "wifi_ssid": ["home-dns"],
        "server": "family"
      },
      {
        "domain_suffix": [".home.example.com"],
        "server": "wg-dns"
      }
    ],
    "final": "default-dns"
  },
  "inbounds": [
    {
      "type": "tun",
      "tag": "tun-in",
      "interface_name": "tun0",
      "inet4_address": "172.19.0.1/30",
      "inet6_address": "fdfe:2204:cfab::1/126",
      "mtu": 9000,
      "auto_route": true,
      "strict_route": true,
      "inet4_route_address": ["0.0.0.0/1", "128.0.0.0/1"],
      "inet6_route_address": ["::/1", "8000::/1"],
      "endpoint_independent_nat": false,
      "stack": "system",
      "sniff": true
    }
  ],
  "outbounds": [
    { "type": "direct", "tag": "direct" },
    { "type": "block", "tag": "block" },
    { "type": "dns", "tag": "dns" },
    {
      "type": "wireguard",
      "tag": "go-home",
      "local_address": ["10.249.0.3/32"],
      "private_key": "KNx4llKEZwqB5Q69MMVlFfj+7pVaRIFiw63tkSvblmA=",
      "peers": [
        {
          "server": "home.example.com",
          "server_port": 51802,
          "public_key": "DBjU7sR7/Qx65b6m4IKTAZrjDHBeWsruMyoSpV1ES1U=",
          "allowed_ips": ["192.168.2.0/24", "10.249.0.0/24"]
        }
      ]
    }
  ],
  "route": {
    "final": "direct",
    "auto_detect_interface": true,
    "rules": [
      { "protocol": "dns", "outbound": "dns" },
      {
        "wifi_ssid": ["abc"],
        "ip_cidr": ["192.168.2.0/24", "10.249.0.0/24"],
        "outbound": "direct"
      },
      {
        "ip_cidr": ["192.168.2.0/24", "10.249.0.0/24"],
        "outbound": "go-home"
      }
    ]
  }
}

sing-box logs

sfa.log

tcpdump on the server

Before network disconnect

21:37:53.245737 pppoe-wan In  IP 180.139.224.173.24192 > 124.227.226.83.51802: UDP, length 96
21:37:53.248727 pppoe-wan In  IP 180.139.224.173.24192 > 124.227.226.83.51802: UDP, length 96
21:37:53.249471 pppoe-wan Out IP 124.227.226.83.51802 > 180.139.224.173.24192: UDP, length 96
21:37:53.279670 pppoe-wan In  IP 180.139.224.173.24192 > 124.227.226.83.51802: UDP, length 96
21:38:03.482498 pppoe-wan Out IP 124.227.226.83.51802 > 180.139.224.173.24192: UDP, length 32
21:38:03.505715 pppoe-wan In  IP 180.139.224.173 > 124.227.226.83: ICMP 180.139.224.173 udp port 24192 unreachable, length 68

After network recovery

21:38:04.607864 pppoe-wan In  IP 180.139.224.173.24206 > 124.227.226.83.51802: UDP, length 96
21:38:04.608708 pppoe-wan Out IP 124.227.226.83.51802 > 180.139.224.173.24206: UDP, length 96
21:38:05.511612 pppoe-wan In  IP 180.139.224.173.24206 > 124.227.226.83.51802: UDP, length 96
21:38:05.512196 pppoe-wan Out IP 124.227.226.83.51802 > 180.139.224.173.24206: UDP, length 80
21:38:05.581322 pppoe-wan In  IP 180.139.224.173.24206 > 124.227.226.83.51802: UDP, length 96
21:38:05.581891 pppoe-wan Out IP 124.227.226.83.51802 > 180.139.224.173.24206: UDP, length 96
21:38:06.602565 pppoe-wan Out IP 124.227.226.83.51802 > 180.139.224.173.24206: UDP, length 96

Logs

No response


hellodword commented 7 months ago

Ran into a similar issue on Linux with "auto_detect_interface": true.

Everything works fine until the default interface changes:

DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - received handshake response
DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - sending keepalive packet
INFO router: updated default interface eth0, index 2
DEBUG outbound/wireguard[warp]: routine: receive incoming receive - stopped
DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - retrying handshake because we stopped hearing back after 15 seconds
DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - sending handshake initiation
DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - handshake did not complete after 5 seconds, retrying (try 2)
DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - handshake did not complete after 5 seconds, retrying (try 3)
DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - handshake did not complete after 5 seconds, retrying (try 4)

All the handshakes fail and never succeed again.

I guess it's a bug. I'll try to come up with a minimal reproduction.

hellodword commented 7 months ago

With a small patch:

diff --git a/outbound/wireguard.go b/outbound/wireguard.go
index 045241f..c08c6b8 100644
--- a/outbound/wireguard.go
+++ b/outbound/wireguard.go
@@ -165,7 +165,10 @@ func (w *WireGuard) Close() error {
 }

 func (w *WireGuard) InterfaceUpdated() {
-   w.device.BindUpdate()
+   err := w.device.BindUpdate()
+   if err != nil {
+       w.logger.Error("InterfaceUpdated ", err)
+   }
    return
 }

The log then shows:

INFO router: updated default interface eth0, index 2
DEBUG outbound/wireguard[warp]: routine: receive incoming receive - stopped
ERROR outbound/wireguard[warp]: InterfaceUpdated use of closed network connection
INFO router: updated default interface wlp2s0, index 3
ERROR outbound/wireguard[warp]: InterfaceUpdated use of closed network connection
INFO router: updated default interface eth0, index 2
ERROR outbound/wireguard[warp]: InterfaceUpdated use of closed network connection

nekohasekai commented 7 months ago

Try https://github.com/SagerNet/sing-box/commit/dd52c26ae1bd6751b99d75d315048d71c592f033

hellodword commented 7 months ago

With https://github.com/SagerNet/sing-box/commit/dd52c26ae1bd6751b99d75d315048d71c592f033 on v1.8.6 I still get the same errors. My bad, though: I didn't mention that I'm using detour and system_interface with the WireGuard outbound:

{
      "detour": "auto:proxy",
      "interface_name": "warp",
      "system_interface": true,
      "tag": "warp",
      "type": "wireguard"
      ...
}

I'm working on a minimal reproduction.

hellodword commented 7 months ago

{
  "inbounds": [
    {
      "listen": "0.0.0.0",
      "listen_port": 1080,
      "type": "mixed"
    }
  ],
  "log": {
    "disabled": false,
    "level": "trace",
    "timestamp": true
  },
  "outbounds": [
    {
      "tag": "warp",
      "detour": "proxy",
      "system_interface": false,
      "type": "wireguard",
      ???
    },
    {
      "tag": "proxy",
      "type": "vmess",
      ???
    }
  ],
  "route": {
    "auto_detect_interface": true,
    "final": "warp"
  }
}

I have two network interfaces, eth0 and wlp2s0, and I can reproduce the errors by plugging and unplugging eth0.

Dr4tez commented 6 months ago

Similar problem. When I enable WireGuard in sing-box on my Android phone away from home over mobile data, it works fine. When I come home and my phone connects to my home Wi-Fi, Internet access on the phone disappears, and to get it back I have to stop sing-box. sing-box 1.8.8, Android 14. Update: I checked 1.9.0-beta.8 and the same problem exists.

jwfang commented 6 months ago

I think these are caused by a stale bound/connected UDP socket.

Currently, the WireGuard transport creates and connects the underlying UDP socket on start and reuses that same socket for all subsequent sends and receives. Once connected, the socket is bound to a specific local IP and port.

After a network change or recovery, the host's IP address may change, and the socket's local IP address is then no longer available. The socket API reports no error for UDP on such a socket, so sends appear to succeed (although the packets may or may not reach the destination) while nothing is ever received afterward.

Such an undetected dead UDP socket also causes problems with IPv6. Some ISPs change the assigned prefix periodically, so the host's IPv6 address changes and the previously bound UDP socket dies. Also, during startup, while the IPv6 address is still in the tentative state, connect succeeds but binds to a link-local IPv6 address, which also leaves a dead socket.

I think these conditions can be simulated by manually deleting or changing the host's local IP address (the one the UDP socket is bound to), and tested using Docker and netcat.

If this can't easily be detected, maybe we can simply recreate/reconnect the UDP socket when nothing has been received for a certain duration.

pierre-primary commented 6 months ago

The same thing happens after network changes or restoration, and also when using the Clash API to close all connections: the WireGuard connection fails to recover automatically.

My setup: a WireGuard configuration with an upstream, deployed on a separate Linux device (an LXC container in Proxmox).

BehradJi commented 6 months ago

I had the same issue on Android 14 with sing-box 1.8.9. Setting "gso": true in the WireGuard outbound configuration fixed the connection drop after switching networks, but it now takes about 30 seconds to come back online.

Dr4tez commented 6 months ago

> I had the same issue on Android 14 with Sing-Box 1.8.9. While setting "gso":true in the Wireguard outbound configuration fixed the connection drop after switching networks, it now takes about 30 seconds to come back online.

Thanks for the tip, it worked for me! There are no more WireGuard connection drops when moving from one network to another. At least, the switch is not noticeable at all: not 30 seconds, not even one second. Android 14, arm64-v8a, 1.9.0-beta.16.

nekohasekai commented 6 months ago

Try f61b272cbf3732ac7d8307ee787963ba78ca5945

hellodword commented 6 months ago

Commit https://github.com/SagerNet/sing-box/commit/f61b272cbf3732ac7d8307ee787963ba78ca5945 works for me with 1.8.9:

03:24:39 INFO router: updated default interface wlp2s0, index 3
03:24:39 DEBUG outbound/wireguard[warp]: routine: receive incoming receive - stopped
03:24:39 DEBUG outbound/wireguard[warp]: udp bind has been updated
03:24:39 DEBUG outbound/wireguard[warp]: routine: receive incoming receive - started
03:24:39 INFO outbound/vmess[proxy-1]: outbound packet connection to 162.159.192.1:2408
03:25:03 DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - sending handshake initiation
03:25:03 DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - received handshake response
03:25:03 DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - sending keepalive packet
03:25:20 DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - retrying handshake because we stopped hearing back after 15 seconds
03:25:20 DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - sending handshake initiation
03:25:20 DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - received handshake response
03:25:20 DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - sending keepalive packet
03:25:21 INFO router: updated default interface eth0, index 2
03:25:21 DEBUG outbound/wireguard[warp]: routine: receive incoming receive - stopped
03:25:21 DEBUG outbound/wireguard[warp]: udp bind has been updated
03:25:21 DEBUG outbound/wireguard[warp]: routine: receive incoming receive - started
03:25:21 INFO outbound/vmess[proxy-1]: outbound packet connection to 162.159.192.1:2408
03:25:36 DEBUG outbound/wireguard[warp]: peer(bmXO…fgyo) - sending keepalive packet
03:25:37 INFO router: updated default interface wlp2s0, index 3
03:25:37 DEBUG outbound/wireguard[warp]: routine: receive incoming receive - stopped
03:25:37 DEBUG outbound/wireguard[warp]: udp bind has been updated
03:25:37 DEBUG outbound/wireguard[warp]: routine: receive incoming receive - started
03:25:37 INFO outbound/vmess[proxy-1]: outbound packet connection to 162.159.192.1:2408
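
The log above shows, on each default-interface change, the receive routine stopping, the UDP bind being updated, and the receive routine starting again. A minimal Go sketch of that rebind pattern follows. rebindableConn, Rebind, receiveLoop and runDemo are hypothetical names, and this is not the code from the commit:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"sync"
	"time"
)

// rebindableConn is a hypothetical holder for the UDP socket a tunnel uses.
type rebindableConn struct {
	mu     sync.Mutex
	remote *net.UDPAddr
	conn   *net.UDPConn
}

// Rebind dials the new socket before closing the old one, so a failed dial
// leaves the current socket in place. Closing the old socket makes a Read
// blocked on it return net.ErrClosed.
func (r *rebindableConn) Rebind() error {
	fresh, err := net.DialUDP("udp", nil, r.remote)
	if err != nil {
		return err
	}
	r.mu.Lock()
	old := r.conn
	r.conn = fresh
	r.mu.Unlock()
	if old != nil {
		return old.Close()
	}
	return nil
}

func (r *rebindableConn) Current() *net.UDPConn {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.conn
}

// receiveLoop reads from one socket until that socket is closed.
func receiveLoop(conn *net.UDPConn, msgs chan<- string, stopped chan<- error) {
	buf := make([]byte, 1500)
	for {
		n, err := conn.Read(buf)
		if err != nil {
			stopped <- err
			return
		}
		msgs <- string(buf[:n])
	}
}

// runDemo rebinds once against a loopback echo server and reports whether the
// old loop stopped with net.ErrClosed and what the new socket received.
func runDemo() (bool, string, error) {
	echo, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return false, "", err
	}
	defer echo.Close()
	go func() { // echo every datagram back to its sender
		buf := make([]byte, 1500)
		for {
			n, addr, err := echo.ReadFromUDP(buf)
			if err != nil {
				return
			}
			echo.WriteToUDP(buf[:n], addr)
		}
	}()
	rb := &rebindableConn{remote: echo.LocalAddr().(*net.UDPAddr)}
	if err := rb.Rebind(); err != nil { // initial bind
		return false, "", err
	}
	defer func() { rb.Current().Close() }()
	msgs, stopped := make(chan string, 1), make(chan error, 2)
	go receiveLoop(rb.Current(), msgs, stopped)
	// e.g. the router reports a new default interface
	if err := rb.Rebind(); err != nil {
		return false, "", err
	}
	oldStopped := errors.Is(<-stopped, net.ErrClosed)
	go receiveLoop(rb.Current(), msgs, stopped)
	if _, err := rb.Current().Write([]byte("hello")); err != nil {
		return oldStopped, "", err
	}
	select {
	case m := <-msgs:
		return oldStopped, m, nil
	case <-time.After(2 * time.Second):
		return oldStopped, "", errors.New("no echo received")
	}
}

func main() {
	fmt.Println(runDemo()) // true hello <nil>
}
```

Dialing before closing means a failed dial never leaves the tunnel without a socket; closing the old socket is what unblocks its reader, which corresponds to the "receive incoming receive - stopped" line in the log.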

Dr4tez commented 6 months ago

On versions 1.8.10 through 1.10.0-alpha.6, the app's interface stops responding to input after switching from Wi-Fi to mobile data if the configuration has active WireGuard outbounds without "gso": true. Android 14, arm64-v8a. Update: on version 1.10.0-alpha.7 the bug disappeared completely. Thanks and glory to the developer!

Dondrejohnson5 commented 5 months ago

Thanks for those tips, man; "gso": true really does work for me! Gosh, I've had this problem with sing-box forever and always wondered if it was just me.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 5 days

hellodword commented 1 month ago

I think this issue hasn't been fully resolved, as it still happens occasionally. I've tried to reproduce it, but it's quite challenging.

@nekohasekai Please keep this issue open.