XTLS / Xray-core

Xray, Penetrates Everything. Also the best v2ray-core, with XTLS support. Fully compatible configuration.
https://t.me/projectXray
Mozilla Public License 2.0

gRPC cannot be proxied with dialerProxy to freedom #2232

Closed HirbodBehnam closed 1 year ago

HirbodBehnam commented 1 year ago

Hello, I found quite a strange bug in the gRPC transport. Take a look at this config:

{
  "log": {
    "loglevel": "debug"
  },
  "inbounds": [
    {
      "listen": "127.0.0.1",
      "port": "10808",
      "protocol": "socks",
      "settings": {
        "udp": true
      }
    },
    {
      "listen": "127.0.0.1",
      "port": "10809",
      "protocol": "http"
    }
  ],
  "outbounds": [
    {
      "protocol": "trojan",
      "settings": {
        "servers": [
          {
            "address": "example.com",
            "port": 443,
            "password": "pass"
          }
        ]
      },
      "streamSettings": {
        "network": "gun",
        "security": "tls",
        "grpcSettings": {
          "serviceName": "servername"
        },
        "sockopt": {
          "dialerProxy": "direct"
        }
      }
    },
    {
      "protocol": "freedom",
      "tag": "direct"
    }
  ]
}

This config should just route the gRPC traffic through a freedom outbound, so there should be no difference between adding dialerProxy and leaving it out. However, I couldn't get this config to work: if I remove the sockopt from the config, it works fine. The logs look like this:

2023/06/20 16:13:41 [Warning] core: Xray 1.8.3 started
2023/06/20 16:13:43 [Info] [2221018126] proxy/socks: TCP Connect request to tcp:[2a00:1450:4001:827::200e]:80
2023/06/20 16:13:43 [Info] [2221018126] app/dispatcher: default route for tcp:[2a00:1450:4001:827::200e]:80
2023/06/20 16:13:45 tcp:127.0.0.1:36732 accepted tcp:[2a00:1450:4001:827::200e]:80
2023/06/20 16:13:49 [Info] [2221018126] transport/internet/grpc: creating connection to tcp:104.21.2.133:443
2023/06/20 16:13:49 [Debug] transport/internet/grpc: using gRPC tun mode service name: `...` stream name: `Tun`
2023/06/20 16:13:49 [Info] [2221018126] transport/internet: redirecting request tcp:104.21.2.133:443 to fragment
2023/06/20 16:13:49 [Info] [2221018126] transport/internet/tcp: dialing TCP to tcp:104.21.2.133:443
2023/06/20 16:13:49 [Debug] transport/internet: dialing to tcp:104.21.2.133:443
2023/06/20 16:13:49 [Info] [2221018126] proxy/freedom: connection opened to tcp:104.21.2.133:443, local endpoint 172.16.0.2:48020, remote endpoint 104.21.2.133:443
2023/06/20 16:13:54 [Info] [2221018126] proxy/trojan: tunneling request to tcp:[2a00:1450:4001:827::200e]:80 via 104.21.2.133:443
multi read transport/internet/grpc/encoding: failed to fetch hunk from gRPC tunnel > rpc error: code = Unavailable desc = error reading from server: EOF
transport/internet/grpc/encoding: failed to send data over gRPC tunnel > EOF
2023/06/20 16:14:36 [Info] [2221018126] app/proxyman/outbound: failed to process outbound traffic > proxy/trojan: connection ends > transport/internet/grpc/encoding: failed to fetch hunk from gRPC tunnel > rpc error: code = Unavailable desc = error reading from server: EOF

And Wireshark shows that my own PC sends a FIN to the server (screenshot attached). I tried digging into Xray's and Google's gRPC source code with a debugger, watching when the pipes get closed, but I couldn't figure it out. HOWEVER, I found an alternative way to forward any traffic to a specific outbound: use a dokodemo-door inbound with a dedicated routing rule, and point the outbound at the dokodemo-door address. Consider the following config file:

{
  "log": {
    "loglevel": "debug"
  },
  "inbounds": [
    {
      "listen": "127.0.0.1",
      "port": "10808",
      "protocol": "socks",
      "settings": {
        "udp": true
      }
    },
    {
      "listen": "127.0.0.1",
      "port": "10809",
      "protocol": "http"
    },
    {
      "listen": "127.0.0.1",
      "port": "28111",
      "protocol": "dokodemo-door",
      "settings": {
        "address": "104.21.2.133",
        "port": 443,
        "network": "tcp"
      },
      "tag": "fragmentedinbound"
    }
  ],
  "outbounds": [
    {
      "protocol": "trojan",
      "settings": {
        "servers": [
          {
            "address": "127.0.0.1",
            "port": 28111,
            "password": "password"
          }
        ]
      },
      "streamSettings": {
        "network": "gun",
        "security": "tls",
        "grpcSettings": {
          "serviceName": "..."
        },
        "tlsSettings": {
          "serverName": "..."
        }
      }
    },
    {
      "protocol": "freedom",
      "settings": {
        "fragment": {
          "length": "1-2",
          "interval": "0-1",
          "packets": "1"
        }
      },
      "tag": "fragment"
    },
    {
      "protocol": "freedom",
      "tag": "direct"
    }
  ],
  "routing": {
    "domainMatcher": "mph",
    "domainStrategy": "IPIfNonMatch",
    "rules": [
      {
        "domain": [
          "regexp:.*\\.ir$",
          "ext:iran.dat:ir",
          "ext:iran.dat:other"
        ],
        "outboundTag": "direct",
        "type": "field"
      },
      {
        "ip": [
          "geoip:private",
          "geoip:ir"
        ],
        "outboundTag": "direct",
        "type": "field"
      },
      {
        "inboundTag": [
          "fragmentedinbound"
        ],
        "outboundTag": "fragment",
        "type": "field"
      }
    ]
  }
}

This is basically the config I'm currently using to connect. I'm not expecting this to be fixed anytime soon, considering that there is a neat workaround.

RPRX commented 1 year ago

Thanks for your testing. It may not be a problem inside gRPC itself, but rather in how Xray calls gRPC.

dialerProxy and the gRPC transport were both introduced in Xray-core v1.4.0 at the same time, so they may never have been adapted to each other. You could check the code and then send a PR.

RPRX commented 1 year ago

Has this been fixed yet?

ghost commented 1 year ago

Has this been fixed yet?

@RPRX

I checked this with xray v1.8.3 and I can confirm that as @HirbodBehnam mentioned, it doesn't work with gRPC. On the other hand, WS works fine.

cty123 commented 1 year ago

I was able to reproduce the problem, but couldn't figure out the root cause either. It seems to be a real problem: the outbound connection is being dropped somewhere inside the freedom proxy.

RPRX commented 1 year ago

@cty123 Can you share some of your findings?

RPRX commented 1 year ago

Fixing this shouldn't be hard: insert some logging to see where the connection breaks and you'll find it.

cty123 commented 1 year ago

I've already tried that. The earliest break happens in freedom, here: https://github.com/XTLS/Xray-core/blob/main/proxy/freedom/freedom.go#L205 — the error shown is `use of closed network connection`, but it's not clear why the connection gets closed. I read the gRPC docs (https://github.com/grpc/grpc-go) and enabled all logging as described there. The server side reports that the client closed the connection first, but the client-side gRPC reports the close reason as EOF, so the root cause of the disconnect is still unknown.

RPRX commented 1 year ago

There used to be a bug where cancelling the ctx of one gRPC sub-connection would cancel the entire gRPC connection; this EOF may have a similar cause.

RPRX commented 1 year ago

The same applies to h2. If h2 with dialerProxy has this problem too, we can be fairly sure that's it.

cty123 commented 1 year ago

I can give that a try.

RPRX commented 1 year ago

Then it's yours. Your feedback was very helpful: the symptom "the server side shows the client closed the connection first, but the client-side gRPC shows the close reason as EOF" fits that bug well. The fix is to not pass the original ctx, and only copy over some key information (if necessary). See:

https://github.com/XTLS/Xray-core/blob/10d6b065784efd3f33a02d6d5ad2a1fa162ff346/transport/internet/grpc/dial.go#L100-L102

RPRX commented 1 year ago

There used to be a bug where cancelling the ctx of one gRPC sub-connection would cancel the entire gRPC connection; this EOF may have a similar cause.

To correct that description: the original ctx was passed to the sub-connection, and when the sub-connection ended it called cancel, which tore down the entire gRPC connection (manifesting as dropped streams).

RPRX commented 1 year ago

There used to be a bug where cancelling the ctx of one gRPC sub-connection would cancel the entire gRPC connection; this EOF may have a similar cause.

But my blind guess is that this bug fits that description: gRPC probably passed the first sub-connection's gctx to dialerProxy, and then that gctx got cancelled...

cty123 commented 1 year ago

It really is exactly as you said — and I'd been debugging this for days. It's exactly the place you pointed to: https://github.com/XTLS/Xray-core/blob/10d6b065784efd3f33a02d6d5ad2a1fa162ff346/transport/internet/grpc/dial.go#L100-L102 — I created a new context and passed that in instead, and it immediately worked perfectly.

RPRX commented 1 year ago

There used to be a bug where cancelling the ctx of one gRPC sub-connection would cancel the entire gRPC connection; this EOF may have a similar cause.

To correct that description: the original ctx was passed to the sub-connection, and when the sub-connection ended it called cancel, which tore down the entire gRPC connection (manifesting as dropped streams). But my blind guess was that this bug fits that description: gRPC probably passed the first sub-connection's gctx to dialerProxy, and then that gctx got cancelled...

Correcting & summarizing once more:

  1. The ctx parameter of getGrpcClient is the ctx of each proxied connection. The earlier bug was that there was no gctx, and since gRPC only dials once, it effectively honored only the first proxied connection's ctx: when that ctx was cancelled, the whole gRPC connection went down.
  2. With this new bug, I looked into what this gctx is actually for. grpc.WithContextDialer is the successor of grpc.WithDialer, which took a time.Duration parameter. Looking at the code, this gctx is only meant to control the dial timeout, but dialerProxy treated it as the *ray ctx, so it broke even faster than the previous bug.

RPRX commented 1 year ago

By that account, gRPC + dialerProxy has actually never worked at all.

While writing the enhanced XUDP I ran into the extremely tricky requirement of "let the original ctx control only the dial timeout, without letting it cancel the Copy, while still letting the outbound's own timeout policy take effect". I made several attempts, left two easter eggs, and the final solution was to mark the original ctx as TimeoutOnly and rework each outbound: https://github.com/XTLS/Xray-core/commit/be23d5d3b741268ef86f27dfcb06389e97447e87

So you can reuse that directly. @cty123, try marking the gctx as TimeoutOnly; if it works, send a PR, and remember to cover H2 as well.

RPRX commented 1 year ago

I've fixed it. Please test whether gRPC in https://github.com/XTLS/Xray-core/commit/d92002ad127f64bc1e740cb350eafd693ffadd6d can use dialerProxy.

H2 shouldn't originally have had this problem, since it used context.Background(); this change switches it to DialTLSContext, so please test whether that introduces any new problems.

The ctx previously passed to the REALITY UClient was never actually used. I'd been meaning to change that for a while, and took this opportunity to switch to uConn.HandshakeContext(ctx); this also needs testing for regressions. As for this handshake timeout, it should eventually be tuned with reference to browser behavior; otherwise the GFW could deliberately make a handshake time out and use that for precise identification.

RPRX commented 1 year ago

There used to be a bug where cancelling the ctx of one gRPC sub-connection would cancel the entire gRPC connection; this EOF may have a similar cause.

Yet another correction: I misremembered — it wasn't gRPC, it was a bug H2 had before: https://github.com/XTLS/Xray-core/issues/289#issuecomment-783337941 , fixed in https://github.com/XTLS/Xray-core/issues/289#issuecomment-787060604