alibaba / higress

🤖 AI Gateway | AI Native API Gateway
https://higress.io
Apache License 2.0
3.26k stars 516 forks

proxy js file failed #795

Open ray1888 opened 10 months ago

ray1888 commented 10 months ago

If you are reporting any crash or any potential security issue, do not open an issue in this repo. Please report the issue via ASRC(Alibaba Security Response Center) where the issue will be triaged appropriately.

Ⅰ. Issue Description

The problem: while I was trying Higress as our team's new ingress gateway, some JS files served by an upstream Nginx service came back through the proxy as empty files of size 0.

Ⅱ. Describe what happened

[Screenshot: Higress gateway access log]

The gateway log is shown above, and the Chrome console reports:

[Screenshot: Chrome console error]

This indicates that the request for the JS resource did not receive a valid response body, because the body_sent field in the log is 0.

Ⅲ. Describe what you expected to happen

The screenshot below shows a request sent directly through the Service NodePort. It returns the full response, with no console error.

[Screenshot: request through the NodePort]

Ⅳ. How to reproduce it (as minimally and precisely as possible)

  1. xxx
  2. xxx
  3. xxx

Ⅴ. Anything else we need to know?

Ⅵ. Environment:

CH3CHO commented 10 months ago

Could you provide the corresponding route configuration? Do you have any WASM plugins enabled for this route?

ray1888 commented 10 months ago

Request through Higress:

curl 'http://log.gitee.work/assets/login/js/chunk-vendors.705a060b.js' \
  -H 'Accept: */*' \
  -H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'Pragma: no-cache' \
  -H 'Referer: http://log.gitee.work/login' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36' \
  --compressed \
  --insecure

Request through the NodePort:

curl 'http://log.gitee.work:20571/assets/login/js/chunk-vendors.705a060b.js' \
  -H 'Accept: */*' \
  -H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'Pragma: no-cache' \
  -H 'Referer: http://log.gitee.work:20571/login' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36' \
  --compressed \
  --insecure

ray1888 commented 10 months ago

Could you provide the corresponding route configuration? Do you have any WASM plugins enabled for this route?

Sure.

  1. The route configuration is below: [image]

     Service for the NodePort: [image]

  2. No WASM plugin is enabled in this route configuration.

johnlanni commented 10 months ago

@ray1888 The Higress log field response_code is 200, indicating that the response is returned by the upstream service. The error is net::ERR_CONTENT_LENGTH_MISMATCH. It may be that your upstream service did not return a complete response.
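One way to check this outside the browser is to compare the Content-Length a response declares with the number of body bytes that actually arrive. The sketch below is illustrative only: `content_length_gap` is a hypothetical helper, not part of Higress or Nginx, and it assumes the raw HTTP/1.1 response bytes have already been saved (for example, extracted from a packet capture) and that the response is not chunked.

```python
def content_length_gap(raw_response: bytes) -> int:
    """Return declared Content-Length minus the body bytes actually present.

    0 means the body is complete; a positive value means the body was
    cut short, which is the condition browsers report as
    net::ERR_CONTENT_LENGTH_MISMATCH.
    """
    head, separator, body = raw_response.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("end of headers not found")

    declared = None
    # Skip the status line, then scan the header lines.
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            declared = int(value.strip())

    if declared is None:
        raise ValueError("response has no Content-Length header")
    return declared - len(body)
```

A gap equal to the full declared length corresponds to the "200 with an empty body" case described in this issue.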

johnlanni commented 10 months ago

Are you using Nginx as the upstream?

cc https://github.com/xhlwill/blog/issues/17#issuecomment-848631589

ray1888 commented 10 months ago

Yes, the upstream Service is Nginx.

ray1888 commented 10 months ago

[image]

Did you use nginx as the upstream?

cc xhlwill/blog#17 (comment)

[image] I have checked the Nginx log, and everything looks normal on the Nginx side.

johnlanni commented 10 months ago

I found the upstream host is 10.244.18.92:8080 in your log. Try this one:

curl 'http://10.244.18.92:8080/assets/login/js/chunk-vendors.705a060b.js' \
  -H 'Host: log.gitee.work' \
  -H 'Accept: */*' \
  -H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'Pragma: no-cache' \
  -H 'Referer: http://log.gitee.work:20571/login' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36' \
  --compressed \
  --insecure

ray1888 commented 10 months ago

Inside the Kubernetes cluster, the curl response is normal. [image]

ray1888 commented 10 months ago

@johnlanni So I don't think this is an Nginx problem. I checked the Pod: its IP is 10.244.18.92, a direct request to it returns a normal response, and a request through the Kubernetes NodePort Service returns a normal response body as well.

johnlanni commented 10 months ago

@ray1888 Please run tcpdump in the higress-gateway pod (you may need to run this command on the node, or switch the pod user to root):

tcpdump -i any host 10.244.18.92 and port 8080 -A

Then access the JS file from the browser; the tcpdump output will show the full request headers.

Then curl 10.244.18.92:8080 with those request headers and find out which header makes Nginx return 200 without a response body.
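The header-by-header check described here can also be scripted. The following is a minimal sketch, not a Higress tool: `find_offending_headers` and `http_body_size` are hypothetical names, and the loop only detects the case where removing a single header is enough to bring the body back.

```python
import http.client
import urllib.request
from typing import Callable, Dict, List


def find_offending_headers(
    headers: Dict[str, str],
    body_size: Callable[[Dict[str, str]], int],
) -> List[str]:
    """Return header names whose removal turns an empty body into a non-empty one."""
    if body_size(headers) > 0:
        return []  # the full header set already yields a body
    offenders = []
    for name in headers:
        trimmed = {k: v for k, v in headers.items() if k != name}
        if body_size(trimmed) > 0:
            offenders.append(name)
    return offenders


def http_body_size(url: str) -> Callable[[Dict[str, str]], int]:
    """Build a body_size callable that performs a real GET against url."""

    def fetch(headers: Dict[str, str]) -> int:
        request = urllib.request.Request(url, headers=headers)
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                return len(response.read())
        except http.client.IncompleteRead as err:
            # Fewer bytes arrived than Content-Length promised.
            return len(err.partial)

    return fetch
```

For the case in this thread, the call would look like `find_offending_headers(captured_headers, http_body_size("http://10.244.18.92:8080/assets/login/js/chunk-vendors.705a060b.js"))`, where `captured_headers` is the header set copied from the tcpdump output. An empty result means no single header explains the empty body.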

ray1888 commented 10 months ago

Can I run tcpdump on the gateway node, or only on the node where the Pod is running?

ray1888 commented 10 months ago

@johnlanni Is this screenshot OK, or do I need to post the pcap file? [image]

ray1888 commented 10 months ago

@johnlanni I tried removing the headers one by one, and none of them affects the response when I curl 10.244.18.92:8080 as the destination.

ray1888 commented 10 months ago

@johnlanni I also tried curling the JS file on the destination Service from inside the Higress gateway, and that works too. [image]

But the user's browser still cannot get the response when the request goes through Higress.

johnlanni commented 10 months ago

@johnlanni Is this screenshot OK, or do I need to post the pcap file? [image]

Did you try curl with the headers from the output?

The tcpdump output also shows that Nginx did not return the response body, which indicates that the problem is on the Nginx side rather than Higress discarding the response body.

ray1888 commented 10 months ago

Yes, I tried that with curl against Nginx. Removing the headers one by one does not affect the response.

ray1888 commented 10 months ago

@johnlanni I have a new clue. I ran curl from a Kubernetes node against the hostname, and the result is below. [image] I don't understand why the connection is closed before the response data arrives. This traffic goes through the Higress gateway.

johnlanni commented 10 months ago

@ray1888 As you can see from the tcpdump output above, nginx did not return the response body. I think we need to find out the reason first.

ray1888 commented 10 months ago

[image]

When I curl directly from the node, the connection is not closed before the data transfer finishes. Could the request be exceeding Higress's default timeout?

ray1888 commented 10 months ago

[image]

When I curl directly from the node, the connection is not closed before the data transfer finishes. Doesn't that show it is not an Nginx problem? Could the request be exceeding Higress's default timeout? @johnlanni

johnlanni commented 10 months ago

@ray1888 The response code is 200. If the request had timed out, the response code would be 504.

ray1888 commented 9 months ago

Could you assist remotely over DingTalk?

johnlanni commented 9 months ago

@ray1888 Sure. You can find me in the DingTalk community group; my nickname is 澄潭.

ray1888 commented 9 months ago

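@johnlanni I checked our company policy and remote access is not allowed, but I wrote the tcpdump output to a file and exported it: response.zip. It contains quite a lot of ReAck and Dup ACK packets. What is odd is that, as I understand it, the proxying here should be at layer 7, yet the tcpdump output shows only layer-4 traffic: the requests to and responses from port 8080 of the target Pod. [image]

The capture below also shows that, across repeated gateway requests for the JS file, duplicate ACKs cause continual retransmission, and after several failed retransmissions the gateway side actively sends an RST and resets the connection.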

image

CH3CHO commented 9 months ago

What additional nodes does traffic pass through between the gateway Pod and Nginx?

ray1888 commented 9 months ago

[image: flow] The gateway Pod is on Node2; Node1 is the node that the domain name resolves to.

ray1888 commented 9 months ago

But I also tried APISIX as the proxy with the same configuration, and it did not show this issue.

johnlanni commented 9 months ago

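@ray1888 I looked at the capture and there are a lot of TCP retransmissions, starting as early as connection establishment. Does this problem really occur only on JS requests, with other requests unaffected? Below is the request Higress sends to the backend; try testing it with curl from inside the higress gateway Pod: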

GET /assets/login/js/chunk-ff542364.2fa1ed4b.js HTTP/1.1
host: log.gitee.work
accept-encoding: deflate, gzip
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
accept-language: zh-CN,zh;q=0.9,en;q=0.8
cache-control: no-cache
pragma: no-cache
purpose: prefetch
referer: http://log.gitee.work/login
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
x-forwarded-for: 10.244.25.64
x-forwarded-proto: http
x-envoy-internal: true
x-request-id: 99ef8214-31bc-40a4-8062-802aa1051eef
x-envoy-decorator-operation: gitee-one-front.gitee.svc.cluster.local:80/assets/*
x-envoy-expected-rq-timeout-ms: 3000
x-envoy-attempt-count: 1
x-b3-traceid: 8ee55d9ddd763fc3b66f99e52a34a6d2
x-b3-spanid: b66f99e52a34a6d2
x-b3-sampled: 0
req-start-time: 1705976499465
original-host: log.gitee.work

ray1888 commented 9 months ago

Yes, only JS files are affected. Everything else, including API forwarding, works fine.

johnlanni commented 9 months ago

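@ray1888 Higress does not detect specific responses and treat them specially. From the capture, there has been packet loss with retransmission, as well as out-of-order TCP packets, between Higress and the backend ever since connection establishment. This is most likely tied to your network environment (this JS response is relatively large, which may have triggered the network problem). You could try deploying Higress with kind on your local laptop and returning the same response as a test; the problem should not reproduce there.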

johnlanni commented 9 months ago

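[image] I took a look, and this is not normal retransmission. The SYN packet was retransmitted without waiting for the RTO (minimum 200 ms by default); it waited only 0.02 ms. The later data packets show the same pattern: they are not retransmitted after a long wait without an ACK. Instead, each packet is simply sent on the network twice.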
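The distinction drawn in this comment (a repeat that arrives far sooner than the minimum RTO is a duplicate on the wire, not a timeout-driven retransmission) can be expressed as a small check. This is only a sketch under stated assumptions: `classify_repeats` is a hypothetical helper, the packet list is assumed to have been extracted from the capture by hand or with another tool, and the 200 ms default is the minimum RTO figure quoted in the comment.

```python
from typing import Dict, Hashable, List, Tuple


def classify_repeats(
    packets: List[Tuple[float, Hashable]],
    min_rto_ms: float = 200.0,
) -> List[Tuple[Hashable, float, str]]:
    """Classify every repeated packet by the gap since its previous copy.

    packets: (timestamp_ms, key) pairs in capture order, where key
    identifies a segment (for example a (flags, sequence number) tuple).
    Returns (key, gap_ms, kind) for each repeat, where kind is
    "duplicate" if the gap is below min_rto_ms, else "rto_retransmit".
    """
    last_seen: Dict[Hashable, float] = {}
    repeats = []
    for timestamp_ms, key in packets:
        if key in last_seen:
            gap_ms = timestamp_ms - last_seen[key]
            kind = "duplicate" if gap_ms < min_rto_ms else "rto_retransmit"
            repeats.append((key, gap_ms, kind))
        last_seen[key] = timestamp_ms
    return repeats
```

With a SYN seen at 0.0 ms and again at 0.02 ms, the second copy is classified as a duplicate, matching the pattern described for this capture.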

johnlanni commented 9 months ago

Could you capture another trace, this time between the browser and Higress, for me to look at? This error is also strange: net::ERR_INCOMPLETE_CHUNKED_ENCODING

ray1888 commented 9 months ago

I'll give it a try first. Later I'll also deploy a kind cluster and try accessing the JS service through the NodePort.

ray1888 commented 9 months ago

Could you capture another trace, this time between the browser and Higress, for me to look at? This error is also strange: net::ERR_INCOMPLETE_CHUNKED_ENCODING

browser.zip

ray1888 commented 9 months ago

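[image] I took a look, and this is not normal retransmission. The SYN packet was retransmitted without waiting for the RTO (minimum 200 ms by default); it waited only 0.02 ms. The later data packets show the same pattern: they are not retransmitted after a long wait without an ACK. Instead, each packet is simply sent on the network twice.

I just took a look at this part: the RTO on the Calico side is very small. The Pod in the tcpdump output is addressed by its Calico Pod IP; I'm not sure whether that has any effect. [image]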

johnlanni commented 3 months ago

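Do you have the WAF plugin, or any other plugin that buffers the response body, enabled? A user recently hit a similar problem, and it was caused by the WAF plugin.

It can be resolved by raising the global parameter downstream.connectionBufferLimits.

The WAF plugin buffers both the request body and the response body. If a body is larger than the downstream.connectionBufferLimits value in the global configuration, the request or response will fail.

Setting downstream.connectionBufferLimits too high is not recommended either: when network transfer is slow, it can lead to excessive gateway memory usage.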
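To see which static assets could run into this limit when a body-buffering plugin is enabled, one option is to list the files that are larger than the configured value. The sketch below is an assumption-laden illustration: `oversized_files` is a hypothetical helper, it measures on-disk size (a gzip-compressed response may be smaller on the wire), and the limit passed in should be whatever downstream.connectionBufferLimits is set to in your own configuration.

```python
import os
from typing import List, Tuple


def oversized_files(root: str, buffer_limit_bytes: int) -> List[Tuple[str, int]]:
    """Return (relative_path, size_in_bytes) for files larger than the limit."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            size = os.path.getsize(path)
            if size > buffer_limit_bytes:
                hits.append((os.path.relpath(path, root), size))
    return sorted(hits)
```

Running it against the directory that Nginx serves, with the current limit as `buffer_limit_bytes`, gives a rough list of responses that a buffering plugin would not be able to hold in full.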