OpenCloudOS / nettrace

nettrace is a eBPF-based tool to trace network packet and diagnose network problem.
Other
326 stars 80 forks source link

跟踪报文延迟时出现连续非常大数值的延迟值(大于1个小时) #109

Closed weiqian closed 3 weeks ago

weiqian commented 2 months ago

Hi,你好,使用nettrace跟踪报文延迟时,出现许多latency: 4294966.302ms 这种非常大的值的延迟时间,报文延迟1个多小时不符合实际情况,麻烦看一下是什么问题,多谢!

nettrace版本信息: [root@cn1-instance-centos7u9-1 nettrace]# ./nettrace --version version: 1.2.10.tl3, inline, no-btf, kernel-3.10.0, no-global-data

发行版本信息: [root@cn1-instance-centos7u9-1 nettrace]# lsb_release -a LSB Version: :core-4.1-amd64:core-4.1-noarch Distributor ID: CentOS Description: CentOS Linux release 7.9.2009 (Core) Release: 7.9.2009 Codename: Core

内核版本信息: 3.10.0-1160.el7.x86_64

问题现象:出现连续非常大数值的延迟值 [root@cn1-instance-centos7u9-1 nettrace]# ./nettrace --latency -t netif_receive_skb_core,tcp_queue_rcv -p tcp -s 192.168.0.16 -d 192.168.0.11 -S 6379 --min-latency 1000 --date begin trace... [2024-8-26 20:12:22.277876] [__kfree_skb ][netif_receive_skb_core -> tcp_queue_rcv] TCP: 192.168.0.16:6379 -> 192.168.0.11:59520 seq:2216426147, ack:945579414, flags:AP latency: 1.017ms [2024-8-26 20:17:28.176228] [kfree_skb ][__netif_receive_skb_core -> tcp_queue_rcv] TCP: 192.168.0.16:6379 -> 192.168.0.11:59500 seq:2465430620, ack:1720638798, flags:AP latency: 1.002ms [2024-8-26 20:19:14.509039] [kfree_skb ][netif_receive_skb_core -> tcp_queue_rcv] TCP: 192.168.0.16:6379 -> 192.168.0.11:59502 seq:3175547634, ack:699419840, flags:AP latency: 4294966.302ms [2024-8-26 20:20:09.218026] [__kfree_skb ][netif_receive_skb_core -> tcp_queue_rcv] TCP: 192.168.0.16:6379 -> 192.168.0.11:59526 seq:2589816680, ack:1291703648, flags:AP latency: 4294966.302ms [2024-8-26 20:22:46.139043] [kfree_skb ][__netif_receive_skb_core -> tcp_queue_rcv] TCP: 192.168.0.16:6379 -> 192.168.0.11:59500 seq:2465667633, ack:1721112824, flags:AP latency: 4294966.311ms [2024-8-26 20:25:06.013049] [kfree_skb ][netif_receive_skb_core -> tcp_queue_rcv] TCP: 192.168.0.16:6379 -> 192.168.0.11:59500 seq:2465771324, ack:1721320206, flags:AP latency: 4294966.305ms [2024-8-26 20:25:22.192034] [__kfree_skb ][netif_receive_skb_core -> tcp_queue_rcv] TCP: 192.168.0.16:6379 -> 192.168.0.11:59512 seq:2265951563, ack:1262726542, flags:AP latency: 4294966.301ms

menglongdong commented 2 months ago

这里应该是没有最终到报文的释放事件导致的。由于内核里有可能会对skb释放事件进行内敛,所以是存在这种错乱的情况的。我打算后面搞一个可以指定skb终止函数的逻辑,来应对这个问题。

menglongdong commented 3 weeks ago

按照最新的文档3.6.1章节中的方式来跟踪网络延迟,应该是没问题的