Closed by allenporter 3 years ago
Following the advice in https://jvns.ca/blog/2017/09/05/finding-out-where-packets-are-being-dropped/ I took a look at building and running dropwatch from https://github.com/nhorman/dropwatch
$ sudo ./dropwatch -l kas
Initializing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
46 drops at ip_rcv_finish_core.isra.0+1b2 (0xffffffff8c5601e2) [software]
47 drops at ip6_mc_input+1ed (0xffffffff8c5efa5d) [software]
1 drops at __udp4_lib_rcv+aef (0xffffffff8c59c64f) [software]
1 drops at __netif_receive_skb_core+14f (0xffffffff8c4e6f3f) [software]
11 drops at netlink_broadcast_filtered+257 (0xffffffff8c551937) [software]
57 drops at ip_rcv_finish_core.isra.0+1b2 (0xffffffff8c5601e2) [software]
57 drops at ip6_mc_input+1ed (0xffffffff8c5efa5d) [software]
2 drops at __udp4_lib_rcv+aef (0xffffffff8c59c64f) [software]
2 drops at skb_release_data+b4 (0xffffffff8c4cea44) [software]
63 drops at ip6_mc_input+1ed (0xffffffff8c5efa5d) [software]
62 drops at ip_rcv_finish_core.isra.0+1b2 (0xffffffff8c5601e2) [software]
1 drops at __netif_receive_skb_core+14f (0xffffffff8c4e6f3f) [software]
1 drops at __udp4_lib_rcv+aef (0xffffffff8c59c64f) [software]
53 drops at ip_rcv_finish_core.isra.0+1b2 (0xffffffff8c5601e2) [software]
51 drops at ip6_mc_input+1ed (0xffffffff8c5efa5d) [software]
2 drops at sk_stream_kill_queues+55 (0xffffffff8c4d5635) [software]
1 drops at sk_stream_kill_queues+55 (0xffffffff8c4d5635) [software]
55 drops at ip6_mc_input+1ed (0xffffffff8c5efa5d) [software]
56 drops at ip_rcv_finish_core.isra.0+1b2 (0xffffffff8c5601e2) [software]
1 drops at __netif_receive_skb_core+14f (0xffffffff8c4e6f3f) [software]
53 drops at ip_rcv_finish_core.isra.0+1b2 (0xffffffff8c5601e2) [software]
49 drops at ip6_mc_input+1ed (0xffffffff8c5efa5d) [software]
1 drops at __udp4_lib_rcv+aef (0xffffffff8c59c64f) [software]
51 drops at ip6_mc_input+1ed (0xffffffff8c5efa5d) [software]
53 drops at ip_rcv_finish_core.isra.0+1b2 (0xffffffff8c5601e2) [software]
1 drops at __netif_receive_skb_core+14f (0xffffffff8c4e6f3f) [software]
55 drops at ip_rcv_finish_core.isra.0+1b2 (0xffffffff8c5601e2) [software]
53 drops at ip6_mc_input+1ed (0xffffffff8c5efa5d) [software]
1 drops at __udp4_lib_rcv+aef (0xffffffff8c59c64f) [software]
1 drops at __netif_receive_skb_core+14f (0xffffffff8c4e6f3f) [software]
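Since the same few symbols repeat in every interval, it can help to total the drops per symbol rather than eyeball the log. Here is a minimal sketch in Python, assuming the dropwatch output has been captured as text lines in the format shown above. The function name total_drops_by_symbol and the SAMPLE list are made up for illustration and are not part of dropwatch.

```python
import re
from collections import Counter

# Matches dropwatch lines such as:
#   46 drops at ip_rcv_finish_core.isra.0+1b2 (0xffffffff8c5601e2) [software]
# Group 1 is the drop count, group 2 is the symbol without its +offset.
DROP_LINE = re.compile(r"^\s*(\d+) drops at (\S+)\+[0-9a-f]+ \(0x[0-9a-f]+\)")


def total_drops_by_symbol(lines):
    """Sum the drop counts per kernel symbol, ignoring non-matching lines."""
    totals = Counter()
    for line in lines:
        match = DROP_LINE.match(line)
        if match:
            count, symbol = match.groups()
            totals[symbol] += int(count)
    return totals


# A few lines copied from the output above (hypothetical variable name).
SAMPLE = [
    "Kernel monitoring activated.",
    "46 drops at ip_rcv_finish_core.isra.0+1b2 (0xffffffff8c5601e2) [software]",
    "47 drops at ip6_mc_input+1ed (0xffffffff8c5efa5d) [software]",
    "1 drops at __udp4_lib_rcv+aef (0xffffffff8c59c64f) [software]",
    "11 drops at netlink_broadcast_filtered+257 (0xffffffff8c551937) [software]",
    "57 drops at ip_rcv_finish_core.isra.0+1b2 (0xffffffff8c5601e2) [software]",
]

# Print symbols from most to least drops.
for symbol, count in total_drops_by_symbol(SAMPLE).most_common():
    print(f"{count:6d}  {symbol}")
```

On the full log this would make it obvious that ip_rcv_finish_core and ip6_mc_input account for nearly all of the drops.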
Looking at ip_rcv_finish_core (https://github.com/torvalds/linux/blob/master/net/ipv4/ip_input.c#L315), there are 10 places in that function where drops can happen.
The drops appear to happen about once per second, going by the interface drop counter:
$ watch --difference --interval 0.5 "ifconfig eth0 | grep drop"
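To put numbers on the cadence instead of watching the counter flicker, the same counter can be polled and diffed. This is a rough sketch, assuming the Linux sysfs counter /sys/class/net/eth0/statistics/rx_dropped reflects the RX drops that ifconfig reports. The function names (read_rx_dropped, drop_deltas, watch_drops) are hypothetical helpers, not an existing tool.

```python
import time
from pathlib import Path


def read_rx_dropped(interface, sysfs_root="/sys/class/net"):
    """Read the cumulative RX drop counter for an interface from sysfs."""
    path = Path(sysfs_root) / interface / "statistics" / "rx_dropped"
    return int(path.read_text().strip())


def drop_deltas(samples):
    """Return the increase between each pair of consecutive counter samples."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def watch_drops(interface, count=20, interval=0.5):
    """Poll the counter `count` times and print each non-zero increase."""
    previous = read_rx_dropped(interface)
    for _ in range(count):
        time.sleep(interval)
        current = read_rx_dropped(interface)
        if current != previous:
            stamp = time.strftime("%H:%M:%S")
            print(f"{stamp} +{current - previous} drops (total {current})")
        previous = current
```

Calling watch_drops("eth0") prints a timestamped line each time the counter moves, which makes it easier to line the drops up against timestamps from other tools.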
When running tcpdump, the drops stop! It appears, though, that there is a Spanning Tree Protocol packet once per second that corresponds roughly with the drops:
14:17:22.493978 STP 802.1s, Rapid STP, CIST Flags [Proposal, Learn, Forward], length 102
The symptoms sound similar to this: https://forum.proxmox.com/threads/vm-multicast-vrrp-packets-drop.57407/
The default Ceph alerts identified that many of the Proxmox hosts are dropping packets.