DeskPi-Team / super6c

Super6c stands for Super 6 CM4 Cluster.
MIT License

On-board network fails completely ~once a week #22

Closed: rssc closed this issue 2 weeks ago

rssc commented 1 year ago

I am not sure where else to report this, so posting it here.

I run the Super6C with 5 CM4 modules installed, running Raspbian, and I have an issue where roughly once a week the on-board network seems to completely stop working, with all the CM4 modules being unable to communicate with the network (either among themselves or with the outside). The only remedy in that case is to power off the board and power it on again.

This happened again this morning. At around 00:45 the CM4s started losing connectivity to one another, while pings from outside still partly worked for some of the modules for another 15 minutes or so. After that there was no connectivity at all, either between the CM4s or from outside to any of them.

One of the CM4s printed this error message at the time:

Mar 18 01:01:55 node1 kernel: [495170.760623] ------------[ cut here ]------------
Mar 18 01:01:55 node1 kernel: [495170.760677] NETDEV WATCHDOG: eth0 (bcmgenet): transmit queue 4 timed out
Mar 18 01:01:55 node1 kernel: [495170.760759] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:478 dev_watchdog+0x398/0x3a0
Mar 18 01:01:55 node1 kernel: [495170.760798] Modules linked in: xt_REDIRECT ip_vs_rr xt_ipvs xt_state ip_vs xt_nat veth vxlan ip6_udp_tunnel udp_tunnel xt_policy xt_mark xt_u32 xt_tcpudp cn xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge cbc aes_arm64 aes_generic libaes ceph libceph overlay cfg80211 rfkill 8021q garp stp llc vc4 snd_soc_hdmi_codec cec v3d drm_kms_helper raspberrypi_hwmon gpu_sched i2c_brcmstb rpivid_hevc(C) bcm2835_isp(C) snd_soc_core bcm2835_v4l2(C) bcm2835_codec(C) videobuf2_vmalloc snd_compress bcm2835_mmal_vchiq(C) snd_pcm_dmaengine v4l2_mem2mem videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 snd_bcm2835(C) videobuf2_common snd_pcm videodev vc_sm_cma(C) snd_timer mc snd syscopyarea sysfillrect sysimgblt uio_pdrv_genirq fb_sys_fops uio nvmem_rmem drm fuse drm_panel_orientation_quirks backlight ip_tables x_tables ipv6
Mar 18 01:01:55 node1 kernel: [495170.761263] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G         C        5.15.76-v8+ #1597
Mar 18 01:01:55 node1 kernel: [495170.761277] Hardware name: Raspberry Pi Compute Module 4 Rev 1.1 (DT)
Mar 18 01:01:55 node1 kernel: [495170.761286] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Mar 18 01:01:55 node1 kernel: [495170.761300] pc : dev_watchdog+0x398/0x3a0
Mar 18 01:01:55 node1 kernel: [495170.761316] lr : dev_watchdog+0x398/0x3a0
Mar 18 01:01:55 node1 kernel: [495170.761330] sp : ffffffc00800bd10
Mar 18 01:01:55 node1 kernel: [495170.761337] x29: ffffffc00800bd10 x28: ffffff80414a8580 x27: 0000000000000004
Mar 18 01:01:55 node1 kernel: [495170.761362] x26: 0000000000000140 x25: 00000000ffffffff x24: 0000000000000001
Mar 18 01:01:55 node1 kernel: [495170.761385] x23: ffffffe88bd36000 x22: ffffff80414a03dc x21: ffffff80414a0000
Mar 18 01:01:55 node1 kernel: [495170.761406] x20: ffffff80414a0480 x19: 0000000000000004 x18: 0000000000000000
Mar 18 01:01:55 node1 kernel: [495170.761426] x17: ffffff97f4126000 x16: ffffffc00800c000 x15: ffffffffffffffff
Mar 18 01:01:55 node1 kernel: [495170.761447] x14: ffffffe88b89b8a8 x13: 74756f2064656d69 x12: ffffffe88bdc6660
Mar 18 01:01:55 node1 kernel: [495170.761468] x11: 0000000000000003 x10: ffffffe88bdae620 x9 : ffffffe88aaee89c
Mar 18 01:01:55 node1 kernel: [495170.761488] x8 : 0000000000017fe8 x7 : 0000000000000003 x6 : 0000000000000000
Mar 18 01:01:55 node1 kernel: [495170.761508] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000103
Mar 18 01:01:55 node1 kernel: [495170.761528] x2 : 0000000000000102 x1 : b6e272a0ea7b5b00 x0 : 0000000000000000
Mar 18 01:01:55 node1 kernel: [495170.761549] Call trace:
Mar 18 01:01:55 node1 kernel: [495170.761556]  dev_watchdog+0x398/0x3a0
Mar 18 01:01:55 node1 kernel: [495170.761572]  call_timer_fn+0x38/0x1d8
Mar 18 01:01:55 node1 kernel: [495170.761588]  run_timer_softirq+0x284/0x520
Mar 18 01:01:55 node1 kernel: [495170.761600]  __do_softirq+0x1a8/0x4ec
Mar 18 01:01:55 node1 kernel: [495170.761611]  irq_exit+0x110/0x150
Mar 18 01:01:55 node1 kernel: [495170.761626]  handle_domain_irq+0x9c/0xe0
Mar 18 01:01:55 node1 kernel: [495170.761642]  gic_handle_irq+0xac/0xe8
Mar 18 01:01:55 node1 kernel: [495170.761652]  call_on_irq_stack+0x28/0x54
Mar 18 01:01:55 node1 kernel: [495170.761664]  do_interrupt_handler+0x60/0x70
Mar 18 01:01:55 node1 kernel: [495170.761676]  el1_interrupt+0x30/0x78
Mar 18 01:01:55 node1 kernel: [495170.761688]  el1h_64_irq_handler+0x18/0x28
Mar 18 01:01:55 node1 kernel: [495170.761699]  el1h_64_irq+0x78/0x7c
Mar 18 01:01:55 node1 kernel: [495170.761708]  arch_cpu_idle+0x18/0x28
Mar 18 01:01:55 node1 kernel: [495170.761719]  default_idle_call+0x54/0x19c
Mar 18 01:01:55 node1 kernel: [495170.761737]  do_idle+0x254/0x268
Mar 18 01:01:55 node1 kernel: [495170.761750]  cpu_startup_entry+0x2c/0x80
Mar 18 01:01:55 node1 kernel: [495170.761762]  secondary_start_kernel+0x154/0x168
Mar 18 01:01:55 node1 kernel: [495170.761776]  __secondary_switched+0x90/0x94
Mar 18 01:01:55 node1 kernel: [495170.761789] ---[ end trace 44776d4474b5937d ]---

In other cases, the same watchdog message was printed, and after that the following messages appeared:

Mar 12 02:19:24 node2 kernel: [64815.604883] bcmgenet fd580000.ethernet eth0: bcmgenet_xmit: tx ring 0 full when queue 1 awake
Mar 12 02:19:24 node2 kernel: [64815.604955] bcmgenet fd580000.ethernet eth0: bcmgenet_xmit: tx ring 1 full when queue 2 awake
Mar 12 02:19:24 node2 kernel: [64815.604984] bcmgenet fd580000.ethernet eth0: bcmgenet_xmit: tx ring 2 full when queue 3 awake
Mar 12 02:19:24 node2 kernel: [64815.605008] bcmgenet fd580000.ethernet eth0: bcmgenet_xmit: tx ring 3 full when queue 4 awake

Mar 12 02:19:26 node2 kernel: [64817.588950] bcmgenet fd580000.ethernet eth0: bcmgenet_xmit: tx ring 0 full when queue 1 awake
Mar 12 02:19:26 node2 kernel: [64817.589000] bcmgenet fd580000.ethernet eth0: bcmgenet_xmit: tx ring 1 full when queue 2 awake
Mar 12 02:19:26 node2 kernel: [64817.589026] bcmgenet fd580000.ethernet eth0: bcmgenet_xmit: tx ring 2 full when queue 3 awake
Mar 12 02:19:26 node2 kernel: [64817.589050] bcmgenet fd580000.ethernet eth0: bcmgenet_xmit: tx ring 3 full when queue 4 awake

Mar 12 02:19:28 node2 kernel: [64819.604946] bcmgenet fd580000.ethernet eth0: bcmgenet_xmit: tx ring 0 full when queue 1 awake
Mar 12 02:19:28 node2 kernel: [64819.604995] bcmgenet fd580000.ethernet eth0: bcmgenet_xmit: tx ring 1 full when queue 2 awake
Mar 12 02:19:28 node2 kernel: [64819.605022] bcmgenet fd580000.ethernet eth0: bcmgenet_xmit: tx ring 2 full when queue 3 awake
Mar 12 02:19:28 node2 kernel: [64819.605045] bcmgenet fd580000.ethernet eth0: bcmgenet_xmit: tx ring 3 full when queue 4 awake
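
To compare when each node logs these messages, the kernel logs can be scanned for the two patterns quoted above. This is only a minimal sketch: the function name `summarize` is made up for illustration, and it assumes the log lines use the same syslog-style format as the excerpts in this issue (for example from a file such as /var/log/kern.log, which may differ on other setups).

```python
import re
from collections import defaultdict

# Message patterns copied from the kernel lines quoted in this issue.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (\S+) \((\S+)\): transmit queue (\d+) timed out"
)
RING_FULL_RE = re.compile(
    r"bcmgenet_xmit: tx ring (\d+) full when queue (\d+) awake"
)
# Assumed syslog prefix: "Mar 18 01:01:55 node1 kernel: [495170.760677] <msg>"
PREFIX_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) kernel: \[\s*([\d.]+)\] (.*)$"
)


def summarize(lines):
    """Group watchdog and ring-full timestamps by host name.

    Returns {host: {"watchdog": [stamp, ...], "ring_full": [stamp, ...]}}.
    Lines that do not match the assumed prefix are skipped.
    """
    events = defaultdict(lambda: {"watchdog": [], "ring_full": []})
    for line in lines:
        match = PREFIX_RE.match(line.strip())
        if not match:
            continue
        stamp, host, _uptime, msg = match.groups()
        if WATCHDOG_RE.search(msg):
            events[host]["watchdog"].append(stamp)
        elif RING_FULL_RE.search(msg):
            events[host]["ring_full"].append(stamp)
    return dict(events)
```

Feeding the combined logs of all nodes into `summarize` gives one list of timestamps per node, which makes it easier to see whether the timeouts start on every CM4 at about the same moment.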

I have tried the following so far:

Since all CM4s lose network connectivity at the same time despite having different uptimes, I suspect the issue is with the network (network chip?) on the board itself.

I have regular Raspberry Pi 4s that, as I understand it, have the exact same Ethernet interface, and I am not seeing this kind of issue with them. So I suspect the problem is either in the Super6C board itself or in the interplay between the RPi4 Ethernet and the Ethernet chip on the board (both of which seem like they would be hard to fix).

FWIW, the external Ethernet interface is connected to a Unifi switch, although I don't think that should matter (and I've had it connected to two different generations of Unifi switches with the same issue occurring).

Has anybody else seen this happen? If so, are there any fixes? Or does this indicate a hardware issue with the board, in which case I should try to get it replaced?

Any help would be greatly appreciated!

Thanks!

yoyojacky commented 5 months ago

Did you turn off the Wi-Fi adapters on all of the CM4s?
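
One way to check this on each node is to read the rfkill entries that Linux exposes under /sys/class/rfkill. The sketch below is only illustrative: the function name `wlan_rfkill_states` is made up, and it assumes the Wi-Fi adapter registers an rfkill device of type `wlan`. A CM4 without a wireless module, or with Wi-Fi disabled at the firmware level, may show no entry at all.

```python
from pathlib import Path


def wlan_rfkill_states(root="/sys/class/rfkill"):
    """Return {device_name: soft_blocked} for every rfkill device of type 'wlan'.

    soft_blocked is True when the adapter is software-blocked (turned off).
    An empty dict means no wlan rfkill device was found under `root`.
    """
    states = {}
    for dev in sorted(Path(root).glob("rfkill*")):
        try:
            if (dev / "type").read_text().strip() != "wlan":
                continue  # skip bluetooth and other radio types
            name = (dev / "name").read_text().strip()
            states[name] = (dev / "soft").read_text().strip() == "1"
        except OSError:
            # Entry vanished or is unreadable; ignore it in this sketch.
            continue
    return states
```

Calling `wlan_rfkill_states()` on a node returns an empty dict or only `True` values when no Wi-Fi radio is active there.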