AltraMayor / gatekeeper

The first open-source DDoS protection system
https://github.com/AltraMayor/gatekeeper/wiki
GNU General Public License v3.0

1.2 dev has the following exception after running for two months. There is a CPU timer exception #703

Open ShawnLeung87 opened 2 weeks ago

ShawnLeung87 commented 2 weeks ago

The 1.2 dev build raised the following exception, a CPU timer warning in the kernel, after running for two months. It appears on both Gatekeeper and Grantor servers, but not on all machines. So far, one Gatekeeper server and one Grantor server are affected. The errors are as follows.

gatekeeper Exception Error:

[Wed Aug 28 03:30:26 2024] ------------[ cut here ]------------
[Wed Aug 28 03:30:26 2024] WARNING: CPU: 56 PID: 0 at kernel/time/timer.c:1729 run_timers.part.0+0xc7/0x250
[Wed Aug 28 03:30:26 2024] Modules linked in: vhost_net vhost vhost_iotlb tap joydev input_leds hid_generic usbhid hid nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp binfmt_misc kvm_intel dell_smbios dcdbas kvm dell_wmi_descriptor wmi_bmof rapl intel_cstate intel_pch_thermal mei_me mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel ramoops uio_pci_generic uio reed_solomon efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rc_core aesni_intel crypto_simd cryptd glue_helper i40e(OE) i2c_i801 megaraid_sas drm tg3 ahci xhci_pci i2c_smbus lpc_ich libahci
[Wed Aug 28 03:30:26 2024] xhci_pci_renesas wmi
[Wed Aug 28 03:30:26 2024] CPU: 56 PID: 0 Comm: swapper/56 Tainted: G IOE 5.11.0-46-generic #51~20.04.1-Ubuntu
[Wed Aug 28 03:30:26 2024] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.11.2 004/21/2021
[Wed Aug 28 03:30:26 2024] RIP: 0010:run_timers.part.0+0xc7/0x250
[Wed Aug 28 03:30:26 2024] Code: 89 ce a8 07 75 0f 83 c2 40 48 c1 e8 03 81 fa 40 02 00 00 75 b8 49 8b 44 24 10 48 83 c0 01 85 db 75 64 41 80 7c 24 24 00 75 02 <0f> 0b 49 89 44 24 10 4c 89 e7 e8 1a fb ff ff 49 89 44 24 18 49 8b
[Wed Aug 28 03:30:26 2024] RSP: 0018:ffffb0290d778ef0 EFLAGS: 00010046
[Wed Aug 28 03:30:26 2024] RAX: 00000001400084f9 RBX: 0000000000000000 RCX: 000000000000005f
[Wed Aug 28 03:30:26 2024] RDX: 0000000000000040 RSI: ffffb0290d778ef8 RDI: ffff9f12ffd1df80
[Wed Aug 28 03:30:26 2024] RBP: ffffb0290d778f70 R08: ffff9f12ffd1dfa8 R09: 0000000000000039
[Wed Aug 28 03:30:26 2024] R10: 0000000000000000 R11: 001e8489b09386a8 R12: ffff9f12ffd1df80
[Wed Aug 28 03:30:26 2024] R13: 0000000000000001 R14: 0000000000000001 R15: ffffffffab8060c8
[Wed Aug 28 03:30:26 2024] FS: 0000000000000000(0000) GS:ffff9f12ffd00000(0000) knlGS:0000000000000000
[Wed Aug 28 03:30:26 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Aug 28 03:30:26 2024] CR2: 00007fa3f0b52f90 CR3: 0000007931c10002 CR4: 00000000007706e0
[Wed Aug 28 03:30:26 2024] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Wed Aug 28 03:30:26 2024] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Wed Aug 28 03:30:26 2024] PKRU: 55555554
[Wed Aug 28 03:30:26 2024] Call Trace:
[Wed Aug 28 03:30:26 2024]
[Wed Aug 28 03:30:26 2024] ? lapic_next_deadline+0x2c/0x40
[Wed Aug 28 03:30:26 2024] ? clockevents_program_event+0x8f/0xe0
[Wed Aug 28 03:30:26 2024] run_timer_softirq+0x4b/0x50
[Wed Aug 28 03:30:26 2024] __do_softirq+0xe0/0x29b
[Wed Aug 28 03:30:26 2024] asm_call_irq_on_stack+0xf/0x20
[Wed Aug 28 03:30:26 2024]
[Wed Aug 28 03:30:26 2024] do_softirq_own_stack+0x3d/0x50
[Wed Aug 28 03:30:26 2024] irq_exit_rcu+0xa4/0xb0
[Wed Aug 28 03:30:26 2024] sysvec_apic_timer_interrupt+0x3d/0x90
[Wed Aug 28 03:30:26 2024] asm_sysvec_apic_timer_interrupt+0x12/0x20
[Wed Aug 28 03:30:26 2024] RIP: 0010:cpuidle_enter_state+0xdf/0x350
[Wed Aug 28 03:30:26 2024] Code: ff e8 a5 8a 77 ff 80 7d d7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 52 02 00 00 31 ff e8 a8 37 7e ff fb 66 0f 1f 44 00 00 <45> 85 ff 0f 88 e5 00 00 00 49 63 c7 4c 2b 75 c8 48 8d 14 40 48 8d
[Wed Aug 28 03:30:26 2024] RSP: 0018:ffffb0290cac7e48 EFLAGS: 00000246
[Wed Aug 28 03:30:26 2024] RAX: ffff9f12ffd2c380 RBX: 0000000000000001 RCX: 000000000000001f
[Wed Aug 28 03:30:26 2024] RDX: 0000000000000000 RSI: 0000000035756a0a RDI: 0000000000000000
[Wed Aug 28 03:30:26 2024] RBP: ffffb0290cac7e80 R08: 000f42b017e2b6a8 R09: 000f42a667e5210b
[Wed Aug 28 03:30:26 2024] R10: ffff9f12ffd2afa4 R11: ffff9f12ffd2af84 R12: ffffcfe8ffd06b70
[Wed Aug 28 03:30:26 2024] R13: ffffffffaba4d8e0 R14: 000f42b017e2b6a8 R15: 0000000000000001
[Wed Aug 28 03:30:26 2024] ? cpuidle_enter_state+0xbb/0x350
[Wed Aug 28 03:30:26 2024] cpuidle_enter+0x2e/0x40
[Wed Aug 28 03:30:26 2024] call_cpuidle+0x23/0x40
[Wed Aug 28 03:30:26 2024] do_idle+0x1df/0x260
[Wed Aug 28 03:30:26 2024] cpu_startup_entry+0x20/0x30
[Wed Aug 28 03:30:26 2024] start_secondary+0x11f/0x160
[Wed Aug 28 03:30:26 2024] secondary_startup_64_no_verify+0xc2/0xcb
[Wed Aug 28 03:30:26 2024] ---[ end trace 4fcbce5fa20b53b7 ]---

grantor Exception Error:

[Wed Aug 28 03:21:58 2024] ------------[ cut here ]------------
[Wed Aug 28 03:21:58 2024] WARNING: CPU: 15 PID: 0 at kernel/time/timer.c:1729 run_timers.part.0+0xc7/0x250
[Wed Aug 28 03:21:58 2024] Modules linked in: vhost_net vhost vhost_iotlb tap nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp binfmt_misc coretemp ipmi_ssif kvm_intel kvm dell_smbios dcdbas rapl wmi_bmof dell_wmi_descriptor intel_cstate intel_pch_thermal mei_me mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel msr ramoops reed_solomon efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 i2c_algo_bit drm_kms_helper crct10dif_pclmul syscopyarea sysfillrect crc32_pclmul sysimgblt ghash_clmulni_intel aesni_intel fb_sys_fops crypto_simd cec cryptd i40e(OE) megaraid_sas glue_helper rc_core ahci lpc_ich drm xhci_pci i2c_i801 tg3 xhci_pci_renesas i2c_smbus libahci wmi
[Wed Aug 28 03:21:58 2024] CPU: 15 PID: 0 Comm: swapper/15 Tainted: G IOE 5.11.0-46-generic #51~20.04.1-Ubuntu
[Wed Aug 28 03:21:58 2024] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.11.2 004/21/2021
[Wed Aug 28 03:21:58 2024] RIP: 0010:run_timers.part.0+0xc7/0x250
[Wed Aug 28 03:21:58 2024] Code: 89 ce a8 07 75 0f 83 c2 40 48 c1 e8 03 81 fa 40 02 00 00 75 b8 49 8b 44 24 10 48 83 c0 01 85 db 75 64 41 80 7c 24 24 00 75 02 <0f> 0b 49 89 44 24 10 4c 89 e7 e8 1a fb ff ff 49 89 44 24 18 49 8b
[Wed Aug 28 03:21:58 2024] RSP: 0000:ffffaa644ce24ef0 EFLAGS: 00010046
[Wed Aug 28 03:21:58 2024] RAX: 000000013ffee379 RBX: 0000000000000000 RCX: 000000000000006f
[Wed Aug 28 03:21:58 2024] RDX: 0000000000000040 RSI: ffffaa644ce24ef8 RDI: ffff8d2fff7ddf80
[Wed Aug 28 03:21:58 2024] RBP: ffffaa644ce24f70 R08: ffff8d2fff7ddfa8 R09: 0000000000000010
[Wed Aug 28 03:21:58 2024] R10: 0000000000000000 R11: 001c9c591f575c54 R12: ffff8d2fff7ddf80
[Wed Aug 28 03:21:58 2024] R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff856060c8
[Wed Aug 28 03:21:58 2024] FS: 0000000000000000(0000) GS:ffff8d2fff7c0000(0000) knlGS:0000000000000000
[Wed Aug 28 03:21:58 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Aug 28 03:21:58 2024] CR2: 0000000000000000 CR3: 0000007cdec10001 CR4: 00000000007706e0
[Wed Aug 28 03:21:58 2024] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Wed Aug 28 03:21:58 2024] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Wed Aug 28 03:21:58 2024] PKRU: 55555554
[Wed Aug 28 03:21:58 2024] Call Trace:
[Wed Aug 28 03:21:58 2024]
[Wed Aug 28 03:21:58 2024] ? lapic_next_deadline+0x2c/0x40
[Wed Aug 28 03:21:58 2024] ? clockevents_program_event+0x8f/0xe0
[Wed Aug 28 03:21:58 2024] run_timer_softirq+0x4b/0x50
[Wed Aug 28 03:21:58 2024] __do_softirq+0xe0/0x29b
[Wed Aug 28 03:21:58 2024] asm_call_irq_on_stack+0xf/0x20
[Wed Aug 28 03:21:58 2024]
[Wed Aug 28 03:21:58 2024] do_softirq_own_stack+0x3d/0x50
[Wed Aug 28 03:21:58 2024] irq_exit_rcu+0xa4/0xb0
[Wed Aug 28 03:21:58 2024] sysvec_apic_timer_interrupt+0x3d/0x90
[Wed Aug 28 03:21:58 2024] asm_sysvec_apic_timer_interrupt+0x12/0x20
[Wed Aug 28 03:21:58 2024] RIP: 0010:cpuidle_enter_state+0xdf/0x350
[Wed Aug 28 03:21:58 2024] Code: ff e8 a5 8a 77 ff 80 7d d7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 52 02 00 00 31 ff e8 a8 37 7e ff fb 66 0f 1f 44 00 00 <45> 85 ff 0f 88 e5 00 00 00 49 63 c7 4c 2b 75 c8 48 8d 14 40 48 8d
[Wed Aug 28 03:21:58 2024] RSP: 0000:ffffaa644c987e48 EFLAGS: 00000246
[Wed Aug 28 03:21:58 2024] RAX: ffff8d2fff7ec380 RBX: 0000000000000003 RCX: 000000000000001f
[Wed Aug 28 03:21:58 2024] RDX: 0000000000000000 RSI: 000000002c3ddb16 RDI: 0000000000000000
[Wed Aug 28 03:21:58 2024] RBP: ffffaa644c987e80 R08: 000f424e68e37934 R09: 000000000001f200
[Wed Aug 28 03:21:58 2024] R10: 0000000000000001 R11: ffff8d2fff7eaf84 R12: ffffca643f7c5618
[Wed Aug 28 03:21:58 2024] R13: ffffffff8584d8e0 R14: 000f424e68e37934 R15: 0000000000000003
[Wed Aug 28 03:21:58 2024] ? cpuidle_enter_state+0xbb/0x350
[Wed Aug 28 03:21:58 2024] cpuidle_enter+0x2e/0x40
[Wed Aug 28 03:21:58 2024] call_cpuidle+0x23/0x40
[Wed Aug 28 03:21:58 2024] do_idle+0x1df/0x260
[Wed Aug 28 03:21:58 2024] cpu_startup_entry+0x20/0x30
[Wed Aug 28 03:21:58 2024] start_secondary+0x11f/0x160
[Wed Aug 28 03:21:58 2024] secondary_startup_64_no_verify+0xc2/0xcb
[Wed Aug 28 03:21:58 2024] ---[ end trace 139d1bee5c1cbe2d ]---

AltraMayor commented 2 weeks ago

Hi @ShawnLeung87,

Those two kernel crashes are likely related since they happened less than 10 minutes apart and have the same call trace. However, the information shows no relationship with DPDK or Gatekeeper. Was a core dump for Gatekeeper's process produced on any machine?
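
The two observations in this comment (the short time gap and the matching call trace) can be checked mechanically. What follows is only a minimal Python sketch, not part of Gatekeeper. The helper names `warning_time` and `trace_symbols` are hypothetical, and the two strings are abbreviated excerpts copied from the logs above rather than the full traces.

```python
import re
from datetime import datetime

# Abbreviated excerpts copied from the two logs in this issue.
GATEKEEPER_LOG = (
    "[Wed Aug 28 03:30:26 2024] WARNING: CPU: 56 PID: 0 at "
    "kernel/time/timer.c:1729 run_timers.part.0+0xc7/0x250 "
    "[Wed Aug 28 03:30:26 2024] run_timer_softirq+0x4b/0x50 "
    "[Wed Aug 28 03:30:26 2024] __do_softirq+0xe0/0x29b "
    "[Wed Aug 28 03:30:26 2024] asm_call_irq_on_stack+0xf/0x20"
)
GRANTOR_LOG = (
    "[Wed Aug 28 03:21:58 2024] WARNING: CPU: 15 PID: 0 at "
    "kernel/time/timer.c:1729 run_timers.part.0+0xc7/0x250 "
    "[Wed Aug 28 03:21:58 2024] run_timer_softirq+0x4b/0x50 "
    "[Wed Aug 28 03:21:58 2024] __do_softirq+0xe0/0x29b "
    "[Wed Aug 28 03:21:58 2024] asm_call_irq_on_stack+0xf/0x20"
)

# Human-readable timestamp in brackets, e.g. "[Wed Aug 28 03:30:26 2024]".
STAMP = re.compile(r"\[(\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4})\]")
# Kernel symbol followed by "+offset/size", e.g. "__do_softirq+0xe0/0x29b".
SYMBOL = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def warning_time(log):
    """Return the timestamp of the first entry in the log text."""
    match = STAMP.search(log)
    return datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")


def trace_symbols(log):
    """Return the symbol names in the order they appear in the log text."""
    return SYMBOL.findall(log)


gap = abs(warning_time(GATEKEEPER_LOG) - warning_time(GRANTOR_LOG))
same_trace = trace_symbols(GATEKEEPER_LOG) == trace_symbols(GRANTOR_LOG)

print(f"gap between warnings: {gap}")
print(f"same symbols in both excerpts: {same_trace}")
```

On these excerpts the gap is 8 minutes 28 seconds and the symbol lists are identical, which is consistent with the two warnings sharing a cause. Comparing only symbol names ignores the CPU number and register contents, which are expected to differ between machines.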

ShawnLeung87 commented 2 weeks ago

No core dump has been found so far. When this exception occurs, some lcores of Gatekeeper and Grantor can no longer process their data, so the data is discarded outright. Restarting the service does not always solve the problem; the only complete fix is to reboot the machine.

AltraMayor commented 2 weeks ago

The symptoms you've described suggest you are dealing with a hardware issue. A wild hypothesis is that this is power related. You could try replacing the power supplies of these machines.

Gatekeeper v1.2 RC2, which I released yesterday, runs on Ubuntu 24.04 LTS. The newer kernel might report more information on this issue.

ShawnLeung87 commented 1 day ago

I checked the logs. The related entries show fwupd automatically updating the firmware, but I am not sure whether this is the cause. I shut down this service a week ago, and Gatekeeper has been running well since then. I need to keep observing for a month. If there is no problem, I will close this issue.
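
For anyone who wants to confirm on their own servers whether fwupd is still running during the observation period, here is a minimal Python sketch. It rests on assumptions: the unit names `fwupd.service` and `fwupd-refresh.timer` are taken from Ubuntu's fwupd packaging and may differ elsewhere, the helper names are hypothetical, and this is not necessarily how the service was shut down in this report.

```python
import subprocess

# Assumed unit names (from Ubuntu's fwupd package); adjust for other systems.
FWUPD_UNITS = ("fwupd.service", "fwupd-refresh.timer")


def unit_state(unit):
    """Return what `systemctl is-active` prints for `unit`, or "unknown"."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", unit],
            capture_output=True,
            text=True,
            check=False,  # a non-zero exit just means the unit is not active
        )
    except OSError:
        # systemctl is missing or cannot be executed on this host.
        return "unknown"
    return result.stdout.strip() or "unknown"


def fwupd_report(units=FWUPD_UNITS):
    """Map each fwupd-related unit name to its reported state."""
    return {unit: unit_state(unit) for unit in units}


if __name__ == "__main__":
    for unit, state in fwupd_report().items():
        print(f"{unit}: {state}")
```

Checking the timer as well as the service matters because a stopped service can be started again by its timer. A state other than `active` for both units suggests that fwupd is not refreshing or applying updates in the background.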