evilsocket / opensnitch

OpenSnitch is a GNU/Linux interactive application firewall inspired by Little Snitch.
GNU General Public License v3.0

eBPF kernel oops after system/kernel update (5.14 kernel) #732

Closed (pk-pavlk closed this issue 1 year ago)

pk-pavlk commented 1 year ago

Describe the bug

I am getting a kernel oops after upgrading openSUSE Leap 15.3 to 15.4 (which bumped the kernel version from 5.3 to 5.14 and upgraded a lot of other packages):

BUG: unable to handle page fault for address: ffffffffc1eb5bcc
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
...
Call Trace:
 <TASK>
 ? udp_sendmsg+0x5/0xe50
 ? sock_sendmsg+0x58/0x70
 ? ____sys_sendmsg+0x1ee/0x250

(full dump below in the error log section)

The oops seems to be caused by the eBPF filter, since it always originates in tcp_v6_connect, udp_sendmsg, or tcp_v4_connect, all of which are probed by it.

This happens on every attempted network connection (and prevents the connection from working). It stops if I stop opensnitchd (which unloads the eBPF modules) or if I switch ProcMonitorMethod to something other than ebpf.
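As a rough illustration of the workaround mentioned above, the following sketch rewrites the ProcMonitorMethod key in a JSON daemon configuration. The key name and the value ebpf come from this report; the alternative values "proc" and "audit", and the idea that the setting lives in a JSON file such as /etc/opensnitchd/default-config.json, are assumptions that should be checked against the installed version. The helper name set_proc_monitor_method is hypothetical.

```python
import json

# Assumed set of accepted values; only "ebpf" is confirmed by this report.
ASSUMED_METHODS = {"proc", "audit", "ebpf"}


def set_proc_monitor_method(config_text, method="proc"):
    """Return the JSON config text with ProcMonitorMethod set to `method`.

    Works on a string so it can be tried without touching the real file.
    """
    if method not in ASSUMED_METHODS:
        raise ValueError("unexpected ProcMonitorMethod: %r" % method)
    config = json.loads(config_text)
    config["ProcMonitorMethod"] = method
    return json.dumps(config, indent=4)


# Example with a made-up, minimal config; other keys are left untouched.
sample = '{"ProcMonitorMethod": "ebpf", "LogLevel": 2}'
print(set_proc_monitor_method(sample, "proc"))
```

After writing the changed text back to the real configuration file, the daemon would presumably need a restart for the new method to take effect.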

I was running a self-compiled opensnitch 1.4.0rc2 (with self-compiled eBPF modules), which broke after the upgrade. I then tried the latest code from master (and compiled the latest eBPF modules as well), and it behaves the same.

I have read #297, but my call trace does not seem to include the [nfnetlink] calls, so I think this is a different problem. My libnetfilter_queue version is 1.0.3-1.16, but I am not sure which patches openSUSE includes (the version did not change during the 15.3 -> 15.4 upgrade).

Version information:

To Reproduce

I can reproduce it on 2 physical computers (same CPU and GPU models, different motherboards). I cannot reproduce this in a VM.

Post error logs:

BUG: unable to handle page fault for address: ffffffffc1eb5bcc
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 42b015067 P4D 42b015067 PUD 42b017067 PMD 0 
Oops: 0010 [#2] PREEMPT SMP NOPTI
CPU: 9 PID: 1672 Comm: chronyd Tainted: G      D           N 5.14.21-150400.24.18-default #1 SLE15-SP4 695ab7a8fc20f5ddb345280570966cd1eb06d469
Hardware name: ASUS System Product Name/TUF GAMING B550M-PLUS, BIOS 1202 10/22/2020
RIP: 0010:0xffffffffc1eb5bcc
Code: Unable to access opcode bytes at RIP 0xffffffffc1eb5ba2.
<register content omitted>
Call Trace:
 <TASK>
 ? udp_sendmsg+0x5/0xe50
 ? sock_sendmsg+0x58/0x70
 ? ____sys_sendmsg+0x1ee/0x250
 ? copy_msghdr_from_user+0x5c/0x90
 ? ___sys_sendmsg+0x88/0xd0
 ? release_sock+0x43/0x90
 ? sock_setsockopt+0x435/0xe00
 ? __sys_sendmsg+0x5e/0xa0
 ? __sys_sendmsg+0x5e/0xa0
 ? do_syscall_64+0x5b/0x80
 ? do_syscall_64+0x67/0x80
 ? syscall_exit_to_user_mode+0x18/0x40
 ? do_syscall_64+0x67/0x80
 ? syscall_exit_to_user_mode+0x18/0x40
 ? do_syscall_64+0x67/0x80
 ? irq_exit_rcu+0x41/0xc0
 ? entry_SYSCALL_64_after_hwframe+0x61/0xcb
 </TASK>
Modules linked in: nf_conntrack_netlink nft_queue nfnetlink_queue tcp_diag inet_diag af_packet nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter dmi_sysfs snd_hda_codec_realtek intel_rapl_msr nls_iso8859_1 intel_rapl_common snd_hda_codec_generic nls_cp437 ledtrig_audio vfat fat snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec edac_mce_amd ext4 snd_hda_core snd_hwdep snd_pcm eeepc_wmi(N) asus_wmi kvm_amd battery snd_timer sparse_keymap crc16 kvm irqbypass r8169 snd pcspkr mbcache platform_profile rfkill efi_pstore(N) realtek jbd2 video joydev wmi_bmof mdio_devres
 i2c_piix4 k10temp soundcore libphy acpi_cpufreq gpio_amdpt gpio_generic button fuse configfs ip_tables x_tables dm_crypt essiv authenc amdgpu hid_generic drm_ttm_helper ttm mfd_core iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper usbhid syscopyarea sysfillrect sysimgblt crc32_pclmul fb_sys_fops cec rc_core ahci xhci_pci ghash_clmulni_intel xhci_pci_renesas drm aesni_intel xhci_hcd crypto_simd cryptd libahci nvme ccp usbcore libata sp5100_tco(N) nvme_core t10_pi wmi btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod msr efivarfs
Supported: No, Unsupported modules are loaded
CR2: ffffffffc1eb5bcc
---[ end trace 9e6812b32a4c3ffa ]---
RIP: 0010:0xffffffffc1eb5bcc
Code: Unable to access opcode bytes at RIP 0xffffffffc1eb5ba2.
<register content omitted>

(There are variations that differ in the call trace, but the top function is always one of those probed by eBPF.)
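To check that observation across several saved oops dumps, a small script along these lines could pull the first frame out of each "Call Trace:" section and compare it against the probed functions. This is only a sketch: the set PROBED contains just the three names mentioned in this report, and the regular expression assumes the frame format shown in the log above (optionally with a dmesg timestamp prefix).

```python
import re

# Function names taken from this report; the real probe list may be longer.
PROBED = {"tcp_v4_connect", "tcp_v6_connect", "udp_sendmsg"}

# Matches frames such as " ? udp_sendmsg+0x5/0xe50", with an optional
# "[  123.456789]" timestamp in front.
FRAME_RE = re.compile(
    r"^(?:\[\s*[\d.]+\]\s*)?\s*\??\s*([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def top_frame(oops_text):
    """Return the first function name after 'Call Trace:', or None."""
    in_trace = False
    for line in oops_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            match = FRAME_RE.match(line)
            if match:  # skips marker lines such as <TASK>
                return match.group(1)
    return None


excerpt = """Call Trace:
 <TASK>
 ? udp_sendmsg+0x5/0xe50
 ? sock_sendmsg+0x58/0x70
"""
name = top_frame(excerpt)
print(name, name in PROBED)
```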

I have also tried using the non-stripped eBPF modules, but they do not seem to produce a more detailed error.

I am not sure how to debug this further. Is there anything I could test, or more information I could collect? Thanks for your help.

gustavo-iniguez-goya commented 1 year ago

Ouch. Thank you @pk-pavlk for reporting this problem, I'll try to reproduce it. I've tested it extensively with kernel 5.14, but on Debian and Ubuntu.

Could you install the bcc-tools package and run tcpconnect or one of the other tools to see if they also trigger a kernel oops? They're located under /usr/share/bcc/tools/

pk-pavlk commented 1 year ago

It seems I get the same oops when tcpconnect from bcc-tools is running (well, a similar one: this generates #PF: error_code(0x0011) - permissions violation, but I guess that depends on how exactly it goes wrong).
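For reference, the two error codes seen in this thread (0x0010 and 0x0011) differ only in the lowest bit. The sketch below decodes the low bits of an x86 page-fault error code, assuming the usual x86 layout (bit 0: protection violation vs. not-present page, bit 1: write, bit 2: user mode, bit 4: instruction fetch). The function name decode_pf_error_code is made up for this example.

```python
def decode_pf_error_code(code):
    """Decode the low bits of an x86 page-fault error code into a dict."""
    if code & 0x10:
        access = "instruction fetch"
    elif code & 0x02:
        access = "write access"
    else:
        access = "read access"
    return {
        # Bit 0 clear: page not present; set: protection violation.
        "cause": "permissions violation" if code & 0x01 else "not-present page",
        "access": access,
        # Bit 2 clear: fault raised while in supervisor (kernel) mode.
        "mode": "user" if code & 0x04 else "supervisor",
    }


for code in (0x0010, 0x0011):
    print(hex(code), decode_pf_error_code(code))
```

Both codes describe a supervisor-mode instruction fetch; only the cause differs, which fits the guess that it depends on how the jump goes wrong.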

Which also means that this is in no way a problem in the opensnitch eBPF code. Sorry for the false report, and thank you for pointing me to this way of testing it.

gustavo-iniguez-goya commented 1 year ago

woah! thank $god the oops are not caused by us :) These issues are always hard to debug.

I guess there is already an issue reported to SUSE, so if you have the link, please post it here. I'd like to read the details.

pk-pavlk commented 1 year ago

Yeah, I tried to debug it before I posted this and couldn't really find much on debugging bpf.

For now, I couldn't find a relevant bug on the SUSE (or any other) tracker. I will experiment further, and if I find a solution or file a bug report myself, I will post it here in case others encounter the same problem.

In the meantime, we can close this issue since it is not caused by opensnitch. Thank you for your help.

pk-pavlk commented 1 year ago

I have filed a bug report on the SUSE kernel tracker; it was indeed a kernel bug. It should be fixed with the next kernel release:

https://bugzilla.suse.com/show_bug.cgi?id=1203103

Thanks again for pointing me to the way to test this.