Open bernhardschmidt opened 1 year ago
Hi,
On Wed, Jan 04, 2023 at 01:46:36AM -0800, xevilstar wrote:
> please take a look to https://github.com/OpenVPN/ovpn-dco/issues/11 and https://github.com/OpenVPN/ovpn-dco/issues/12 it doesn't even compile
Please do not mix issues.
Issue #13 is about a crash, on a Debian-supported kernel.
"Not compiling on a bleeding-edge kernel" is a different thing - which needs to be handled as well, of course, but "not in this issue".
gert
--
"If was one thing all people took for granted, was conviction that if you feed honest figures into a computer, honest figures come out. Never doubted it myself till I met a computer with a sense of humor."
Robert A. Heinlein, The Moon is a Harsh Mistress

Gert Doering - Munich, Germany @.***
May 13 12:41:26 XXXXXX kernel: [2856677.508346] tun5: deleting peer with id 2, reason 2
May 13 12:42:15 XXXXXX kernel: [2856726.661330] tun5: deleting peer with id 4, reason 2
May 13 12:42:15 XXXXXX kernel: [2856726.665323] tun5: deleting peer with id 0, reason 2
May 13 12:43:05 XXXXXX kernel: [2856775.798317] list_del corruption. next->prev should be ffff97c382cb8c60, but was 0000000000000000
May 13 12:43:05 XXXXXX kernel: [2856775.804950] ------------[ cut here ]------------
May 13 12:43:05 XXXXXX kernel: [2856775.808452] kernel BUG at lib/list_debug.c:54!
May 13 12:43:05 XXXXXX kernel: [2856775.811914] invalid opcode: 0000 [#1] SMP NOPTI
May 13 12:43:05 XXXXXX kernel: [2856775.815371] CPU: 3 PID: 194680 Comm: kworker/3:2 Tainted: G OE 5.10.0-14-cloud-amd64 #1 Debian 5.10.113-1
May 13 12:43:05 XXXXXX kernel: [2856775.822689] Hardware name: Amazon EC2 c6i.2xlarge/, BIOS 1.0 10/16/2017
May 13 12:43:05 XXXXXX kernel: [2856775.826949] Workqueue: ovpn-event-wq-tun5 ovpn_peer_delete_work [ovpn_dco_v2]
May 13 12:43:05 XXXXXX kernel: [2856775.832791] RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47
May 13 12:43:05 XXXXXX kernel: [2856775.836600] Code: c7 c7 70 d7 8d b4 e8 0c 15 ff ff 0f 0b 48 89 fe 48 c7 c7 00 d8 8d b4 e8 fb 14 ff ff 0f 0b 48 c7 c7 b0 d8 8d b4 e8 ed 14 ff ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 70 d8 8d b4 e8 d9 14 ff ff 0f 0b
May 13 12:43:05 XXXXXX kernel: [2856775.848432] RSP: 0018:ffffb4ca86013e40 EFLAGS: 00010246
May 13 12:43:05 XXXXXX kernel: [2856775.852145] RAX: 0000000000000054 RBX: ffff97c382cb8800 RCX: 0000000000000000
May 13 12:43:05 XXXXXX kernel: [2856775.858043] RDX: 0000000000000000 RSI: ffff97c65f0dc380 RDI: ffff97c65f0dc380
May 13 12:43:05 XXXXXX kernel: [2856775.863954] RBP: ffff97c382cb8b00 R08: 0000000000000000 R09: ffffb4ca86013c68
May 13 12:43:05 XXXXXX kernel: [2856775.869842] R10: ffffb4ca86013c60 R11: ffffffffb4eb3650 R12: ffff97c382cb8b00
May 13 12:43:05 XXXXXX kernel: [2856775.875734] R13: 0000000000000000 R14: ffffd4ca7fcda800 R15: 0000000000000000
May 13 12:43:05 XXXXXX kernel: [2856775.881641] FS: 0000000000000000(0000) GS:ffff97c65f0c0000(0000) knlGS:0000000000000000
May 13 12:43:05 XXXXXX kernel: [2856775.887897] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 13 12:43:05 XXXXXX kernel: [2856775.891797] CR2: 00007ffda50df328 CR3: 000000010b72e003 CR4: 00000000007706a0
May 13 12:43:05 XXXXXX kernel: [2856775.897761] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 13 12:43:05 XXXXXX kernel: [2856775.903693] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
May 13 12:43:05 XXXXXX kernel: [2856775.909578] PKRU: 55555554
May 13 12:43:05 XXXXXX kernel: [2856775.912329] Call Trace:
May 13 12:43:05 XXXXXX kernel: [2856775.914962] __netif_napi_del+0x6c/0x100
May 13 12:43:05 XXXXXX kernel: [2856775.918129] ovpn_peer_release+0x21/0x80 [ovpn_dco_v2]
May 13 12:43:05 XXXXXX kernel: [2856775.921814] ovpn_peer_delete_work+0x15/0x20 [ovpn_dco_v2]
May 13 12:43:05 XXXXXX kernel: [2856775.925618] process_one_work+0x1b6/0x350
May 13 12:43:05 XXXXXX kernel: [2856775.928850] worker_thread+0x53/0x3e0
May 13 12:43:05 XXXXXX kernel: [2856775.931950] ? process_one_work+0x350/0x350
May 13 12:43:05 XXXXXX kernel: [2856775.935254] kthread+0x11b/0x140
May 13 12:43:05 XXXXXX kernel: [2856775.938168] ? __kthread_bind_mask+0x60/0x60
May 13 12:43:05 XXXXXX kernel: [2856775.941487] ret_from_fork+0x1f/0x30
May 13 12:43:05 XXXXXX kernel: [2856775.944561] Modules linked in: ovpn_dco_v2(OE) tun sctp xt_MASQUERADE msr ipt_REJECT nf_reject_ipv4 xt_set iptable_security iptable_nat nf_nat iptable_raw iptable_mangle ip_set_hash_net ip_set udp_diag tcp_diag inet_diag binfmt_misc ip6_udp_tunnel udp_tunnel nf_conntrack_netlink nfnetlink xt_tcpudp xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter intel_rapl_msr intel_rapl_common iosf_mbi crct10dif_pclmul crc32_pclmul nls_ascii nls_cp437 ghash_clmulni_intel vfat fat aesni_intel crypto_simd cryptd glue_helper rapl evdev serio_raw button tcp_bbr uslk(OE) fuse configfs ip_tables x_tables autofs4 ena(OE) crc32c_intel [last unloaded: ovpn_dco]
May 13 12:43:05 XXXXXX kernel: [2856775.981196] ---[ end trace 5710b115da3c68b5 ]---
May 13 12:43:05 XXXXXX kernel: [2856775.984716] RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47
May 13 12:43:05 XXXXXX kernel: [2856775.988556] Code: c7 c7 70 d7 8d b4 e8 0c 15 ff ff 0f 0b 48 89 fe 48 c7 c7 00 d8 8d b4 e8 fb 14 ff ff 0f 0b 48 c7 c7 b0 d8 8d b4 e8 ed 14 ff ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 70 d8 8d b4 e8 d9 14 ff ff 0f 0b
May 13 12:43:05 XXXXXX kernel: [2856776.000441] RSP: 0018:ffffb4ca86013e40 EFLAGS: 00010246
May 13 12:43:05 XXXXXX kernel: [2856776.004154] RAX: 0000000000000054 RBX: ffff97c382cb8800 RCX: 0000000000000000
May 13 12:43:05 XXXXXX kernel: [2856776.010081] RDX: 0000000000000000 RSI: ffff97c65f0dc380 RDI: ffff97c65f0dc380
May 13 12:43:05 XXXXXX kernel: [2856776.016014] RBP: ffff97c382cb8b00 R08: 0000000000000000 R09: ffffb4ca86013c68
May 13 12:43:05 XXXXXX kernel: [2856776.021946] R10: ffffb4ca86013c60 R11: ffffffffb4eb3650 R12: ffff97c382cb8b00
May 13 12:43:05 XXXXXX kernel: [2856776.027920] R13: 0000000000000000 R14: ffffd4ca7fcda800 R15: 0000000000000000
May 13 12:43:05 XXXXXX kernel: [2856776.033818] FS: 0000000000000000(0000) GS:ffff97c65f0c0000(0000) knlGS:0000000000000000
May 13 12:43:05 XXXXXX kernel: [2856776.040116] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 13 12:43:05 XXXXXX kernel: [2856776.044041] CR2: 00007ffda50df328 CR3: 000000010b72e003 CR4: 00000000007706a0
May 13 12:43:05 XXXXXX kernel: [2856776.049967] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 13 12:43:05 XXXXXX kernel: [2856776.055926] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
May 13 12:43:05 XXXXXX kernel: [2856776.061855] PKRU: 55555554
May 13 12:43:54 XXXXXX kernel: [2856824.967287] tun5: deleting peer with id 6, reason 2
May 13 12:44:59 XXXXXX kernel: [2856890.504602] tun5: deleting peer with id 7, reason 2
I have experienced the same problem. However, there is no reliable way to reproduce it: the server has to run continuously for some time before it appears.
Versions: OpenVPN 2.6.3 and DCO 0.2.20230323.
@ordex so maybe this is a race between "userland trying to close the tun device" and "a peer exiting at this moment"? Or "userland trying to close the tun device, kernel trying to clean up all peers, and packets coming by at the wrong moment"?
We might be able to reproduce this by
@cron2 reference counting + RCU should really take care of all these cases. Therefore, if the bug is in this area, it is definitely well hidden.
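To illustrate the reference-counting argument, here is a minimal userspace sketch. It is not ovpn-dco's code: the `Peer` class and its `hold`/`put` methods are hypothetical, a lock stands in for the kernel's atomic counter, and RCU is not modelled at all. It only shows the property being relied on: however the final `put` calls interleave, the release step runs exactly once, and nobody can take a new reference afterwards.

```python
import threading


class Peer:
    """Hypothetical reference-counted object; not the real ovpn peer struct."""

    def __init__(self, peer_id):
        self.peer_id = peer_id
        self._refs = 1                 # the creator owns the first reference
        self._lock = threading.Lock()  # stands in for an atomic counter
        self.release_calls = 0

    def hold(self):
        """Take a reference; returns False once the count has dropped to zero."""
        with self._lock:
            if self._refs == 0:
                return False
            self._refs += 1
            return True

    def put(self):
        """Drop a reference; whoever drops the last one runs the release step."""
        with self._lock:
            if self._refs == 0:
                raise RuntimeError("put() without a matching reference")
            self._refs -= 1
            last = self._refs == 0
        if last:
            self._release()

    def _release(self):
        # Teardown (unlinking from lists, freeing) would happen here.
        self.release_calls += 1


peer = Peer(peer_id=2)
holders = 8
for _ in range(holders):
    peer.hold()

# The creator's reference plus the eight holders are dropped concurrently.
threads = [threading.Thread(target=peer.put) for _ in range(holders + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(peer.release_calls, peer.hold())
```

Under this scheme a second teardown of the same object would need an unbalanced `put` or a cleanup step that runs outside the counted path, which is the kind of well-hidden bug the comment above refers to.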
Can I do anything to help pinpoint this problem? This issue makes DCO unusable in my scenario.
@mr-liusg if you have any way to easily reproduce the issue, please let me know the steps. That is what would help the most.
Together with Gert we have been stress testing OpenVPN+DCO on one of our eduVPN nodes. It had been running for approximately 18 hours, serving 10000 connections, before this kernel crash happened.
Debian Bullseye, OpenVPN 2.6.0_rc1, ovpn-dco as of Oct 22nd.