Open paulg1981 opened 5 years ago
Same issue after upgrading to the latest release. The device has been stuck in a reboot loop with occasional ping responses in between.
Package: wireguard-e50-0.0.20190227-1
Device: ER-X-SFP
Firmware: EdgeOS v1.10.9.5166958.190213.1952
Same issue for me on an ER-6P. I upgraded remotely and now the unit is down, with no Internet at the site. Once I get serial access I can post more info.
What testing is done on these packages prior to being released?
Mar 2 14:33:13 USG3P kernel: CPU 1 Unable to handle kernel paging request at virtual address 0000000000000000, epc == ffffffffc012ced8, ra == ffffffffc0b9314c
Mar 2 14:33:13 USG3P kernel: Oops[#1]:
Mar 2 14:33:13 USG3P kernel: CPU: 1 PID: 4103 Comm: ip Tainted: P O 3.10.107-UBNT #1
Mar 2 14:33:13 USG3P kernel: task: 800000041c20e0e0 ti: 800000000c030000 task.ti: 800000000c030000
Mar 2 14:33:13 USG3P kernel: $ 0 : 0000000000000000 0000000000000004 ffffffffc0660000 ffffffffc050b3e8
Mar 2 14:33:13 USG3P kernel: $ 4 : 0000000000000001 00000000000012d0 ffffffffc0b9314c 800000000c033670
Mar 2 14:33:13 USG3P kernel: $ 8 : ffffffffffffff9d 800000041d296cc0 ffffffffc050b3e8 000000001a5f4728
Mar 2 14:33:13 USG3P kernel: $12 : 0000000000000008 ffffffffc025c878 ffffffffd76c0898 0000000000000000
Mar 2 14:33:13 USG3P kernel: $16 : 800000041d296000 0000000000000000 00000000000012d0 ffffffffc0531980
Mar 2 14:33:13 USG3P kernel: $20 : 800000041db09e10 800000041d296000 0000000000000000 ffffffffc080a380
Mar 2 14:33:13 USG3P kernel: $24 : 0000000005733924 0000000027f2031c
Mar 2 14:33:13 USG3P kernel: $28 : 800000000c030000 800000000c033710 800000000c033780 ffffffffc0b9314c
Mar 2 14:33:13 USG3P kernel: Hi : 0000000000000000
Mar 2 14:33:13 USG3P kernel: Lo : 1dcbc89e99000000
Mar 2 14:33:13 USG3P kernel: epc : ffffffffc012ced8 kmem_cache_alloc+0x30/0x150
Mar 2 14:33:13 USG3P kernel: Tainted: P O
Mar 2 14:33:13 USG3P kernel: ra : ffffffffc0b9314c wg_pubkey_hashtable_alloc+0x1c/0xd8 [wireguard]
Mar 2 14:33:13 USG3P kernel: Status: 10008ce3 KX SX UX KERNEL EXL IE
Mar 2 14:33:13 USG3P kernel: Cause : 00800008
Mar 2 14:33:13 USG3P kernel: BadVA : 0000000000000000
Mar 2 14:33:13 USG3P kernel: PrId : 000d0601 (Cavium Octeon+)
Mar 2 14:33:13 USG3P kernel: Modules linked in: wireguard(O) ip_tunnel xt_mark xt_nat 8021q garp stp llc ipt_MASQUERADE xt_set nf_conntrack_ipv6 nf_defrag_ipv6 xt_comment xt_conntrack ip_set_bitmap_port xt_TCPMSS xt_tcpudp ip6table_mangle ip6table_filter ip6table_raw ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_mangle xt_CT iptable_raw nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_h323 nf_conntrack_h323 nf_nat_proto_gre nf_nat_tftp nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables ip_set_hash_net ip_set nfnetlink configfs unifigpio(PO) unifihal(PO) cvm_ipsec_kame(O) ipv6 imq cavium_ip_offload(PO) ubnt_nf_app(PO) tdts(PO) octeon_rng rng_core octeon_ethernet mdio_octeon ethernet_mem octeon_common of_mdio ubnt_platform(PO) libphy [last unloaded: nf_conntrack_sip]
Mar 2 14:33:13 USG3P kernel: Process ip (pid: 4103, threadinfo=800000000c030000, task=800000041c20e0e0, tls=0000000077a5b490)
Mar 2 14:33:13 USG3P kernel: Stack : 800000041d296000 800000041d296680 800000000c033780 ffffffffc0b9314c
Mar 2 14:33:13 USG3P kernel: 800000041d296000 ffffffffc0b8d01c 800000041db09e00 800000041db09e00
Mar 2 14:33:13 USG3P kernel: ffffffffc0531980 800000000c033780 ffffffffc0531980 ffffffffc0346a5c
Mar 2 14:33:13 USG3P kernel: 800000000c033780 ffffffffc0346768 0000000000000000 0000000000000000
Mar 2 14:33:13 USG3P kernel: 0000000000000000 800000041db09e20 0000000000000000 0000000000000000
Mar 2 14:33:13 USG3P kernel: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Mar 2 14:33:13 USG3P kernel: last message repeated 2 times
Mar 2 14:33:13 USG3P kernel: 800000041db09e28 0000000000000000 0000000000000000 0000000000000000
Mar 2 14:33:13 USG3P kernel: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Mar 2 14:33:13 USG3P kernel: ...
Mar 2 14:33:13 USG3P kernel: Call Trace:
Mar 2 14:33:13 USG3P kernel: [<ffffffffc012ced8>] kmem_cache_alloc+0x30/0x150
Mar 2 14:33:13 USG3P kernel: [<ffffffffc0b9314c>] wg_pubkey_hashtable_alloc+0x1c/0xd8 [wireguard]
Mar 2 14:33:13 USG3P kernel: [<ffffffffc0b8d01c>] wg_newlink+0xac/0x3c8 [wireguard]
Mar 2 14:33:13 USG3P kernel: [<ffffffffc0346a5c>] rtnl_newlink+0x434/0x538
Mar 2 14:33:13 USG3P kernel:
Mar 2 14:33:13 USG3P kernel:
Mar 2 14:33:13 USG3P kernel: Code: 0080882d ffb00000 9f840020 <de220000> 000420f8 0064202d dc840000 0044382d dcec0008
Mar 2 14:33:13 USG3P kernel: ---[ end trace 0588e2b9fdef1fd0 ]---
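For readers trying to make sense of the Code: line in the oops above: the word in angle brackets is the instruction at epc. The following is a minimal sketch, assuming the standard MIPS I-type field layout (6-bit opcode, two 5-bit register fields, 16-bit signed immediate). The struct and function names are made up for illustration and are not part of any kernel tooling.

```c
#include <stdint.h>

/* Fields of a MIPS I-type instruction word: opcode | rs | rt | imm16. */
struct mips_itype {
    unsigned opcode; /* bits 31..26 */
    unsigned rs;     /* bits 25..21, base register for loads and stores */
    unsigned rt;     /* bits 20..16, register being loaded or stored */
    int16_t imm;     /* bits 15..0, signed byte offset */
};

/* Split a raw 32-bit instruction word into its I-type fields. */
static struct mips_itype decode_itype(uint32_t word)
{
    struct mips_itype f;

    f.opcode = (word >> 26) & 0x3f;
    f.rs = (word >> 21) & 0x1f;
    f.rt = (word >> 16) & 0x1f;
    f.imm = (int16_t)(word & 0xffff);
    return f;
}
```

Decoding 0xde220000 this way gives opcode 0x37 (the MIPS64 doubleword load, ld), base register $17, target register $2 and offset 0. In the register dump above $17 is 0, which agrees with the reported BadVA of 0.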
Seems to be quite an issue. Maybe pull this release until more is known about why this is happening.
@phillipmcmahon Agreed. I've pulled the 1.10 packages for now. As for testing before release - most of the time, there is none, as I don't really have equipment to test on.
@NimlothPL Thanks for the stacktrace. Seems related to this commit. I'll ask Jason about it.
I was able to fix this on a USG 4 Pro with the help of single user mode.
I connected a serial console cable and then caught the U-Boot console by pressing a key before it continued booting. You should see something like:
U-Boot 2012.04.01 (UBNT Build Version: e221_002_01aa9) (Aug 17 2018 - 01:13:14)
Skipping PCIe port 0 BIST, in EP mode, can't tell if clocked.
Skipping PCIe port 1 BIST, reset not done. (port not configured)
BIST check passed.
UBNT_E220 r1:1, r2:14, serial #: 000000FFFFFF
MPR 13-02102-14
Core clock: 1000 MHz, IO clock: 600 MHz, DDR clock: 533 MHz (1066 Mhz DDR)
Base DRAM address used by u-boot: 0x8f800000, size: 0x800000
DRAM: 2 GiB
Clearing DRAM...... done
Flash: 8 MiB
Net: octeth0, octeth1, octeth2, octeth3
MMC: Octeon MMC/SD0: 0
USB: USB EHCI 1.00
scanning bus for devices... 1 USB Device(s) found
Type the command 'usb start' to scan for USB storage devices.
Hit any key to stop autoboot: 0
Octeon ubnt_e220#
Once in the U-Boot console I ran printenv to find the bootcmd value.
Octeon ubnt_e220# printenv
autoload=n
baudrate=115200
boardname=ubnt_e220
bootcmd=fatload mmc 0 $(loadaddr) vmlinux.64;bootoctlinux $(loadaddr) numcores=2 endbootargs mem=0 root=/dev/mmcblk0p2 rootdelay=10 rw rootsqimg=squashfs.img rootsqwdir=w mtdparts=phys_mapped_flash:640k(boot0),640k(boot1),64k(eeprom)
bootdelay=0
I copied the value for bootcmd and appended single, which told the kernel to boot to single user mode.
The actual command I ran at the U-Boot console was:
fatload mmc 0 $(loadaddr) vmlinux.64;bootoctlinux $(loadaddr) numcores=2 endbootargs mem=0 root=/dev/mmcblk0p2 rootdelay=10 rw rootsqimg=squashfs.img rootsqwdir=w mtdparts=phys_mapped_flash:640k(boot0),640k(boot1),64k(eeprom) single
Once in single user mode I uninstalled the deb package using dpkg and then rebooted.
dpkg --remove wireguard
shutdown -r now
If you're on a Unifi-enabled board you'll get provisioning errors when the Unifi controller tries to commit a config that specifies a WireGuard interface (assuming you persisted the WireGuard config using a config.gateway.json file on the controller). Simply ignore that, then install the working version and let the controller re-provision the device now that it'll know what a wireguard interface type is.
Thanks for the report. I'll look into it.
I'm happy to test basic install, reboot and simple functionality on the hardware I have. ER-X-SFP and an ERX-6P, these run the 1.10 branch of firmware.
If you've got a working toolchain, would you mind building with this patch and letting me know if that "fixes" it?
diff --git a/src/compat/compat.h b/src/compat/compat.h
index 7a61e4c1..7c2d5125 100644
--- a/src/compat/compat.h
+++ b/src/compat/compat.h
@@ -466,11 +466,13 @@ static inline void *kvmalloc_ours(size_t size, gfp_t flags)
{
gfp_t kmalloc_flags = flags;
void *ret;
+#ifndef CONFIG_CAVIUM_OCTEON_IPFWD_OFFLOAD
if (size > PAGE_SIZE) {
kmalloc_flags |= __GFP_NOWARN;
if (!(kmalloc_flags & __GFP_REPEAT) || (size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
kmalloc_flags |= __GFP_NORETRY;
}
+#endif
ret = kmalloc(size, kmalloc_flags);
if (ret || size <= PAGE_SIZE)
return ret;
Same issue on my ER-4 with FW v2.0.0. I ran make deb-e300 from commit 2877098c743eb5ca74ded644a108f592728c2876 of the v2.0 branch. Had to use the reset button and restore a backup.
[** ] A start job is running for UBNT Routing Daemons (57s / no limit)CPU 2 Unable to handle kernel paging request at virtual address 0000000400000000, epc == ffffffff80956b74, ra == 8
Oops[#1]:
CPU: 2 PID: 3995 Comm: ip Tainted: P O 4.9.79-UBNT #1
task: 800000004d322700 task.stack: 800000004421c000
$ 0 : 0000000000000000 0000000000000000 ffffffff80f70000 ffffffff80def658
$ 4 : 0000000400000000 0000000000000002 0000000000000000 ffffffffc056bd48
$ 8 : 000000006239a4de ffffffff80def658 da451be76a5f3a20 a7fdf6cb8743060e
$12 : 0000000000000000 ffffffff80ab969c 0000000028bcd81f 800000004d01bda8
$16 : 0000000400000000 ffffffff808c0000 00000000024012c0 0000000000000001
$20 : 800000004d01b780 ffffffffc0570000 ffffffff80e1eb00 ffffffffc0581e90
$24 : 000000001215c592 ffffffffd8a70a1c
$28 : 800000004421c000 800000004421f7a0 800000004421f830 ffffffffc056bd48
Hi : 0000000000000006
Lo : ccccccccccccccd7
epc : ffffffff80956b74 kmem_cache_alloc+0x34/0x160
ra : ffffffffc056bd48 wg_pubkey_hashtable_alloc+0x28/0xe8 [wireguard]
Status: 10009ce3 KX SX UX KERNEL EXL IE
Cause : 00800008 (ExcCode 02)
BadVA : 0000000400000000
PrId : 000d9602 (Cavium Octeon III)
Modules linked in: wireguard(O) ip6_udp_tunnel udp_tunnel 8021q garp stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_NETMAP xt_set nf_log_ipv4 ipt_REJECT nf_reject_ipv4 nf_log_ipv6 nf_l6
Process ip (pid: 3995, threadinfo=800000004421c000, task=800000004d322700, tls=00000000770cb490)
Stack : ffffffff80956b40 ffffffff808c0000 ffffffff808bbb10 ffffffffc056bd48
800000004d01b000 ffffffffc056572c 0000000000000003 800000004d01b000
ffffffff80e1eb00 8000000047cf0000 0000000000000000 800000004421f830
0000000000000000 ffffffff80c1223c 0000000000000000 0000000000000000
8000000047cf0000 ffffffff80c11d3c 0000000000000000 0000000000000000
0000000000000000 8000000047cf0020 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
8000000047cf0028 0000000000000000 0000000000000000 0000000000000000
...
Call Trace:
[<ffffffff80956b74>] kmem_cache_alloc+0x34/0x160
[<ffffffffc056bd48>] wg_pubkey_hashtable_alloc+0x28/0xe8 [wireguard]
[<ffffffffc056572c>] wg_newlink+0xdc/0x3e0 [wireguard]
[<ffffffff80c1223c>] rtnl_newlink+0x674/0x750
Code: 00a0902d 0060482d 9f850018 <de020000> 000528f8 7c652a0a 64420008 7c45620a 9f880018
---[ end trace d08fbf877d376bec ]---
Kernel panic - not syncing: Fatal exception
Rebooting in 60 seconds..
@zx2c4 I tried your patch but it didn't help on my ER-4 (v2.0.0, kernel 4.9.79).
I changed your #ifndef to #if !defined(CONFIG_CAVIUM_OCTEON_IPFWD_OFFLOAD) && !defined(CONFIG_CAVIUM_IPFWD_OFFLOAD) since it looks like the config name changed in the new kernel (verified with #error that the block wasn't compiled in), but still the same panic when I create a wireguard device.
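The #error check mentioned here can be sketched in plain C. This is a userspace illustration only, assuming neither CONFIG_* macro is passed on the compiler command line; in a real module build those macros would come from the kernel configuration headers. GUARDED_BLOCK_COMPILED and guarded_block_compiled() are hypothetical names.

```c
#include <stdbool.h>

/*
 * The guard is true only when neither spelling of the config option is
 * defined, so the guarded code is skipped on either kernel generation.
 */
#if !defined(CONFIG_CAVIUM_OCTEON_IPFWD_OFFLOAD) && !defined(CONFIG_CAVIUM_IPFWD_OFFLOAD)
/*
 * Temporarily uncommenting the next line makes the build fail whenever this
 * branch is taken. A build that still succeeds with it in place shows that
 * the guarded block is not being compiled in.
 */
/* #error "guarded block is compiled in" */
#define GUARDED_BLOCK_COMPILED 1
#else
#define GUARDED_BLOCK_COMPILED 0
#endif

/* Report at run time which branch the preprocessor selected. */
static bool guarded_block_compiled(void)
{
    return GUARDED_BLOCK_COMPILED;
}
```

Compiling the same file with -DCONFIG_CAVIUM_IPFWD_OFFLOAD=1 flips the result, which mirrors what the two-macro guard is meant to achieve.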
@Lochnair wireguard-v2.0-e300-0.0.20190227-1.deb from the 0.0.20190227 github release panics for me, you may want to pull the v2.0 binaries too.
Alright let's take it a step further then and use an entirely different allocator and see if that makes the problem go away. Then at least we'll have some idea of what we're looking at:
diff --git a/src/compat/compat.h b/src/compat/compat.h
index 7a61e4c1..cbf9427a 100644
--- a/src/compat/compat.h
+++ b/src/compat/compat.h
@@ -464,6 +464,7 @@ static inline __be32 our_inet_confirm_addr(struct net *net, struct in_device *in
#include <linux/slab.h>
static inline void *kvmalloc_ours(size_t size, gfp_t flags)
{
+#ifndef CONFIG_CAVIUM_OCTEON_IPFWD_OFFLOAD
gfp_t kmalloc_flags = flags;
void *ret;
if (size > PAGE_SIZE) {
@@ -474,6 +475,7 @@ static inline void *kvmalloc_ours(size_t size, gfp_t flags)
ret = kmalloc(size, kmalloc_flags);
if (ret || size <= PAGE_SIZE)
return ret;
+#endif
return __vmalloc(size, flags, PAGE_KERNEL);
}
static inline void *kvzalloc_ours(size_t size, gfp_t flags)
Is this the right firmware for that stacktrace, btw? https://dl.ubnt.com/firmwares/edgemax/v2.0.x/ER-e300.v2.0.0.5155284.tar
@zx2c4 Thanks, it looks like this patch works!
For the 4.9 kernel I changed your patch slightly, since the _OCTEON was removed from the config name (and code was moved from arch/mips/cavium-octeon to drivers/net/ethernet/cavium/octeon)
diff --git a/src/compat/compat.h b/src/compat/compat.h
index 7a61e4c..0131d22 100644
--- a/src/compat/compat.h
+++ b/src/compat/compat.h
@@ -464,6 +464,7 @@ static inline __be32 our_inet_confirm_addr(struct net *net, struct in_device *in
#include <linux/slab.h>
static inline void *kvmalloc_ours(size_t size, gfp_t flags)
{
+#if !defined(CONFIG_CAVIUM_OCTEON_IPFWD_OFFLOAD) && !defined(CONFIG_CAVIUM_IPFWD_OFFLOAD)
gfp_t kmalloc_flags = flags;
void *ret;
if (size > PAGE_SIZE) {
@@ -474,6 +475,7 @@ static inline void *kvmalloc_ours(size_t size, gfp_t flags)
ret = kmalloc(size, kmalloc_flags);
if (ret || size <= PAGE_SIZE)
return ret;
+#endif
return __vmalloc(size, flags, PAGE_KERNEL);
}
static inline void *kvzalloc_ours(size_t size, gfp_t flags)
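To make the effect of the patch easier to follow, here is a rough userspace sketch of the "try the small allocator first, then fall back" shape of kvmalloc_ours. It is not the module's code: small_alloc() and big_alloc() are hypothetical stand-ins for kmalloc() and __vmalloc(), the page size is assumed to be 4096, and the skip_small flag plays the role that the #if guard plays at compile time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size, illustration only */

enum alloc_path { PATH_NONE, PATH_SMALL, PATH_FALLBACK };

/* Hypothetical stand-in for kmalloc(): refuses anything above two pages. */
static void *small_alloc(size_t size)
{
    if (size > 2 * SKETCH_PAGE_SIZE)
        return NULL;
    return malloc(size);
}

/* Hypothetical stand-in for __vmalloc(). */
static void *big_alloc(size_t size)
{
    return malloc(size);
}

/*
 * With skip_small == false this mirrors the unpatched flow: try the small
 * allocator, and fall back only for requests larger than a page.
 * With skip_small == true it mirrors the patched flow: the small allocator
 * is never called at all.
 */
static void *kvmalloc_sketch(size_t size, bool skip_small, enum alloc_path *path)
{
    void *ret;

    *path = PATH_NONE;
    if (!skip_small) {
        ret = small_alloc(size);
        if (ret) {
            *path = PATH_SMALL;
            return ret;
        }
        if (size <= SKETCH_PAGE_SIZE)
            return NULL; /* small request failed, no fallback */
    }
    ret = big_alloc(size);
    if (ret)
        *path = PATH_FALLBACK;
    return ret;
}
```

Note that in the traces in this thread the kernel faults inside the first allocator rather than getting NULL back from it, so the built-in fallback never gets a chance to run. That is why the patch bypasses the first allocator entirely instead of relying on the fallback.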
Yes, that's the right firmware for my stacktrace (but @NimlothPL's earlier in the thread is for a different firmware/kernel/hardware).
Ubiquiti still hasn't updated their downloads page for v2.0, nor provided a final GPL archive, so I'm building with kernel source from the v2.0.0/master branch of @Lochnair's kernel_e300 repo (based on UBNT's 2.0.0-beta2 GPL release).
Do you need CONFIG_CAVIUM_IPFWD_OFFLOAD specified in the other part of compat.h where we special case weird offloading logic?
I didn't touch that part of compat.h when building, but it looks like CONFIG_CAVIUM_IPFWD_OFFLOAD should be included there too. (all I've tested so far is simple pings that probably don't touch the offload engine)
In skbuff.h, struct cvm_packet_info cvm_info; is added to sk_buff for #ifdef CONFIG_CAVIUM_NET_PACKET_FWD_OFFLOAD
> I didn't touch that part of compat.h when building, but it looks like CONFIG_CAVIUM_IPFWD_OFFLOAD should be included there too. (all I've tested so far is simple pings that probably don't touch the offload engine)
Before I add it, I'd be very grateful if you could do some comparison to show that it's the right thing to do.
Also, with regards to the real bug here, we now know there's something gravely wrong with the slab allocator (kmalloc_caches[15] is an invalid pointer), but we don't know why or how to mitigate that. Think you could send me the output of cat /proc/slabinfo?
> For the 4.9 kernel I changed your patch slightly
Woah woah are you saying that this bug is present on their 4.9 kernel too? Not just their 3.10? Or did you not actually try to trigger it on the 4.9 yet?
> Before I add it, I'd be very grateful if you could do some comparison to show that it's the right thing to do.
Checking that now and doing some iperf3 benchmarking.
> are you saying that this bug is present on their 4.9 kernel too?
Yep, all of my building/testing today has been on the 4.9 kernel, I don't have 3.10 running on anything (and it'd probably be tricky to downgrade)
Gotcha, thanks for clarifying. I've been looking at the wrong kernel sources! Awaiting cat /proc/slabinfo when you have a chance.
Here's /proc/slabinfo. wireguard is loaded and configured with only the allocator change made to compat.h (not skb_scrub_packet).
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nf_conntrack_expect 0 0 224 18 1 : tunables 0 0 0 : slabdata 0 0 0
nf_conntrack 156 315 384 21 2 : tunables 0 0 0 : slabdata 15 15 0
ip6-frags 0 0 200 20 1 : tunables 0 0 0 : slabdata 0 0 0
tw_sock_TCPv6 16 16 248 16 1 : tunables 0 0 0 : slabdata 1 1 0
request_sock_TCPv6 0 0 304 26 2 : tunables 0 0 0 : slabdata 0 0 0
TCPv6 64 64 2048 16 8 : tunables 0 0 0 : slabdata 4 4 0
cfq_queue 68 68 240 17 1 : tunables 0 0 0 : slabdata 4 4 0
mqueue_inode_cache 18 18 896 18 4 : tunables 0 0 0 : slabdata 1 1 0
fat_inode_cache 0 0 656 24 4 : tunables 0 0 0 : slabdata 0 0 0
fat_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
squashfs_inode_cache 2925 2925 640 25 4 : tunables 0 0 0 : slabdata 117 117 0
jbd2_transaction_s 64 64 256 16 1 : tunables 0 0 0 : slabdata 4 4 0
jbd2_journal_handle 340 340 48 85 1 : tunables 0 0 0 : slabdata 4 4 0
jbd2_journal_head 340 340 120 34 1 : tunables 0 0 0 : slabdata 10 10 0
jbd2_revoke_table_s 256 256 16 256 1 : tunables 0 0 0 : slabdata 1 1 0
jbd2_revoke_record_s 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
ext2_inode_cache 0 0 712 23 4 : tunables 0 0 0 : slabdata 0 0 0
ext4_inode_cache 306 306 936 17 4 : tunables 0 0 0 : slabdata 18 18 0
ext4_allocation_context 128 128 128 32 1 : tunables 0 0 0 : slabdata 4 4 0
ext4_system_zone 102 102 40 102 1 : tunables 0 0 0 : slabdata 1 1 0
ext4_io_end 384 384 64 64 1 : tunables 0 0 0 : slabdata 6 6 0
ext4_extent_status 510 510 40 102 1 : tunables 0 0 0 : slabdata 5 5 0
mbcache 0 0 56 73 1 : tunables 0 0 0 : slabdata 0 0 0
dio 0 0 640 25 4 : tunables 0 0 0 : slabdata 0 0 0
posix_timers_cache 18 18 216 18 1 : tunables 0 0 0 : slabdata 1 1 0
UNIX 224 224 1152 28 8 : tunables 0 0 0 : slabdata 8 8 0
ip4-frags 44 44 184 22 1 : tunables 0 0 0 : slabdata 2 2 0
flow_cache 144 144 112 36 1 : tunables 0 0 0 : slabdata 4 4 0
tw_sock_TCP 64 64 248 16 1 : tunables 0 0 0 : slabdata 4 4 0
request_sock_TCP 104 104 304 26 2 : tunables 0 0 0 : slabdata 4 4 0
TCP 68 68 1920 17 8 : tunables 0 0 0 : slabdata 4 4 0
hugetlbfs_inode_cache 29 29 552 29 4 : tunables 0 0 0 : slabdata 1 1 0
eventpoll_pwq 280 280 72 56 1 : tunables 0 0 0 : slabdata 5 5 0
inotify_inode_mark 184 184 88 46 1 : tunables 0 0 0 : slabdata 4 4 0
request_queue 17 17 1848 17 8 : tunables 0 0 0 : slabdata 1 1 0
blkdev_requests 552 552 344 23 2 : tunables 0 0 0 : slabdata 24 24 0
blkdev_ioc 156 156 104 39 1 : tunables 0 0 0 : slabdata 4 4 0
sock_inode_cache 300 300 640 25 4 : tunables 0 0 0 : slabdata 12 12 0
file_lock_cache 76 76 208 19 1 : tunables 0 0 0 : slabdata 4 4 0
net_namespace 0 0 5632 5 8 : tunables 0 0 0 : slabdata 0 0 0
shmem_inode_cache 2025 2025 640 25 4 : tunables 0 0 0 : slabdata 81 81 0
proc_inode_cache 1695 1728 592 27 4 : tunables 0 0 0 : slabdata 64 64 0
sigqueue 100 100 160 25 1 : tunables 0 0 0 : slabdata 4 4 0
bdev_cache 84 84 768 21 4 : tunables 0 0 0 : slabdata 4 4 0
kernfs_node_cache 10132 10132 120 34 1 : tunables 0 0 0 : slabdata 298 298 0
mnt_cache 210 210 384 21 2 : tunables 0 0 0 : slabdata 10 10 0
inode_cache 4857 5490 536 30 4 : tunables 0 0 0 : slabdata 183 183 0
dentry 23463 24696 192 21 1 : tunables 0 0 0 : slabdata 1176 1176 0
iint_cache 0 0 80 51 1 : tunables 0 0 0 : slabdata 0 0 0
buffer_head 31356 31356 104 39 1 : tunables 0 0 0 : slabdata 804 804 0
nsproxy 292 292 56 73 1 : tunables 0 0 0 : slabdata 4 4 0
files_cache 105 105 768 21 4 : tunables 0 0 0 : slabdata 5 5 0
signal_cache 396 396 896 18 4 : tunables 0 0 0 : slabdata 22 22 0
sighand_cache 153 161 4224 7 8 : tunables 0 0 0 : slabdata 23 23 0
task_struct 232 243 3328 9 8 : tunables 0 0 0 : slabdata 27 27 0
anon_vma 4736 4736 64 64 1 : tunables 0 0 0 : slabdata 74 74 0
shared_policy_node 340 340 48 85 1 : tunables 0 0 0 : slabdata 4 4 0
numa_policy 170 170 24 170 1 : tunables 0 0 0 : slabdata 1 1 0
radix_tree_node 1708 1708 584 28 4 : tunables 0 0 0 : slabdata 61 61 0
idr_layer_cache 255 255 2096 15 8 : tunables 0 0 0 : slabdata 17 17 0
kmalloc-8192 80 80 8192 4 8 : tunables 0 0 0 : slabdata 20 20 0
kmalloc-4096 1354 1808 4096 8 8 : tunables 0 0 0 : slabdata 226 226 0
kmalloc-2048 306 320 2048 16 8 : tunables 0 0 0 : slabdata 20 20 0
kmalloc-1024 1605 1664 1024 16 4 : tunables 0 0 0 : slabdata 104 104 0
kmalloc-512 3051 3552 512 16 2 : tunables 0 0 0 : slabdata 222 222 0
kmalloc-256 1738 1984 256 16 1 : tunables 0 0 0 : slabdata 124 124 0
kmalloc-192 5985 5985 192 21 1 : tunables 0 0 0 : slabdata 285 285 0
kmalloc-128 15360 15552 128 32 1 : tunables 0 0 0 : slabdata 486 486 0
kmalloc-96 7350 7350 96 42 1 : tunables 0 0 0 : slabdata 175 175 0
kmalloc-64 18221 20032 64 64 1 : tunables 0 0 0 : slabdata 313 313 0
kmalloc-32 1664 1664 32 128 1 : tunables 0 0 0 : slabdata 13 13 0
kmalloc-16 2304 2304 16 256 1 : tunables 0 0 0 : slabdata 9 9 0
kmalloc-8 6144 6144 8 512 1 : tunables 0 0 0 : slabdata 12 12 0
kmem_cache_node 128 128 64 64 1 : tunables 0 0 0 : slabdata 2 2 0
kmem_cache 80 80 256 16 1 : tunables 0 0 0 : slabdata 5 5 0
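One way to read a dump like the one above is to pick out the kmalloc-N caches and note the largest N. The sketch below does that over slabinfo-style text held in a string. It is a userspace helper written for illustration, largest_kmalloc_cache() is a hypothetical name, and it assumes each cache sits on its own line starting with its name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the largest N found on lines that start with "kmalloc-N",
 * or 0 if there is no such line.
 */
static unsigned long largest_kmalloc_cache(const char *text)
{
    unsigned long best = 0;
    const char *line = text;

    while (line && *line) {
        unsigned long n;

        /* Matches only when the line begins with the literal "kmalloc-". */
        if (sscanf(line, "kmalloc-%lu", &n) == 1 && n > best)
            best = n;

        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return best;
}
```

Applied to the listing above, the largest general-purpose cache is kmalloc-8192. If cache index i corresponds to objects of 2^i bytes, as it does for the power-of-two caches in mainline, then index 15 would be a 32768-byte cache, which does not appear in this list. Whether that observation explains the invalid kmalloc_caches[15] pointer is not established in this thread.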
Rebuilt wireguard with skb_scrub_packet patched for CONFIG_CAVIUM_IPFWD_OFFLOAD and it works too.
iperf3 might be slightly faster when terminating wireguard in the ER4 and then forwarding to a LAN host with the skb_scrub_packet patch, but it was pretty close.
This is a bit of a frustrating situation as I don't have things set up to keep trying stuff, so it's quite hard to debug, and the octeon kernel won't build for qemu. If you've got a lot of patience, there are a million things I'm curious about in trying to track this bug down. For example:
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 622f6b6ae..29861409a 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -980,6 +980,7 @@ static void __init new_kmalloc_cache(int idx, unsigned long flags)
{
kmalloc_caches[idx] = create_kmalloc_cache(kmalloc_info[idx].name,
kmalloc_info[idx].size, flags);
+ pr_err("SARU making cache %d is 0x%llx called %s size %lu flags 0x%x\n", idx, kmalloc_caches[idx], kmalloc_info[idx].name, kmalloc_info[idx].size, flags);
}
/*
@@ -992,6 +993,7 @@ void __init create_kmalloc_caches(unsigned long flags)
int i;
for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
+ pr_err("SARU iteration %d, pre-state: 0x%llx\n", i, kmalloc_caches[i]);
if (!kmalloc_caches[i])
new_kmalloc_cache(i, flags);
Got IRC or something? Might be easier to work through it there, if you're up for that.
I can dig up an IRC client, but I'm not super comfortable testing out kernel patches. When I soft-bricked at first, I wasn't able to break into a bootloader shell and don't know what would happen if I got stuck with an unbootable kernel.
Happy to test out wireguard patches as long as my roommate's not using the internet.
P.S. I sympathize with the struggle of debugging without hardware, and really appreciate your help on this issue!
Okay what if you patch wireguard with the below and see at which point it crashes (i.e. send me the whole dmesg output):
diff --git a/src/main.c b/src/main.c
index 4b5b58e8..cda15a94 100644
--- a/src/main.c
+++ b/src/main.c
@@ -20,8 +20,20 @@
static int __init mod_init(void)
{
+ unsigned long i;
+ void *ohnose;
int ret;
+ for (i = 0; i < ilog2(0x100000000); ++i) {
+ pr_err("About to allocate size %lu, index %d", 1UL << i, kmalloc_index(1UL << i));
+ ohnose = kmalloc(1UL << i, GFP_KERNEL);
+ if (!ohnose) {
+ pr_err("Allocation failed at size %lu\n", 1UL << i);
+ break;
+ }
+ kfree(ohnose);
+ }
+
if ((ret = chacha20_mod_init()) || (ret = poly1305_mod_init()) ||
(ret = chacha20poly1305_mod_init()) || (ret = blake2s_mod_init()) ||
(ret = curve25519_mod_init()))
Sure, I can try that out (as soon as I can find a reasonable maintenance window). One issue is that systemd seems to capture most of the kernel output once it starts so the prints before the panic might get dropped. I'll play around with printk levels to see if I can make them hit the console unconditionally.
Those are pr_err prints, so they should be somewhat unconditional.
I wasn't aware edgemax had moved to systemd.
Yeah, EdgeOS v2.0 switched to Debian Stretch with systemd. Here's the output after insmod with the kmalloc patch. Interestingly it didn't panic in this context. I did rmmod wireguard then insmod /tmp/wireguard.ko.
Here's the dmesg output starting after the insmod. Did you want the full log starting at boot?
[94275.974092] wireguard: About to allocate size 1, index 5
[94275.977934] wireguard: About to allocate size 2, index 5
[94275.981942] wireguard: About to allocate size 4, index 5
[94275.985803] wireguard: About to allocate size 8, index 5
[94275.989814] wireguard: About to allocate size 16, index 5
[94275.993733] wireguard: About to allocate size 32, index 5
[94275.997839] wireguard: About to allocate size 64, index 6
[94276.001759] wireguard: About to allocate size 128, index 7
[94276.005948] wireguard: About to allocate size 256, index 8
[94276.009955] wireguard: About to allocate size 512, index 9
[94276.014144] wireguard: About to allocate size 1024, index 10
[94276.018324] wireguard: About to allocate size 2048, index 11
[94276.022679] wireguard: About to allocate size 4096, index 12
[94276.026867] wireguard: About to allocate size 8192, index 13
[94276.031223] wireguard: About to allocate size 16384, index 14
[94276.035506] wireguard: About to allocate size 32768, index 15
[94276.039951] wireguard: About to allocate size 65536, index 16
[94276.044235] wireguard: About to allocate size 131072, index 17
[94276.048768] wireguard: About to allocate size 262144, index 18
[94276.053128] wireguard: About to allocate size 524288, index 19
[94276.057679] wireguard: About to allocate size 1048576, index 20
[94276.062147] wireguard: About to allocate size 2097152, index 21
[94276.066814] wireguard: About to allocate size 4194304, index 22
[94276.071356] wireguard: About to allocate size 8388608, index 23
[94276.076194] wireguard: About to allocate size 16777216, index 24
[94276.081217] wireguard: About to allocate size 33554432, index 25
[94276.087004] wireguard: About to allocate size 67108864, index 26
[94276.091534] ------------[ cut here ]------------
[94276.094880] WARNING: CPU: 0 PID: 19738 at mm/page_alloc.c:3544 __alloc_pages_nodemask+0x2f8/0xca8
[94276.102452] Modules linked in: wireguard(O+) sch_fq_codel sch_htb xt_nat xt_multiport ip6_udp_tunnel udp_tunnel 8021q garp stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_NETMAP xt_set nf_log_ipv4 ipt_REJECT nf_reject_ipv4 nf_log_ipv6 nf_log_common nf_conntrack_ipv6 nf_defrag_ipv6 xt_LOG xt_tcpudp xt_comment xt_conntrack ip_set_bitmap_port ip6table_mangle ip6table_filter ip6table_raw ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_mangle xt_CT iptable_raw nf_nat_h323 nf_conntrack_h323 nf_nat_sip nf_conntrack_sip nf_nat_tftp nf_nat_ftp nf_conntrack_tftp nf_conntrack_ftp ip_set_hash_net ip_set nfnetlink iptable_filter cvm_ipsec_kame(O) imq cavium_ip_offload(O) ubnt_nf_app(O) tdts(PO) octeon_rng rng_core nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre
[94276.172413] nf_nat nf_conntrack ubnt_platform(PO) ip_tables x_tables ipv6 [last unloaded: wireguard]
[94276.180422] CPU: 0 PID: 19738 Comm: insmod Tainted: P O 4.9.79-UBNT #1
[94276.186772] Stack : 0000000000000000 0000000000000004 0000000000000006 0000000000000000
[94276.193528] ffffffff80e00000 ffffffff80f65eb0 ffffffff80f60000 ffffffff80e00000
[94276.200283] 0000000000000000 0000000000000000 0000000000000047 0000000000000000
[94276.207037] ffffffff80f60000 ffffffff808c07c8 0000000000000004 ffffffff808c18c8
[94276.213791] 0000000000000000 0000000000000000 0000000000000000 ffffffff80f60000
[94276.220545] ffffffff80d7a468 ffffffff80df3f07 8000000046418d00 ffffffff80f5c300
[94276.227300] 0000000000004d1a 0000000000000000 0000000000100001 ffffffff808fae64
[94276.234054] ffffffff808e7b20 8000000047cbb860 8000000047cbb978 ffffffff80aa9234
[94276.240809] 0000000000000000 ffffffff808c2000 000000000000000a ffffffff80d7a468
[94276.247563] 0000000000000000 ffffffff808601c8 0000000000000000 0000000000000000
[94276.254318] ...
[94276.255482] Call Trace:
[94276.256631] [<ffffffff808601c8>] show_stack+0x90/0xb0
[94276.260383] [<ffffffff80aa9234>] dump_stack+0x84/0xc0
[94276.264134] [<ffffffff8087eb08>] __warn+0x100/0x118
[94276.267712] [<ffffffff809066e8>] __alloc_pages_nodemask+0x2f8/0xca8
[94276.272681] [<ffffffff80922e54>] kmalloc_order+0x14/0x80
[94276.276728] [<ffffffffc05c7250>] mod_init+0x250/0x3b4 [wireguard]
[94276.281535] [<ffffffff80800610>] do_one_initcall+0x40/0x140
[94276.285809] [<ffffffff808fb2ac>] do_init_module+0x64/0x1b4
[94276.289995] [<ffffffff808eaa4c>] load_module+0x1dcc/0x2090
[94276.294177] [<ffffffff808eafc4>] SyS_finit_module+0xcc/0xf0
[94276.298449] [<ffffffff8086deec>] syscall_common+0x18/0x3c
[94276.302616] ---[ end trace 3be245c725359407 ]---
[94276.305945] wireguard: Allocation failed at size 67108864
[94276.310101] wireguard: WireGuard 0.0.20190227 loaded. See www.wireguard.com for information.
[94276.317258] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
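The size and index pairs printed in the log above follow a simple pattern, which the sketch below reproduces in userspace. It assumes a lowest index of 5, taken from the log (sizes up to 32 all report index 5), and it deliberately ignores the special cases for 96-byte and 192-byte caches that the kernel's own kmalloc_index() has. kmalloc_index_sketch() and SKETCH_SHIFT_LOW are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SHIFT_LOW 5 /* assumed from the log, not read from kernel headers */

/*
 * Smallest index i >= SKETCH_SHIFT_LOW such that (1 << i) >= size.
 * Simplified on purpose: no 96/192-byte special cases, and size 0 is
 * rejected rather than mapped to index 0.
 */
static int kmalloc_index_sketch(size_t size)
{
    const int max_shift = (int)(sizeof(size_t) * 8) - 1;
    int i = SKETCH_SHIFT_LOW;

    assert(size > 0);
    while (i < max_shift && ((size_t)1 << i) < size)
        i++;
    return i;
}
```

For the power-of-two sizes used by the test loop this gives 5 for sizes 1 through 32, 6 for 64, 15 for 32768 and 26 for 67108864, matching the log. The log also shows that every allocation succeeded until the 64 MiB request, where the page allocator warned and kmalloc returned NULL.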
Has there been any progress on this? I am happy to test packages (assuming no risk of bricking my ER-6P; it has a serial port on it but I'm not sure how far I can screw things up), and if someone can point me in the right direction for setting up a compile toolchain I will gladly assist with this too.
@phillipmcmahon The patch in this comment (my modification of this one for v2.0) seems to fix the kernel panics, but I don't think a proper root cause has been found.
If you're willing to rebuild your kernel you could test out Jason's debugging patch here, but custom kernels are riskier than just testing the wireguard module, and I'm not sure what the exact recovery procedure would be.
If all goes well I should shortly have possession of an ERL. Any specific firmware I should be using?
> If all goes well I should shortly have possession of an ERL. Any specific firmware I should be using?
I don't know if the Ubiquiti folks would share the adoption numbers (if they have them) but my gut feeling is that with all the issues of the v2.0 firmware most folks are still running v1.10.x on their production setups and therefore would be a good starting point to focus on.
+1 for focusing on fw v1.10.x. v2.0.0 makes grown men cry. v2.0.1 not in sight yet.
I agree 1.10.x is probably more common and thus a good starting point, but @phillipmcmahon @dlpwx what's so bad about 2.0.0? I've been running since it came out and it's been totally solid.
> I agree 1.10.x is probably more common and thus a good starting point, but @phillipmcmahon @dlpwx what's so bad about 2.0.0? I've been running since it came out and it's been totally solid.
Terrible issues on the ER-X series: hardware reboots, hwnat-ing not working, igmp-proxy not working, to name the issues I have had with my particular setup. I just needed a working setup so went back to v1.10.x
Then following the release thread on the forum it seems, at least by volume, to be the most problematic release in recent history for many many folks. Bricked units, partially working configs etc.
It seemed to leave beta whilst users were still reporting serious issues, not sure of what pressures they were experiencing to suddenly make it live as they did. Interesting it has also not received even a point update so far. I will wait until the forum gods announce this is good for daily use before I go back to it.
Is there any progress on this, happy to help/test etc. as needed.
Ping. Offering to help, things seem to have gone very quiet.
Quiet, yes, but not forgotten. Lots of unexpected travel precluding my access to the hardware right now. I'd suggest @Lochnair apply the workaround I posted above to his builds until I'm back home and can figure out what UBNT is doing to their kernels.
Appreciate the response, and also to know at some point things will pick up again. There has been another release of WireGuard in the meantime, v0.0.20190406.
Indeed. I'm the one who made that release :)
I don't expect it will fix the kmalloc problem, though.
haha, my bad. I should know whom I am talking with next time :)
I can confirm 0406 still crashes without the patch
I am willing to test on ER-4 EDGEOS FW 2.0.1 if deb packages go back up again.
Fingers crossed, installing now on my 6P...
Update: Installed, rebooted and it all came back up and within these first few minutes it looks ok. My WireGuard client connected without issue and traffic is-a-flowing. I will keep hammering it this evening and see if something "bad" happens.
Early to call it, but thanks a lot.
Several GB have passed through the multiple WG interfaces I have installed on my 6P. All looks pretty solid. No issues noted as of yet.
Thanks for the build! The 2.0 package seems sane on my ER4 v2.0.1
Allen,
Can you share the relevant portions of your config? I can get the tunnel to activate but the routes aren't getting pushed. This is my first time setting up WireGuard; see config below. Thanks in advance.
Client config:
[Interface]
PrivateKey =
[Peer]
PublicKey =
ER4 config:
name WAN_LOCAL {
    default-action drop
    description "WAN to router"
    rule 31 {
        action accept
        description wireguard
        destination {
            port 53922
        }
        log enable
        protocol udp
        state {
            established enable
            invalid disable
            new enable
            related enable
        }
wireguard wg0 {
address 192.168.10.1/24
listen-port 53922
peer
nat {
    rule 5011 {
        description "Masquerade for wg0"
        outbound-interface wg0
        protocol all
        type masquerade
    }
On Mon, Apr 15, 2019 at 6:25 PM Allen Wild notifications@github.com wrote:
> Thanks for the build! The 2.0 package seems sane on my ER4 v2.0.1
Corey -- try your configuration for the peer without the ipv6 default network. I've had a problem with this the last few versions and have had to use a script to add it after the link is up using the wg command directly. For some reason on the ER's if the ::/0 (or 0::/0) is present in the saved config it doesn't work.
Hello, I have been using these releases with great success for months. I installed the 2/27 build yesterday and upon restart I receive a kernel panic with the updated version. I reset the device to defaults and installed again and received the same issue. I downgraded to the previous release and everything works as expected. Anyone got any pointers to help troubleshoot? Is it just a bad build for the USG3P? Any advice or assistance would be appreciated!