Kernel Panic on 2/27 build with USG

paulg1981 commented 5 years ago

Hello, I have been using these releases with great success for months. I installed the 2/27 build yesterday and upon restart I receive a kernel panic with the updated version. I reset the device to defaults and installed again and received the same issue. I downgraded to the previous release and everything works as expected. Anyone got any pointers to help troubleshoot? Is it just a bad build for the USG3P? Any advice or assistance would be appreciated!

Dr-Escher commented 5 years ago

Same issue after upgrading to the latest release. The device has been stuck in a reboot loop with occasional ping responses in between.

Package: wireguard-e50-0.0.20190227-1 Device: ER-X-SFP Firmware: EdgeOS v1.10.9.5166958.190213.1952

phillipmcmahon commented 5 years ago

Same issue for me on an ER-6P. I upgraded remotely and now the unit is down, with no Internet at the site. Once I get serial access I can post more info.

What testing is done on these packages prior to being released?

NimlothPL commented 5 years ago

Mar  2 14:33:13 USG3P kernel: CPU 1 Unable to handle kernel paging request at virtual address 0000000000000000, epc == ffffffffc012ced8, ra == ffffffffc0b9314c
Mar  2 14:33:13 USG3P kernel: Oops[#1]:
Mar  2 14:33:13 USG3P kernel: CPU: 1 PID: 4103 Comm: ip Tainted: P           O 3.10.107-UBNT #1
Mar  2 14:33:13 USG3P kernel: task: 800000041c20e0e0 ti: 800000000c030000 task.ti: 800000000c030000
Mar  2 14:33:13 USG3P kernel: $ 0   : 0000000000000000 0000000000000004 ffffffffc0660000 ffffffffc050b3e8
Mar  2 14:33:13 USG3P kernel: $ 4   : 0000000000000001 00000000000012d0 ffffffffc0b9314c 800000000c033670
Mar  2 14:33:13 USG3P kernel: $ 8   : ffffffffffffff9d 800000041d296cc0 ffffffffc050b3e8 000000001a5f4728
Mar  2 14:33:13 USG3P kernel: $12   : 0000000000000008 ffffffffc025c878 ffffffffd76c0898 0000000000000000
Mar  2 14:33:13 USG3P kernel: $16   : 800000041d296000 0000000000000000 00000000000012d0 ffffffffc0531980
Mar  2 14:33:13 USG3P kernel: $20   : 800000041db09e10 800000041d296000 0000000000000000 ffffffffc080a380
Mar  2 14:33:13 USG3P kernel: $24   : 0000000005733924 0000000027f2031c
Mar  2 14:33:13 USG3P kernel: $28   : 800000000c030000 800000000c033710 800000000c033780 ffffffffc0b9314c
Mar  2 14:33:13 USG3P kernel: Hi    : 0000000000000000
Mar  2 14:33:13 USG3P kernel: Lo    : 1dcbc89e99000000
Mar  2 14:33:13 USG3P kernel: epc   : ffffffffc012ced8 kmem_cache_alloc+0x30/0x150
Mar  2 14:33:13 USG3P kernel:    Tainted: P           O
Mar  2 14:33:13 USG3P kernel: ra    : ffffffffc0b9314c wg_pubkey_hashtable_alloc+0x1c/0xd8 [wireguard]
Mar  2 14:33:13 USG3P kernel: Status: 10008ce3  KX SX UX KERNEL EXL IE
Mar  2 14:33:13 USG3P kernel: Cause : 00800008
Mar  2 14:33:13 USG3P kernel: BadVA : 0000000000000000
Mar  2 14:33:13 USG3P kernel: PrId  : 000d0601 (Cavium Octeon+)
Mar  2 14:33:13 USG3P kernel: Modules linked in: wireguard(O) ip_tunnel xt_mark xt_nat 8021q garp stp llc ipt_MASQUERADE xt_set nf_conntrack_ipv6 nf_defrag_ipv6 xt_comment xt_conntrack ip_set_bitmap_port xt_TCPMSS xt_tcpudp ip6table_mangle ip6table_filter ip6table_raw ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_mangle xt_CT iptable_raw nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_h323 nf_conntrack_h323 nf_nat_proto_gre nf_nat_tftp nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables ip_set_hash_net ip_set nfnetlink configfs unifigpio(PO) unifihal(PO) cvm_ipsec_kame(O) ipv6 imq cavium_ip_offload(PO) ubnt_nf_app(PO) tdts(PO) octeon_rng rng_core octeon_ethernet mdio_octeon ethernet_mem octeon_common of_mdio ubnt_platform(PO) libphy [last unloaded: nf_conntrack_sip]
Mar  2 14:33:13 USG3P kernel: Process ip (pid: 4103, threadinfo=800000000c030000, task=800000041c20e0e0, tls=0000000077a5b490)
Mar  2 14:33:13 USG3P kernel: Stack : 800000041d296000 800000041d296680 800000000c033780 ffffffffc0b9314c
Mar  2 14:33:13 USG3P kernel:     800000041d296000 ffffffffc0b8d01c 800000041db09e00 800000041db09e00
Mar  2 14:33:13 USG3P kernel:     ffffffffc0531980 800000000c033780 ffffffffc0531980 ffffffffc0346a5c
Mar  2 14:33:13 USG3P kernel:     800000000c033780 ffffffffc0346768 0000000000000000 0000000000000000
Mar  2 14:33:13 USG3P kernel:     0000000000000000 800000041db09e20 0000000000000000 0000000000000000
Mar  2 14:33:13 USG3P kernel:     0000000000000000 0000000000000000 0000000000000000 0000000000000000
Mar  2 14:33:13 USG3P kernel: last message repeated 2 times
Mar  2 14:33:13 USG3P kernel:     800000041db09e28 0000000000000000 0000000000000000 0000000000000000
Mar  2 14:33:13 USG3P kernel:     0000000000000000 0000000000000000 0000000000000000 0000000000000000
Mar  2 14:33:13 USG3P kernel:     ...
Mar  2 14:33:13 USG3P kernel: Call Trace:
Mar  2 14:33:13 USG3P kernel: [<ffffffffc012ced8>] kmem_cache_alloc+0x30/0x150
Mar  2 14:33:13 USG3P kernel: [<ffffffffc0b9314c>] wg_pubkey_hashtable_alloc+0x1c/0xd8 [wireguard]
Mar  2 14:33:13 USG3P kernel: [<ffffffffc0b8d01c>] wg_newlink+0xac/0x3c8 [wireguard]
Mar  2 14:33:13 USG3P kernel: [<ffffffffc0346a5c>] rtnl_newlink+0x434/0x538
Mar  2 14:33:13 USG3P kernel:
Mar  2 14:33:13 USG3P kernel:
Mar  2 14:33:13 USG3P kernel: Code: 0080882d  ffb00000  9f840020 <de220000> 000420f8  0064202d  dc840000  0044382d  dcec0008
Mar  2 14:33:13 USG3P kernel: ---[ end trace 0588e2b9fdef1fd0 ]---

phillipmcmahon commented 5 years ago

Seems to be quite an issue. Maybe pull this release until more is known about why this is happening.

Lochnair commented 5 years ago

@phillipmcmahon Agreed. I've pulled the 1.10 packages for now. As for testing before release - most of the time, there is none, as I don't really have equipment to test on.

@NimlothPL Thanks for the stacktrace. Seems related to this commit. I'll ask Jason about it.
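
For anyone else posting or comparing traces, the call chain can be pulled out of a saved log with a few lines of shell. This is only a minimal sketch, assuming the oops was saved to a text file; `oops_call_trace` is a hypothetical helper name, not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the symbol names found on "[<address>] symbol+0xOFF/0xLEN"
# lines of a saved kernel oops, one per line, innermost frame first.
# Lines such as "epc : ..." have no "[<...>]" part and are skipped.
oops_call_trace() {
    sed -n 's|.*\[<[0-9a-f]*>] \([A-Za-z0-9_.]*\)+0x[0-9a-f]*/0x[0-9a-f]*.*|\1|p' "$1"
}
```

Run as `oops_call_trace oops.txt` (the file name is an assumption), this lists the frames so two reports can be compared with `diff`.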

evenfowler commented 5 years ago

I was able to fix this on a USG 4 Pro with the help of single user mode.

I connected a serial console cable and then caught the U-Boot console by pressing a key before it continued booting. You should see something like:

U-Boot 2012.04.01 (UBNT Build Version: e221_002_01aa9) (Aug 17 2018 - 01:13:14)

Skipping PCIe port 0 BIST, in EP mode, can't tell if clocked.
Skipping PCIe port 1 BIST, reset not done. (port not configured)
BIST check passed.
UBNT_E220 r1:1, r2:14, serial #: 000000FFFFFF
MPR 13-02102-14
Core clock: 1000 MHz, IO clock: 600 MHz, DDR clock: 533 MHz (1066 Mhz DDR)
Base DRAM address used by u-boot: 0x8f800000, size: 0x800000
DRAM: 2 GiB
Clearing DRAM...... done
Flash: 8 MiB
Net:   octeth0, octeth1, octeth2, octeth3
MMC:   Octeon MMC/SD0: 0
USB:   USB EHCI 1.00
scanning bus for devices... 1 USB Device(s) found
Type the command 'usb start' to scan for USB storage devices.

Hit any key to stop autoboot:  0 
Octeon ubnt_e220#

Once in the U-Boot console I ran printenv to find the bootcmd value.

Octeon ubnt_e220# printenv
autoload=n
baudrate=115200
boardname=ubnt_e220
bootcmd=fatload mmc 0 $(loadaddr) vmlinux.64;bootoctlinux $(loadaddr) numcores=2 endbootargs mem=0 root=/dev/mmcblk0p2 rootdelay=10 rw rootsqimg=squashfs.img rootsqwdir=w mtdparts=phys_mapped_flash:640k(boot0),640k(boot1),64k(eeprom)
bootdelay=0

I copied the value of bootcmd and appended single, which tells the kernel to boot into single-user mode.

The actual command I ran at the U-Boot console was:

fatload mmc 0 $(loadaddr) vmlinux.64;bootoctlinux $(loadaddr) numcores=2 endbootargs mem=0 root=/dev/mmcblk0p2 rootdelay=10 rw rootsqimg=squashfs.img rootsqwdir=w mtdparts=phys_mapped_flash:640k(boot0),640k(boot1),64k(eeprom) single
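
Since the bootcmd value differs between boards, it may be safer to derive the command from your own printenv output than to copy the one above. A minimal sketch, assuming the printenv output was captured from the serial console into a file on your workstation; `single_user_bootcmd` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read captured U-Boot "printenv" output from file $1 and
# print the bootcmd value with " single" appended, ready to paste at the
# U-Boot prompt. Serial captures often contain carriage returns, so strip them.
single_user_bootcmd() {
    tr -d '\r' < "$1" | sed -n 's/^bootcmd=\(.*\)$/\1 single/p' | head -n 1
}
```

The helper only edits text on the workstation. Nothing is written to the U-Boot environment, so the change lasts for one boot when the printed command is run by hand.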

Once in single user mode I uninstalled the deb package using dpkg and then rebooted.

dpkg --remove wireguard
shutdown -r now

If you're on a Unifi-enabled board, you'll get provisioning errors when the Unifi controller tries to commit a config that specifies a WireGuard interface (assuming you persisted the WireGuard config using a config.gateway.json file on the controller). Simply ignore that, then install the working version and let the controller re-provision the device now that it knows what a wireguard interface type is.

zx2c4 commented 5 years ago

Thanks for the report. I'll look into it.

phillipmcmahon commented 5 years ago

I'm happy to test basic install, reboot, and simple functionality on the hardware I have: an ER-X-SFP and an ER-6P, both running the 1.10 firmware branch.

zx2c4 commented 5 years ago

If you've got a working toolchain, would you try building with this patch and let me know if that "fixes" it?

diff --git a/src/compat/compat.h b/src/compat/compat.h
index 7a61e4c1..7c2d5125 100644
--- a/src/compat/compat.h
+++ b/src/compat/compat.h
@@ -466,11 +466,13 @@ static inline void *kvmalloc_ours(size_t size, gfp_t flags)
 {
    gfp_t kmalloc_flags = flags;
    void *ret;
+#ifndef CONFIG_CAVIUM_OCTEON_IPFWD_OFFLOAD
    if (size > PAGE_SIZE) {
        kmalloc_flags |= __GFP_NOWARN;
        if (!(kmalloc_flags & __GFP_REPEAT) || (size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
            kmalloc_flags |= __GFP_NORETRY;
    }
+#endif
    ret = kmalloc(size, kmalloc_flags);
    if (ret || size <= PAGE_SIZE)
        return ret;

aswild commented 5 years ago

Same issue on my ER-4 with FW v2.0.0. I ran make deb-e300 from commit 2877098c743eb5ca74ded644a108f592728c2876 of the v2.0 branch. Had to use the reset button and restore a backup.

[**    ] A start job is running for UBNT Routing Daemons (57s / no limit)CPU 2 Unable to handle kernel paging request at virtual address 0000000400000000, epc == ffffffff80956b74, ra == 8
Oops[#1]:
CPU: 2 PID: 3995 Comm: ip Tainted: P           O    4.9.79-UBNT #1
task: 800000004d322700 task.stack: 800000004421c000
$ 0   : 0000000000000000 0000000000000000 ffffffff80f70000 ffffffff80def658
$ 4   : 0000000400000000 0000000000000002 0000000000000000 ffffffffc056bd48
$ 8   : 000000006239a4de ffffffff80def658 da451be76a5f3a20 a7fdf6cb8743060e
$12   : 0000000000000000 ffffffff80ab969c 0000000028bcd81f 800000004d01bda8
$16   : 0000000400000000 ffffffff808c0000 00000000024012c0 0000000000000001
$20   : 800000004d01b780 ffffffffc0570000 ffffffff80e1eb00 ffffffffc0581e90
$24   : 000000001215c592 ffffffffd8a70a1c
$28   : 800000004421c000 800000004421f7a0 800000004421f830 ffffffffc056bd48
Hi    : 0000000000000006
Lo    : ccccccccccccccd7
epc   : ffffffff80956b74 kmem_cache_alloc+0x34/0x160
ra    : ffffffffc056bd48 wg_pubkey_hashtable_alloc+0x28/0xe8 [wireguard]
Status: 10009ce3        KX SX UX KERNEL EXL IE
Cause : 00800008 (ExcCode 02)
BadVA : 0000000400000000
PrId  : 000d9602 (Cavium Octeon III)
Modules linked in: wireguard(O) ip6_udp_tunnel udp_tunnel 8021q garp stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_NETMAP xt_set nf_log_ipv4 ipt_REJECT nf_reject_ipv4 nf_log_ipv6 nf_l6
Process ip (pid: 3995, threadinfo=800000004421c000, task=800000004d322700, tls=00000000770cb490)
Stack : ffffffff80956b40 ffffffff808c0000 ffffffff808bbb10 ffffffffc056bd48
        800000004d01b000 ffffffffc056572c 0000000000000003 800000004d01b000
        ffffffff80e1eb00 8000000047cf0000 0000000000000000 800000004421f830
        0000000000000000 ffffffff80c1223c 0000000000000000 0000000000000000
        8000000047cf0000 ffffffff80c11d3c 0000000000000000 0000000000000000
        0000000000000000 8000000047cf0020 0000000000000000 0000000000000000
        0000000000000000 0000000000000000 0000000000000000 0000000000000000
        0000000000000000 0000000000000000 0000000000000000 0000000000000000
        0000000000000000 0000000000000000 0000000000000000 0000000000000000
        8000000047cf0028 0000000000000000 0000000000000000 0000000000000000
        ...
Call Trace:
[<ffffffff80956b74>] kmem_cache_alloc+0x34/0x160
[<ffffffffc056bd48>] wg_pubkey_hashtable_alloc+0x28/0xe8 [wireguard]
[<ffffffffc056572c>] wg_newlink+0xdc/0x3e0 [wireguard]
[<ffffffff80c1223c>] rtnl_newlink+0x674/0x750
Code: 00a0902d  0060482d  9f850018 <de020000> 000528f8  7c652a0a  64420008  7c45620a  9f880018

---[ end trace d08fbf877d376bec ]---
Kernel panic - not syncing: Fatal exception
Rebooting in 60 seconds..
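
The two traces in this thread fault at different addresses (BadVA 0000000000000000 on the USG3P, 0000000400000000 on the ER-4), so it can help to extract that field when comparing reports. A minimal sketch, assuming the oops text was saved to a file; `oops_badva` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the faulting address from the "BadVA : <hex>" line
# of a saved MIPS kernel oops. Works with or without a syslog prefix.
oops_badva() {
    sed -n 's/.*BadVA *: *\([0-9a-f][0-9a-f]*\).*/\1/p' "$1" | head -n 1
}
```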

aswild commented 5 years ago

@zx2c4 I tried your patch but it didn't help on my ER-4 (v2.0.0, kernel 4.9.79).

I changed your #ifndef to #if !defined(CONFIG_CAVIUM_OCTEON_IPFWD_OFFLOAD) && !defined(CONFIG_CAVIUM_IPFWD_OFFLOAD) since it looks like the config name changed in the new kernel (verified with #error that the block wasn't compiled in), but still the same panic when I create a wireguard device.
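
A quick way to see which spelling of the symbol a given kernel uses, before rebuilding anything, is to grep the kernel's `.config`. This is a minimal sketch, assuming you have the `.config` from the kernel tree you build against; `cavium_offload_symbols` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print whichever of the two offload symbols discussed in
# this thread is enabled (=y or =m) in the kernel .config given as $1.
# Lines like "# CONFIG_... is not set" do not match and are ignored.
cavium_offload_symbols() {
    grep -E '^CONFIG_CAVIUM(_OCTEON)?_IPFWD_OFFLOAD=[ym]$' "$1" | cut -d= -f1
}
```

An empty result means neither symbol is enabled in that config, in which case the `#if` block in the patch would be compiled in.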

aswild commented 5 years ago

@Lochnair wireguard-v2.0-e300-0.0.20190227-1.deb from the 0.0.20190227 github release panics for me, you may want to pull the v2.0 binaries too.

zx2c4 commented 5 years ago

Alright let's take it a step further then and use an entirely different allocator and see if that makes the problem go away. Then at least we'll have some idea of what we're looking at:

diff --git a/src/compat/compat.h b/src/compat/compat.h
index 7a61e4c1..cbf9427a 100644
--- a/src/compat/compat.h
+++ b/src/compat/compat.h
@@ -464,6 +464,7 @@ static inline __be32 our_inet_confirm_addr(struct net *net, struct in_device *in
 #include <linux/slab.h>
 static inline void *kvmalloc_ours(size_t size, gfp_t flags)
 {
+#ifndef CONFIG_CAVIUM_OCTEON_IPFWD_OFFLOAD
    gfp_t kmalloc_flags = flags;
    void *ret;
    if (size > PAGE_SIZE) {
@@ -474,6 +475,7 @@ static inline void *kvmalloc_ours(size_t size, gfp_t flags)
    ret = kmalloc(size, kmalloc_flags);
    if (ret || size <= PAGE_SIZE)
        return ret;
+#endif
    return __vmalloc(size, flags, PAGE_KERNEL);
 }
 static inline void *kvzalloc_ours(size_t size, gfp_t flags)

zx2c4 commented 5 years ago

Is this the right firmware for that stacktrace, btw? https://dl.ubnt.com/firmwares/edgemax/v2.0.x/ER-e300.v2.0.0.5155284.tar

aswild commented 5 years ago

@zx2c4 Thanks, it looks like this patch works!

For the 4.9 kernel I changed your patch slightly, since _OCTEON was removed from the config name (and the code was moved from arch/mips/cavium-octeon to drivers/net/ethernet/cavium/octeon):

diff --git a/src/compat/compat.h b/src/compat/compat.h
index 7a61e4c..0131d22 100644
--- a/src/compat/compat.h
+++ b/src/compat/compat.h
@@ -464,6 +464,7 @@ static inline __be32 our_inet_confirm_addr(struct net *net, struct in_device *in
 #include <linux/slab.h>
 static inline void *kvmalloc_ours(size_t size, gfp_t flags)
 {
+#if !defined(CONFIG_CAVIUM_OCTEON_IPFWD_OFFLOAD) && !defined(CONFIG_CAVIUM_IPFWD_OFFLOAD)
        gfp_t kmalloc_flags = flags;
        void *ret;
        if (size > PAGE_SIZE) {
@@ -474,6 +475,7 @@ static inline void *kvmalloc_ours(size_t size, gfp_t flags)
        ret = kmalloc(size, kmalloc_flags);
        if (ret || size <= PAGE_SIZE)
                return ret;
+#endif
        return __vmalloc(size, flags, PAGE_KERNEL);
 }
 static inline void *kvzalloc_ours(size_t size, gfp_t flags)

Yes, that's the right firmware for my stacktrace (but @NimlothPL's earlier in the thread is for a different firmware/kernel/hardware).

Ubiquiti still hasn't updated their downloads page for v2.0, nor provided a final GPL archive, so I'm building with kernel source from the v2.0.0/master branch of @Lochnair's kernel_e300 repo (based on ubnt's 2.0.0-beta2 GPL release).

zx2c4 commented 5 years ago

Do you need CONFIG_CAVIUM_IPFWD_OFFLOAD specified in the other part of compat.h where we special case weird offloading logic?

aswild commented 5 years ago

I didn't touch that part of compat.h when building, but it looks like CONFIG_CAVIUM_IPFWD_OFFLOAD should be included there too. (all I've tested so far is simple pings that probably don't touch the offload engine)

In skbuff.h, struct cvm_packet_info cvm_info; is added to sk_buff under #ifdef CONFIG_CAVIUM_NET_PACKET_FWD_OFFLOAD.

zx2c4 commented 5 years ago

> I didn't touch that part of compat.h when building, but it looks like CONFIG_CAVIUM_IPFWD_OFFLOAD should be included there too. (all I've tested so far is simple pings that probably don't touch the offload engine)

Before I add it, I'd be very grateful if you could do some comparison to show that it's the right thing to do.

Also, with regards to the real bug here, we now know there's something gravely wrong with the slab allocator (kmalloc_caches[15] is an invalid pointer), but we don't know why or how to mitigate that. Think you could send me the output of cat /proc/slabinfo?

zx2c4 commented 5 years ago

> For the 4.9 kernel I changed your patch slightly

Woah woah are you saying that this bug is present on their 4.9 kernel too? Not just their 3.10? Or did you not actually try to trigger it on the 4.9 yet?

aswild commented 5 years ago

> Before I add it, I'd be very grateful if you could do some comparison to show that it's the right thing to do.

Checking that now and doing some iperf3 benchmarking.

> are you saying that this bug is present on their 4.9 kernel too?

Yep, all of my building/testing today has been on the 4.9 kernel. I don't have 3.10 running on anything (and it'd probably be tricky to downgrade).

zx2c4 commented 5 years ago

Gotcha, thanks for clarifying. I've been looking at the wrong kernel sources! Awaiting cat /proc/slabinfo when you have a chance.

aswild commented 5 years ago

Here's /proc/slabinfo. wireguard is loaded and configured with only the allocator change made to compat.h (not skb_scrub_packet).

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nf_conntrack_expect      0      0    224   18    1 : tunables    0    0    0 : slabdata      0      0      0
nf_conntrack         156    315    384   21    2 : tunables    0    0    0 : slabdata     15     15      0
ip6-frags              0      0    200   20    1 : tunables    0    0    0 : slabdata      0      0      0
tw_sock_TCPv6         16     16    248   16    1 : tunables    0    0    0 : slabdata      1      1      0
request_sock_TCPv6      0      0    304   26    2 : tunables    0    0    0 : slabdata      0      0      0
TCPv6                 64     64   2048   16    8 : tunables    0    0    0 : slabdata      4      4      0
cfq_queue             68     68    240   17    1 : tunables    0    0    0 : slabdata      4      4      0
mqueue_inode_cache     18     18    896   18    4 : tunables    0    0    0 : slabdata      1      1      0
fat_inode_cache        0      0    656   24    4 : tunables    0    0    0 : slabdata      0      0      0
fat_cache              0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
squashfs_inode_cache   2925   2925    640   25    4 : tunables    0    0    0 : slabdata    117    117      0
jbd2_transaction_s     64     64    256   16    1 : tunables    0    0    0 : slabdata      4      4      0
jbd2_journal_handle    340    340     48   85    1 : tunables    0    0    0 : slabdata      4      4      0
jbd2_journal_head    340    340    120   34    1 : tunables    0    0    0 : slabdata     10     10      0
jbd2_revoke_table_s    256    256     16  256    1 : tunables    0    0    0 : slabdata      1      1      0
jbd2_revoke_record_s      0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache       0      0    712   23    4 : tunables    0    0    0 : slabdata      0      0      0
ext4_inode_cache     306    306    936   17    4 : tunables    0    0    0 : slabdata     18     18      0
ext4_allocation_context    128    128    128   32    1 : tunables    0    0    0 : slabdata      4      4      0
ext4_system_zone     102    102     40  102    1 : tunables    0    0    0 : slabdata      1      1      0
ext4_io_end          384    384     64   64    1 : tunables    0    0    0 : slabdata      6      6      0
ext4_extent_status    510    510     40  102    1 : tunables    0    0    0 : slabdata      5      5      0
mbcache                0      0     56   73    1 : tunables    0    0    0 : slabdata      0      0      0
dio                    0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
posix_timers_cache     18     18    216   18    1 : tunables    0    0    0 : slabdata      1      1      0
UNIX                 224    224   1152   28    8 : tunables    0    0    0 : slabdata      8      8      0
ip4-frags             44     44    184   22    1 : tunables    0    0    0 : slabdata      2      2      0
flow_cache           144    144    112   36    1 : tunables    0    0    0 : slabdata      4      4      0
tw_sock_TCP           64     64    248   16    1 : tunables    0    0    0 : slabdata      4      4      0
request_sock_TCP     104    104    304   26    2 : tunables    0    0    0 : slabdata      4      4      0
TCP                   68     68   1920   17    8 : tunables    0    0    0 : slabdata      4      4      0
hugetlbfs_inode_cache     29     29    552   29    4 : tunables    0    0    0 : slabdata      1      1      0
eventpoll_pwq        280    280     72   56    1 : tunables    0    0    0 : slabdata      5      5      0
inotify_inode_mark    184    184     88   46    1 : tunables    0    0    0 : slabdata      4      4      0
request_queue         17     17   1848   17    8 : tunables    0    0    0 : slabdata      1      1      0
blkdev_requests      552    552    344   23    2 : tunables    0    0    0 : slabdata     24     24      0
blkdev_ioc           156    156    104   39    1 : tunables    0    0    0 : slabdata      4      4      0
sock_inode_cache     300    300    640   25    4 : tunables    0    0    0 : slabdata     12     12      0
file_lock_cache       76     76    208   19    1 : tunables    0    0    0 : slabdata      4      4      0
net_namespace          0      0   5632    5    8 : tunables    0    0    0 : slabdata      0      0      0
shmem_inode_cache   2025   2025    640   25    4 : tunables    0    0    0 : slabdata     81     81      0
proc_inode_cache    1695   1728    592   27    4 : tunables    0    0    0 : slabdata     64     64      0
sigqueue             100    100    160   25    1 : tunables    0    0    0 : slabdata      4      4      0
bdev_cache            84     84    768   21    4 : tunables    0    0    0 : slabdata      4      4      0
kernfs_node_cache  10132  10132    120   34    1 : tunables    0    0    0 : slabdata    298    298      0
mnt_cache            210    210    384   21    2 : tunables    0    0    0 : slabdata     10     10      0
inode_cache         4857   5490    536   30    4 : tunables    0    0    0 : slabdata    183    183      0
dentry             23463  24696    192   21    1 : tunables    0    0    0 : slabdata   1176   1176      0
iint_cache             0      0     80   51    1 : tunables    0    0    0 : slabdata      0      0      0
buffer_head        31356  31356    104   39    1 : tunables    0    0    0 : slabdata    804    804      0
nsproxy              292    292     56   73    1 : tunables    0    0    0 : slabdata      4      4      0
files_cache          105    105    768   21    4 : tunables    0    0    0 : slabdata      5      5      0
signal_cache         396    396    896   18    4 : tunables    0    0    0 : slabdata     22     22      0
sighand_cache        153    161   4224    7    8 : tunables    0    0    0 : slabdata     23     23      0
task_struct          232    243   3328    9    8 : tunables    0    0    0 : slabdata     27     27      0
anon_vma            4736   4736     64   64    1 : tunables    0    0    0 : slabdata     74     74      0
shared_policy_node    340    340     48   85    1 : tunables    0    0    0 : slabdata      4      4      0
numa_policy          170    170     24  170    1 : tunables    0    0    0 : slabdata      1      1      0
radix_tree_node     1708   1708    584   28    4 : tunables    0    0    0 : slabdata     61     61      0
idr_layer_cache      255    255   2096   15    8 : tunables    0    0    0 : slabdata     17     17      0
kmalloc-8192          80     80   8192    4    8 : tunables    0    0    0 : slabdata     20     20      0
kmalloc-4096        1354   1808   4096    8    8 : tunables    0    0    0 : slabdata    226    226      0
kmalloc-2048         306    320   2048   16    8 : tunables    0    0    0 : slabdata     20     20      0
kmalloc-1024        1605   1664   1024   16    4 : tunables    0    0    0 : slabdata    104    104      0
kmalloc-512         3051   3552    512   16    2 : tunables    0    0    0 : slabdata    222    222      0
kmalloc-256         1738   1984    256   16    1 : tunables    0    0    0 : slabdata    124    124      0
kmalloc-192         5985   5985    192   21    1 : tunables    0    0    0 : slabdata    285    285      0
kmalloc-128        15360  15552    128   32    1 : tunables    0    0    0 : slabdata    486    486      0
kmalloc-96          7350   7350     96   42    1 : tunables    0    0    0 : slabdata    175    175      0
kmalloc-64         18221  20032     64   64    1 : tunables    0    0    0 : slabdata    313    313      0
kmalloc-32          1664   1664     32  128    1 : tunables    0    0    0 : slabdata     13     13      0
kmalloc-16          2304   2304     16  256    1 : tunables    0    0    0 : slabdata      9      9      0
kmalloc-8           6144   6144      8  512    1 : tunables    0    0    0 : slabdata     12     12      0
kmem_cache_node      128    128     64   64    1 : tunables    0    0    0 : slabdata      2      2      0
kmem_cache            80     80    256   16    1 : tunables    0    0    0 : slabdata      5      5      0
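
To compare the general-purpose kmalloc caches between devices without reading the whole table, the listing can be filtered with awk. A minimal sketch, assuming the output of `cat /proc/slabinfo` was saved to a file; `largest_kmalloc_cache` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: from a saved /proc/slabinfo (file $1), print the name
# and object size of the largest "kmalloc-<N>" cache.
# Field 1 is the cache name; field 4 is <objsize> in bytes.
largest_kmalloc_cache() {
    awk '$1 ~ /^kmalloc-[0-9]+$/ && $4 + 0 > max { max = $4 + 0; name = $1 }
         END { if (name != "") print name, max }' "$1"
}
```

For the listing above, the largest entry is kmalloc-8192. That says nothing by itself about why kmalloc_caches[15] is invalid; it only makes the set of caches that actually exist easy to compare across firmware versions.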

aswild commented 5 years ago

Rebuilt wireguard with skb_scrub_packet patched for CONFIG_CAVIUM_IPFWD_OFFLOAD and it works too.

iperf3 might be slightly faster with the skb_scrub_packet patch when terminating wireguard on the ER-4 and then forwarding to a LAN host, but it was pretty close.

zx2c4 commented 5 years ago

This is a bit of a frustrating situation as I don't have things set up to keep trying stuff, so it's quite hard to debug, and the octeon kernel won't build for qemu. If you've got a lot of patience, there are a million things I'm curious about in trying to track this bug down. For example:

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 622f6b6ae..29861409a 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -980,6 +980,7 @@ static void __init new_kmalloc_cache(int idx, unsigned long flags)
 {
    kmalloc_caches[idx] = create_kmalloc_cache(kmalloc_info[idx].name,
                    kmalloc_info[idx].size, flags);
+   pr_err("SARU making cache %d is 0x%llx called %s size %lu flags 0x%x\n", idx, kmalloc_caches[idx], kmalloc_info[idx].name, kmalloc_info[idx].size, flags);
 }

 /*
@@ -992,6 +993,7 @@ void __init create_kmalloc_caches(unsigned long flags)
    int i;

    for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
+       pr_err("SARU iteration %d, pre-state: 0x%llx\n", i, kmalloc_caches[i]);
        if (!kmalloc_caches[i])
            new_kmalloc_cache(i, flags);

Got IRC or something? Might be easier to work through it there, if you're up for that.

aswild commented 5 years ago

I can dig up an IRC client, but I'm not super comfortable testing out kernel patches. When I soft-bricked at first, I wasn't able to break into a bootloader shell and don't know what would happen if I got stuck with an unbootable kernel.

Happy to test out wireguard patches as long as my roommate's not using the internet.

P.S. I sympathize with the struggle of debugging without hardware, and really appreciate your help on this issue!

zx2c4 commented 5 years ago

Okay what if you patch wireguard with the below and see at which point it crashes (i.e. send me the whole dmesg output):

diff --git a/src/main.c b/src/main.c
index 4b5b58e8..cda15a94 100644
--- a/src/main.c
+++ b/src/main.c
@@ -20,8 +20,20 @@

 static int __init mod_init(void)
 {
+   unsigned long i;
+   void *ohnose;
    int ret;

+   for (i = 0; i < ilog2(0x100000000); ++i) {
+       pr_err("About to allocate size %lu, index %d", 1UL << i, kmalloc_index(1UL << i));
+       ohnose = kmalloc(1UL << i, GFP_KERNEL);
+       if (!ohnose) {
+           pr_err("Allocation failed at size %lu\n", 1UL << i);
+           break;
+       }
+       kfree(ohnose);
+   }
+
    if ((ret = chacha20_mod_init()) || (ret = poly1305_mod_init()) ||
        (ret = chacha20poly1305_mod_init()) || (ret = blake2s_mod_init()) ||
        (ret = curve25519_mod_init()))
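
For reference, the loop bound in this patch is ilog2(0x100000000), which is 32, so the loop tries every power-of-two size from 1 byte up to 2^31 bytes unless an allocation fails first. A minimal shell sketch of the same sequence; `probe_sizes` is a hypothetical name used only for illustration, and the script does no allocation at all.

```shell
#!/bin/sh
# Hypothetical illustration: print the sizes (in bytes) that the test loop in
# the patch would pass to kmalloc(), i.e. 1 << i for i = 0 .. 31.
probe_sizes() {
    i=0
    while [ "$i" -lt 32 ]; do
        echo $((1 << i))
        i=$((i + 1))
    done
}
```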

aswild commented 5 years ago

Sure, I can try that out (as soon as I can find a reasonable maintenance window). One issue is that systemd seems to capture most of the kernel output once it starts so the prints before the panic might get dropped. I'll play around with printk levels to see if I can make them hit the console unconditionally.

zx2c4 commented 5 years ago

Those are pr_err prints, so they should be somewhat unconditional.
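
Whether a pr_err line reaches the console depends on the console loglevel: a message is printed there only if its level is numerically lower than the first field of /proc/sys/kernel/printk, and pr_err logs at level 3 (KERN_ERR). A minimal sketch of that check; `pr_err_reaches_console` is a hypothetical helper name, and whether this explains the dropped output on EdgeOS v2.0 is not established in this thread.

```shell
#!/bin/sh
# Hypothetical helper: $1 is the contents of /proc/sys/kernel/printk,
# e.g. "4 4 1 7". Succeeds if a KERN_ERR (level 3) message would be shown on
# the console, i.e. if 3 is lower than the current console loglevel.
pr_err_reaches_console() {
    console_level=${1%%[!0-9]*}   # keep only the leading number
    [ 3 -lt "$console_level" ]
}
```

Usage on the device would be something like `pr_err_reaches_console "$(cat /proc/sys/kernel/printk)" && echo visible`. Either way, the messages stay in the kernel ring buffer and can be read back with dmesg if the box survives.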

I wasn't aware edgemax had moved to systemd.

aswild commented 5 years ago

Yeah, EdgeOS v2.0 switched to Debian Stretch with systemd. Here's the output after insmod with the kmalloc patch. Interestingly it didn't panic in this context. I did rmmod wireguard then insmod /tmp/wireguard.ko.

Here's the dmesg output starting after the insmod. Did you want the full log starting at boot?

[94275.974092] wireguard: About to allocate size 1, index 5
[94275.977934] wireguard: About to allocate size 2, index 5
[94275.981942] wireguard: About to allocate size 4, index 5
[94275.985803] wireguard: About to allocate size 8, index 5
[94275.989814] wireguard: About to allocate size 16, index 5
[94275.993733] wireguard: About to allocate size 32, index 5
[94275.997839] wireguard: About to allocate size 64, index 6
[94276.001759] wireguard: About to allocate size 128, index 7
[94276.005948] wireguard: About to allocate size 256, index 8
[94276.009955] wireguard: About to allocate size 512, index 9
[94276.014144] wireguard: About to allocate size 1024, index 10
[94276.018324] wireguard: About to allocate size 2048, index 11
[94276.022679] wireguard: About to allocate size 4096, index 12
[94276.026867] wireguard: About to allocate size 8192, index 13
[94276.031223] wireguard: About to allocate size 16384, index 14
[94276.035506] wireguard: About to allocate size 32768, index 15
[94276.039951] wireguard: About to allocate size 65536, index 16
[94276.044235] wireguard: About to allocate size 131072, index 17
[94276.048768] wireguard: About to allocate size 262144, index 18
[94276.053128] wireguard: About to allocate size 524288, index 19
[94276.057679] wireguard: About to allocate size 1048576, index 20
[94276.062147] wireguard: About to allocate size 2097152, index 21
[94276.066814] wireguard: About to allocate size 4194304, index 22
[94276.071356] wireguard: About to allocate size 8388608, index 23
[94276.076194] wireguard: About to allocate size 16777216, index 24
[94276.081217] wireguard: About to allocate size 33554432, index 25
[94276.087004] wireguard: About to allocate size 67108864, index 26
[94276.091534] ------------[ cut here ]------------
[94276.094880] WARNING: CPU: 0 PID: 19738 at mm/page_alloc.c:3544 __alloc_pages_nodemask+0x2f8/0xca8
[94276.102452] Modules linked in: wireguard(O+) sch_fq_codel sch_htb xt_nat xt_multiport ip6_udp_tunnel udp_tunnel 8021q garp stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_NETMAP xt_set nf_log_ipv4 ipt_REJECT nf_reject_ipv4 nf_log_ipv6 nf_log_common nf_conntrack_ipv6 nf_defrag_ipv6 xt_LOG xt_tcpudp xt_comment xt_conntrack ip_set_bitmap_port ip6table_mangle ip6table_filter ip6table_raw ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_mangle xt_CT iptable_raw nf_nat_h323 nf_conntrack_h323 nf_nat_sip nf_conntrack_sip nf_nat_tftp nf_nat_ftp nf_conntrack_tftp nf_conntrack_ftp ip_set_hash_net ip_set nfnetlink iptable_filter cvm_ipsec_kame(O) imq cavium_ip_offload(O) ubnt_nf_app(O) tdts(PO) octeon_rng rng_core nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre                                                                                                                                             
[94276.172413]  nf_nat nf_conntrack ubnt_platform(PO) ip_tables x_tables ipv6 [last unloaded: wireguard]
[94276.180422] CPU: 0 PID: 19738 Comm: insmod Tainted: P           O    4.9.79-UBNT #1
[94276.186772] Stack : 0000000000000000 0000000000000004 0000000000000006 0000000000000000
[94276.193528]         ffffffff80e00000 ffffffff80f65eb0 ffffffff80f60000 ffffffff80e00000
[94276.200283]         0000000000000000 0000000000000000 0000000000000047 0000000000000000
[94276.207037]         ffffffff80f60000 ffffffff808c07c8 0000000000000004 ffffffff808c18c8
[94276.213791]         0000000000000000 0000000000000000 0000000000000000 ffffffff80f60000
[94276.220545]         ffffffff80d7a468 ffffffff80df3f07 8000000046418d00 ffffffff80f5c300
[94276.227300]         0000000000004d1a 0000000000000000 0000000000100001 ffffffff808fae64
[94276.234054]         ffffffff808e7b20 8000000047cbb860 8000000047cbb978 ffffffff80aa9234
[94276.240809]         0000000000000000 ffffffff808c2000 000000000000000a ffffffff80d7a468
[94276.247563]         0000000000000000 ffffffff808601c8 0000000000000000 0000000000000000
[94276.254318]         ...
[94276.255482] Call Trace:
[94276.256631] [<ffffffff808601c8>] show_stack+0x90/0xb0
[94276.260383] [<ffffffff80aa9234>] dump_stack+0x84/0xc0
[94276.264134] [<ffffffff8087eb08>] __warn+0x100/0x118
[94276.267712] [<ffffffff809066e8>] __alloc_pages_nodemask+0x2f8/0xca8
[94276.272681] [<ffffffff80922e54>] kmalloc_order+0x14/0x80
[94276.276728] [<ffffffffc05c7250>] mod_init+0x250/0x3b4 [wireguard]
[94276.281535] [<ffffffff80800610>] do_one_initcall+0x40/0x140
[94276.285809] [<ffffffff808fb2ac>] do_init_module+0x64/0x1b4
[94276.289995] [<ffffffff808eaa4c>] load_module+0x1dcc/0x2090
[94276.294177] [<ffffffff808eafc4>] SyS_finit_module+0xcc/0xf0
[94276.298449] [<ffffffff8086deec>] syscall_common+0x18/0x3c
[94276.302616] ---[ end trace 3be245c725359407 ]---
[94276.305945] wireguard: Allocation failed at size 67108864
[94276.310101] wireguard: WireGuard 0.0.20190227 loaded. See www.wireguard.com for information.
[94276.317258] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
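
For anyone comparing their own dmesg against this one, here is a minimal sketch that pulls the attempted sizes and the failing size out of the debug lines. It assumes the two message formats shown above ("About to allocate size N, index K" and "Allocation failed at size N"). The function names and the short `LOG` excerpt are made up for illustration and are not part of the debugging patch.

```python
import re

# Short excerpt of the debug output quoted above (timestamps kept as-is).
LOG = """\
[94276.081217] wireguard: About to allocate size 33554432, index 25
[94276.087004] wireguard: About to allocate size 67108864, index 26
[94276.305945] wireguard: Allocation failed at size 67108864
"""

ATTEMPT = re.compile(r"wireguard: About to allocate size (\d+), index (\d+)")
FAILURE = re.compile(r"wireguard: Allocation failed at size (\d+)")


def summarize(log_text):
    """Return (attempted_sizes, failed_size) parsed from the debug lines."""
    attempts = []
    failed = None
    for line in log_text.splitlines():
        m = ATTEMPT.search(line)
        if m:
            size, index = int(m.group(1)), int(m.group(2))
            # In the quoted log every size equals 2**index; flag anything else.
            if size != 1 << index:
                raise ValueError(f"unexpected size/index pair: {size}, {index}")
            attempts.append(size)
            continue
        m = FAILURE.search(line)
        if m:
            failed = int(m.group(1))
    return attempts, failed


def largest_ok(attempts, failed):
    """Largest attempted size below the failing one (None if nothing qualifies)."""
    ok = [s for s in attempts if failed is None or s < failed]
    return max(ok) if ok else None


attempts, failed = summarize(LOG)
print("failed size (bytes):", failed)
print("largest attempted size below the failure:", largest_ok(attempts, failed))
```

On the log above, the failure is at 67108864 bytes (64 MiB, index 26), and the largest size attempted before it is 33554432 bytes (32 MiB). Whether the smaller attempts actually succeeded is an inference from the absence of earlier failure lines.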

phillipmcmahon commented 5 years ago

Has there been any progress on this? I am happy to test packages (assuming no risk of bricking my ER-6P; it has a serial port, but I'm not sure how far I can screw things up), and if someone can point me in the right direction for setting up a compile toolchain, I will gladly assist with that too.

aswild commented 5 years ago

@phillipmcmahon The patch in this comment (my modification of this one for v2.0) seems to fix the kernel panics, but I don't think a proper root cause has been found.

If you're willing to rebuild your kernel you could test out Jason's debugging patch here, but custom kernels are riskier than just testing the wireguard module, and I'm not sure what the exact recovery procedure would be.

zx2c4 commented 5 years ago

If all goes well I should shortly have possession of an ERL. Any specific firmware I should be using?

phillipmcmahon commented 5 years ago

> If all goes well I should shortly have possession of an ERL. Any specific firmware I should be using?

I don't know if the Ubiquiti folks would share the adoption numbers (if they have them) but my gut feeling is that with all the issues of the v2.0 firmware most folks are still running v1.10.x on their production setups and therefore would be a good starting point to focus on.

dlpwx commented 5 years ago

+1 for focusing on fw v1.10.x. v2.0.0 makes grown men cry. v2.0.1 not in sight yet.

aswild commented 5 years ago

I agree 1.10.x is probably more common and thus a good starting point, but @phillipmcmahon @dlpwx what's so bad about 2.0.0? I've been running since it came out and it's been totally solid.

phillipmcmahon commented 5 years ago

> I agree 1.10.x is probably more common and thus a good starting point, but @phillipmcmahon @dlpwx what's so bad about 2.0.0? I've been running since it came out and it's been totally solid.

Terrible issues on the ER-X series: hardware reboots, hwnat-ing not working, and igmp-proxy not working, to name the issues I have had with my particular setup. I just needed a working setup, so I went back to v1.10.x.

Then following the release thread on the forum it seems, at least by volume, to be the most problematic release in recent history for many many folks. Bricked units, partially working configs etc.

It seemed to leave beta whilst users were still reporting serious issues, not sure of what pressures they were experiencing to suddenly make it live as they did. Interesting it has also not received even a point update so far. I will wait until the forum gods announce this is good for daily use before I go back to it.

phillipmcmahon commented 5 years ago

Is there any progress on this? Happy to help/test etc. as needed.

phillipmcmahon commented 5 years ago

Ping. Offering to help, things seem to have gone very quiet.

zx2c4 commented 5 years ago

Quiet, yes, but not forgotten. Lots of unexpected travel precluding my access to the hardware right now. I'd suggest @Lochnair apply the workaround I posted above to his builds until I'm back home and can figure out what UBNT is doing to their kernels.

phillipmcmahon commented 5 years ago

Appreciate the response, and also to know at some point things will pick up again. There has been another release of WireGuard in the meantime, v0.0.20190406.

zx2c4 commented 5 years ago

Indeed. I'm the one who made that release :)

I don't expect it will fix the kmalloc problem, though.

phillipmcmahon commented 5 years ago

haha, my bad. I should know whom I am talking with next time :)

dampfklon commented 5 years ago

I can confirm 0406 still crashes without the patch

coreyhines commented 5 years ago

I am willing to test on an ER-4 with EdgeOS FW 2.0.1 if deb packages go back up again.

Lochnair commented 5 years ago

Packages with the patch applied are available from the build server now:

If they work for you, I'll tag a new release with them.

phillipmcmahon commented 5 years ago

Fingers crossed, installing now on my 6P...

Update: Installed, rebooted and it all came back up and within these first few minutes it looks ok. My WireGuard client connected without issue and traffic is-a-flowing. I will keep hammering it this evening and see if something "bad" happens.

Early to call it, but thanks a lot.

phillipmcmahon commented 5 years ago

Several GB have passed through the multiple WG interfaces I have installed on my 6P. All looks pretty solid. No issues noted as of yet.

aswild commented 5 years ago

Thanks for the build! The 2.0 package seems sane on my ER4 v2.0.1

coreyhines commented 5 years ago

Allen,

Can you share the relevant portions of your config? I can get the tunnel to activate but the routes aren't getting pushed. This is my first time setting up Wireguard, see config. Thanks in advance.

Client config:

[Interface]
PrivateKey =
Address = 192.168.10.2/32

[Peer]
PublicKey =
AllowedIPs = 0.0.0.0/0 ::/0
Endpoint = gw.freeblizz.com:53922

ER4 config:

name WAN_LOCAL {
    default-action drop
    description "WAN to router"
    rule 31 {
        action accept
        description wireguard
        destination {
            port 53922
        }
        log enable
        protocol udp
        state {
            established enable
            invalid disable
            new enable
            related enable
        }

wireguard wg0 {
    address 192.168.10.1/24
    listen-port 53922
    peer {
        allowed-ips 192.168.10.2/32
    }
    private-key
    route-allowed-ips true

nat {
    rule 5011 {
        description "Masquerade for wg0"
        outbound-interface wg0
        protocol all
        type masquerade
    }

dc361 commented 5 years ago

Corey -- try your peer configuration without the IPv6 default network. I've had a problem with this in the last few versions and have had to use a script to add it after the link is up, using the wg command directly. For some reason, on the ERs, if ::/0 (or 0::/0) is present in the saved config, it doesn't work.
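
A rough sketch of what such a post-up script might look like, written in Python for illustration. The interface name, the placeholder key, and the helper names are assumptions rather than anything taken from a working setup. The only external call is the standard `wg set <interface> peer <key> allowed-ips <list>` form, and the list passed should contain every prefix the peer is meant to have.

```python
import subprocess


def wg_set_allowed_ips(interface, peer_public_key, allowed_ips):
    """Build the `wg set` argument list that sets a peer's allowed-ips."""
    if not allowed_ips:
        raise ValueError("allowed_ips must not be empty")
    # wg expects the prefixes as one comma-separated argument.
    return ["wg", "set", interface, "peer", peer_public_key,
            "allowed-ips", ",".join(allowed_ips)]


def apply(cmd):
    """Run the command; needs root and an interface that is already up."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical values: substitute the real interface, key, and prefixes.
    cmd = wg_set_allowed_ips("wg0", "<peer-public-key>",
                             ["192.168.10.2/32", "::/0"])
    print(" ".join(cmd))
    # apply(cmd)  # uncomment on the router once the values are filled in
```

Note that `wg set` only changes the peer's allowed-ips; it does not install any routes, so a matching route would still have to be added separately if one is needed.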

Lochnair / vyatta-wireguard

Kernel Panic on 2/27 build with USG #97