rotanid closed this issue 6 years ago
Does this happen as well if you autoupdate to the same version of 2018.x again and again?
E.g. by backdating /lib/gluon/release and then running autoupdater -f. In other words: is it linked to
Confirmed for "gluon-v2018.1-7-gea9a69f" after the 8th or 9th run of the SSH script: the node did not rejoin the network, instead it is in config mode:
wget http://192.168.1.1/cgi-bin/config [..]
In other words: it lost its node name (and is in config mode).
x86-64 is also affected: after the third(!) update the node name is gone and the node is in config mode.
to be more precise:
I just automated the testing and did not use the autoupdater this time, only "sysupgrade". I tested this with qemu/x86. After 28 runs there is still no problem, so somehow the issue only happens when updating via the autoupdater.
I am unable to reproduce this on ea9a69f7da94ca6c8b2f4a653c807c78f17e23ec when using
- sysupgrade (~10 times in a row)
- autoupdater -f --force-version with a mirror on the WAN interface (~100 times in a row)
EDIT: reproduced again after 96 attempts
and no, I am not sure how to rate that result.
My current guess(!) is that it may happen only if mesh-on-lan and vpn-on-wan are activated.
But why only after 96 tries? Maybe it only happens if the connection to the download server is bad.
Should not happen: if the entire file is not downloaded -> no valid signature -> no sysupgrade should start.
In other words: if that were an issue, we would really have a problem. (And no, I do not think so.)
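The logic being asserted here can be sketched as a runnable outline (all helper names are hypothetical stand-ins, not the actual autoupdater code):

```shell
#!/bin/sh
# Conceptual sketch of the safeguard described above -- NOT the actual
# autoupdater code; download_image/verify_signature/flash are hypothetical
# stand-ins for the HTTP fetch, the ECDSA check, and sysupgrade.
download_image()   { echo "firmware-bytes" > "$1"; }
verify_signature() { [ -s "$1" ]; }
flash()            { echo "flashing $1"; }

IMG=/tmp/fw.img
if download_image "$IMG" && verify_signature "$IMG"; then
    flash "$IMG"    # only reached with a complete, signed image
else
    echo "aborting: incomplete download or bad signature" >&2
fi
```

With an incomplete download the signature check fails, so the flash stage is never reached.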
Not able to reproduce on an 841v9 after 357 runs of autoupdater -f. I tried different scenarios (with/without lanmesh connected, with wan-vpn active/not connected). No "luck"... I am just running a new build to double-check; eventually we are hunting "defective SPI flash" on routers (but this would not explain why the vmdk versions failed as well).
I rebuilt my x86 image today on the "original" build machine (but manually, instead of via Jenkins - shouldn't matter, though?) and I can't reproduce the issue anymore. If anyone knows a way to diff the old build of the same code with the newer build, please contact me via IRC channel #gluon.
Despite the fact that I cannot reproduce it in a lab environment:
last night I rebuilt gluon-v2018.1-7-gea9a69f and deployed it via autoupdater webserver to 15 nodes (gluon-v2018.1-7-gea9a69f to gluon-v2018.1-7-gea9a69f, just with some modified packages/i18n stuff).
I lost the configs on more than 50% of the nodes (and I really do not know why, since the exact same image, for example for the 841v9, did the job on a different machine >100 times flawlessly).
All those devices had done the update from 2016.1.x to 2018.1.7 without problems.
my hypothesis for the moment is:
reproduced "by accident" on going autoupdater 2018.1 to 2018.1 (idem version) full log https://paste.debian.net/1035900/
Switching to ramdisk...
mount: mounting /dev/mtdblock3 on /overlay failed: Resource busy
[ 115.879170] BUG: Dentry 814de5d8{i=75,n=rfkill} still in use (1) [unmount of jffs2 jffs2]
[ 115.887785] ------------[ cut here ]------------
[ 115.892584] WARNING: CPU: 0 PID: 3479 at fs/dcache.c:1464 umount_check+0x80/0x90()
[ 115.900398] Modules linked in: ath9k ath9k_common xt_CT mac80211 ath9k_hw ath batman_adv cfg80211 compat gpio_button_hotplug
[ 115.912091] CPU: 0 PID: 3479 Comm: busybox Not tainted 4.4.139 #0
[ 115.918389] Stack : 80425a1c 00000000 00000001 80480000 81a1f52c 804787a3 80406e14 00000d97
804f378c 00000001 8143ec38 00000000 80520000 800a7f98 8040c53c 80470000
00000003 00000001 8040a924 8095bd34 80520000 800a5f14 0000004e 00000000
00000001 801fd600 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
...
[ 115.955420] Call Trace:
[ 115.957965] [<80072254>] show_stack+0x54/0x88
[ 115.962497] [<80082054>] warn_slowpath_common+0xa0/0xd0
[ 115.967901] [<8008210c>] warn_slowpath_null+0x18/0x24
[ 115.973134] [<8011d03c>] umount_check+0x80/0x90
[ 115.977817] [<8011b824>] d_walk+0xc4/0x1d4
[ 115.982056] [<8011c2bc>] do_one_tree+0x2c/0x48
[ 115.986664] [<8011d1c8>] shrink_dcache_for_umount+0x48/0x80
[ 115.992452] [<80109d84>] generic_shutdown_super+0x24/0xd8
[ 115.998065] [<80217488>] kill_mtd_super+0x14/0x30
[ 116.002944] [<801730d0>] jffs2_kill_sb+0x3c/0x54
[ 116.007721] [<8010a1b4>] deactivate_locked_super+0x50/0x98
[ 116.013424] [<8012404c>] cleanup_mnt+0x54/0x78
[ 116.018028] [<80097698>] task_work_run+0x74/0xa0
[ 116.022810] [<800609ac>] work_notifysig+0x10/0x18
[ 116.027665]
[ 116.029209] ---[ end trace 4ce681ca04f7ca0b ]---
[ 116.042368] VFS: Busy inodes after unmount of jffs2. Self-destruct in 5 seconds. Have a nice day...
Performing system upgrade...
Unlocking firmware ...
Writing from <stdin> to firmware ...
Appending jffs2 data from /tmp/sysupgrade.tgz to firmware..
Upgrade completed
Rebooting system...
[..]
[ 8.189182] jffs2_scan_eraseblock(): End of filesystem marker found at 0x10000
[ 8.196682] jffs2_build_filesystem(): unlocking the mtd device... done.
[ 8.203574] jffs2_build_filesystem(): erasing all blocks after the end marker ... done.
[ 9.695366] jffs2: notice: (389) jffs2_build_xattr_subsystem: complete building xattr subsystem, 0 of xdatum (0 unchecked, 0 orphan) and 0 of xref (0 dead, 0 orphan) found.
[ 9.713162] mount_root: overlay filesystem has not been fully initialized yet
[ 9.721477] mount_root: switching to jffs2 overlay
I've a very similar problem with my mesh-independent autoupdater. At first I assumed there was a bug in my code, but closer inspection revealed what's going on:
When the autoupdater invokes the sysupgrade script, the script creates a tgz archive of all files that should be preserved and stores it in ramfs. But under certain circumstances this fails: the busybox tar fails with the error message "gzip: short write". After patching busybox to print the actual errno in the error message, I found that the first write gzip does fails with EBADF. This results in a subsequent failure of tar with EPIPE, since busybox tar cz is implemented as tar c | gzip. Taking a look at the sysupgrade code, it seems that

tar c${TAR_V}zf "$conf_tar" -T "$CONFFILES" 2>/dev/null

is the culprit. I'm not really sure why it fails with EBADF though. I just didn't have the necessary patience to compile busybox with debug symbols and throw it at gdb yet.
Anyway, the failure of busybox tar results in an empty sysupgrade.tgz; mtd -j can't handle empty files, and thus on reboot the sysupgrade.tgz file is not found.
This seems to happen only with busybox tar. Including a GNU tar in my build "fixed" the issue for me.
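The "tar cz = tar c | gzip" equivalence above can be illustrated directly (paths are throwaway examples): both commands must yield the same uncompressed archive, which is also why a failing gzip stage leaves only an empty output file behind.

```shell
#!/bin/sh
# Demo of the equivalence discussed above: "tar czf OUT ..." behaves like
# "tar cf - ... | gzip > OUT". The example paths below are made up.
mkdir -p /tmp/tardemo && echo hello > /tmp/tardemo/f
tar czf /tmp/one.tgz -C /tmp tardemo/f
tar cf - -C /tmp tardemo/f | gzip > /tmp/two.tgz
# compare the *uncompressed* payloads (gzip headers embed timestamps)
a=$(gzip -dc /tmp/one.tgz | cksum)
b=$(gzip -dc /tmp/two.tgz | cksum)
[ "$a" = "$b" ] && echo "identical tar payload"
```

If the gzip stage of the pipeline dies early, /tmp/two.tgz stays empty even though the pipeline "ran", which is exactly the empty sysupgrade.tgz symptom.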
@Adorfer I'm seeing this error a lot on devices even running simple sysupgrades. I don't think it's related to the current problem. My devices always boot fine even with this error message.
Currently I am looking at https://git.busybox.net/busybox/commit/shell/ash.c?id=32fdf2f9fc9a617918672d71579f4ad42eb9bde9 - not sure if that commit is related.
@TobleMiner have you experienced your problem with multiple Gluon versions, or with which one in particular? We didn't have a single issue with the updates of v2016.2.7 or v2017.1.8 - only the latest update in the v2018.1.x branch led me to create this issue.
also, which of the error messages do you see a lot?
@rotanid I'm not quite sure if older gluon versions are affected as well. I'm developing using 2018.1.x only. Thus I've seen this issue upgrading from v2018.1.x to v2018.1.x across a bunch of revisions.
I can reproduce the problem pretty reliably with my mesh-independent autoupdater. Every single time the router updates without mesh connectivity, it loses its config. It doesn't when updating with mesh connectivity, though.
I'm not seeing any error messages in particular. All interesting output is usually suppressed by 2>/dev/null. After removing that I see 'gzip: short write' and 'tar: short write'. Both of those are generic error messages produced as soon as a full_write inside tar/gzip terminates prematurely with an error.
@TobleMiner ok, so at least in your case the problem definitely is that errors are suppressed instead of aborting or retrying... I guess this should be reported & fixed in OpenWrt upstream, but even then @NeoRaider is the right person to do it, I guess.
Do we see this in master as well? I did not manage to reproduce it there. There the busybox (which contains the gzip) is 1.27 afaik. How could one build a 2018.1.x with a newer busybox version for a test? Any suggestions?
@Adorfer if your theory about busybox were correct, Gluon v2017.1.x would also be affected, as there has been no change to busybox in the lede-17.01 branch since December 2017: https://git.openwrt.org/?p=openwrt/openwrt.git;a=history;f=package/utils/busybox;hb=refs/heads/lede-17.01 - both v2017.1.x and v2018.1.x are based on this branch.
@rotanid The problem is not that there is no "retrying". There is no valid reason for a tar czf to fail in that way under the mentioned circumstances. I'm pretty confident it's a severe bug, especially since it seems to depend on the environment tar is invoked in.
Is it possible that only a few communities did really hard tests of 2017?
@Adorfer there are thousands of nodes running v2017.1.x, including our ~550 which had no update problems.
OpenWrt updated busybox from 1.28.4 to 1.29.2 yesterday: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=12fb4bb834dc715ce43bf428c35b489e89e776a1
And the changelog of busybox mentions some changes regarding gzip and tar: https://www.busybox.net/news.html
What would be the cleanest way to get the current busybox into 2018.1.x for a test run? After all, the openwrt/lede busybox carries a large number (a dozen?) of patches.
I just checked whether I could see some error messages. I removed the 2>/dev/null from the tar line in /sbin/sysupgrade; this build has busybox 1.28.3.
At least the kernel messages still occur:
Watchdog handover: fd=3
- watchdog -
killall: telnetd: no process killed
Sending TERM to remaining processes ... dnsmasq tunneldigger dnsmasq ubusd logd gluon-arp-limit
Sending KILL to remaining processes ...
Switching to ramdisk...
mount: mounting /dev/mtdblock3 on /overlay failed: Resource busy
[ 218.371736] BUG: Dentry 814d8a18{i=65,n=rfkill} still in use (1) [unmount of jffs2 jffs2]
[ 218.380364] ------------[ cut here ]------------
[ 218.385163] WARNING: CPU: 0 PID: 3825 at fs/dcache.c:1464 umount_check+0x80/0x90()
[ 218.392981] Modules linked in: ath9k ath9k_common xt_CT mac80211 ath9k_hw ath batman_adv cfg80211 compat gpio_button_hotplug
[ 218.404675] CPU: 0 PID: 3825 Comm: busybox Not tainted 4.4.139 #0
[ 218.410965] Stack : 80425a1c 00000000 00000001 80480000 817e6b0c 804787a3 80406e14 00000ef1
804f378c 00000001 8143ec38 00000000 80520000 800a7f98 8040c53c 80470000
00000003 00000001 8040a924 80909d34 80520000 800a5f14 0000004e 00000000
00000001 801fd600 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
...
[ 218.448014] Call Trace:
[ 218.450559] [<80072254>] show_stack+0x54/0x88
[ 218.455096] [<80082054>] warn_slowpath_common+0xa0/0xd0
[ 218.460495] [<8008210c>] warn_slowpath_null+0x18/0x24
[ 218.465728] [<8011d03c>] umount_check+0x80/0x90
[ 218.470411] [<8011b824>] d_walk+0xc4/0x1d4
[ 218.474660] [<8011c2bc>] do_one_tree+0x2c/0x48
[ 218.479255] [<8011d1c8>] shrink_dcache_for_umount+0x48/0x80
[ 218.485049] [<80109d84>] generic_shutdown_super+0x24/0xd8
[ 218.490659] [<80217488>] kill_mtd_super+0x14/0x30
[ 218.495538] [<801730d0>] jffs2_kill_sb+0x3c/0x54
[ 218.500315] [<8010a1b4>] deactivate_locked_super+0x50/0x98
[ 218.506017] [<8012404c>] cleanup_mnt+0x54/0x78
[ 218.510621] [<80097698>] task_work_run+0x74/0xa0
[ 218.515404] [<800609ac>] work_notifysig+0x10/0x18
[ 218.520258]
[ 218.521801] ---[ end trace f6ce98a49a4028e2 ]---
[ 218.534892] VFS: Busy inodes after unmount of jffs2. Self-destruct in 5 seconds. Have a nice day...
Performing system upgrade...
Hey all, today I've had quite a long debugging session diving deep into the issue.
The trigger for our problem is a bug in libuclient. Under certain circumstances, calling uclient_free closes the STDIN file descriptor. Since the new autoupdater uses libuclient, the /sbin/sysupgrade script is sometimes called with fd 0 closed:
root@Flitter:~# ls -la /proc/22307/fd
dr-x------ 2 root root 0 Aug 3 20:10 .
dr-xr-xr-x 8 root root 0 Aug 3 20:10 ..
l-wx------ 1 root root 64 Aug 3 20:10 1 -> /dev/ttyS0
l-wx------ 1 root root 64 Aug 3 20:10 2 -> /dev/ttyS0
lr-x------ 1 root root 64 Aug 3 20:10 3 -> /tmp/lock/autoupdater.lock
As soon as tar is called by the sysupgrade script, fd 0 is reused as the file descriptor for the tar.gz file:
root@Flitter:~# ls -la /proc/22307/fd
dr-x------ 2 root root 0 Aug 3 20:10 .
dr-xr-xr-x 8 root root 0 Aug 3 20:10 ..
l-wx------ 1 root root 64 Aug 3 20:10 0 -> /tmp/sysupgrade.tgz
l-wx------ 1 root root 64 Aug 3 20:10 1 -> /dev/ttyS0
l-wx------ 1 root root 64 Aug 3 20:10 2 -> /dev/ttyS0
lr-x------ 1 root root 64 Aug 3 20:10 3 -> /tmp/lock/autoupdater.lock
Since gzip compression was requested, the tar process forks and execs gzip:
root@Flitter:~# ls -la /proc/22672/fd
dr-x------ 2 root root 0 Aug 3 20:12 .
dr-xr-xr-x 8 root root 0 Aug 3 20:12 ..
l-wx------ 1 root root 64 Aug 3 20:12 0 -> /tmp/sysupgrade.tgz
l-wx------ 1 root root 64 Aug 3 20:12 1 -> /dev/ttyS0
l-wx------ 1 root root 64 Aug 3 20:12 2 -> /dev/ttyS0
lr-x------ 1 root root 64 Aug 3 20:12 3 -> /tmp/lock/autoupdater.lock
lr-x------ 1 root root 64 Aug 3 20:12 4 -> pipe:[139603]
l-wx------ 1 root root 64 Aug 3 20:12 5 -> pipe:[139603]
But the code responsible for passing the output of tar to gzip assumes that the fd of the tar.gz file will not be zero and overwrites fd 0 with the pipe to tar. This results in gzip trying to write to the read end of the unidirectional pipe to tar, producing an EBADF.
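The precondition for this failure can be demonstrated with a plain shell snippet (a sketch assuming Linux /proc is available):

```shell
#!/bin/sh
# Sketch of the precondition: a child process started with fd 0 closed,
# just like the buggy libuclient left sysupgrade. POSIX guarantees that
# open() returns the lowest unused descriptor, so the first file such a
# child opens (here: tar's output archive) lands on fd 0.
sh -c '
  if [ -e /proc/self/fd/0 ]; then
    echo "fd 0 is open"
  else
    echo "fd 0 is closed, the next open() will reuse it"
  fi
' 0<&-
```

In the broken case, that "next open()" is the sysupgrade.tgz file, which the later pipe setup then clobbers.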
So what can we do about that, except an upstream fix in libuclient? Any chance for a hotfix?
@Adorfer Fixing libuclient looks like the only option to me. I'll write a patchset for it
I've just sent a patch for uclient on the openwrt-devel mailing list and a patch for busybox on the busybox mailing list. With a little luck we will have those two bugs fixed very soon.
EDIT: The fix for uclient has just been merged
Wow, great work @TobleMiner! To sum up, we had three issues:
Two of these have been fixed; the uclient fix is merged into openwrt-18.06 and also lede-17.01. The sysupgrade change is merged into OpenWrt master, but not into 18.06 or 17.01 yet.
About sysupgrade: I'm referring to line numbers in the following file: https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/base-files/files/sbin/sysupgrade;h=a378b029500ac5981d504889b0b3e6af3cf92a0d;hb=HEAD
Please check whether my assumptions are correct: the function do_save_conffiles is called in L180 and L249. It creates a gzipped tar file from a list of files and silences errors in L167. If the tar in L167 results in an error, the function doesn't return an error but just proceeds to delete the file CONFFILES, which succeeds, and therefore the function returns 0 in L169. Of course there is a check for the return code in L180, but a failing tar doesn't lead to a different return code, as already stated. In L249 there isn't even any return code check.
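If those assumptions are correct, a minimal hardening would look roughly like this (a sketch that reuses the naming of the sysupgrade script, not the actual upstream patch; the self-test paths are made up):

```shell
#!/bin/sh
# Sketch of a hardened do_save_conffiles: stop silencing tar and
# propagate its exit status to the caller.
do_save_conffiles() {
    conf_tar="$1"; conffiles_list="$2"
    if ! tar czf "$conf_tar" -T "$conffiles_list"; then
        rm -f "$conf_tar"        # never leave an empty/truncated archive behind
        echo "sysupgrade: config backup failed" >&2
        return 1
    fi
    rm -f "$conffiles_list"
}

# tiny self-test with throwaway paths
echo demo > /tmp/cfg.txt
echo /tmp/cfg.txt > /tmp/conffiles.lst
do_save_conffiles /tmp/backup.tgz /tmp/conffiles.lst && echo "backup ok"
```

With this shape, a failing tar makes the function return non-zero, so the existing return-code check at the first call site would actually trigger.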
I guess the whole sysupgrade code could use some refactoring. Especially set -e -o pipefail may be pretty useful to enforce exit on error. The return code checks seem to be pretty broken and do not make a lot of sense.
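For illustration of the pipefail part of that suggestion (shown with bash; busybox ash also accepts the option in newer versions):

```shell
#!/bin/bash
# Why `set -o pipefail` matters here: without it, a pipeline's exit
# status is that of the LAST command, so a failing first stage (think
# of the tar in `tar c | gzip`) is silently hidden.
false | cat
echo "without pipefail: $?"   # 0: cat succeeded, the failure is hidden

set -o pipefail
false | cat
echo "with pipefail: $?"      # 1: the failing stage now surfaces
```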
Hm, although I don't disagree, that's like saying "uh, those diesel cars have a bug in their exhaust system, let's replace them with new cars". In technical terms: a refactoring may be good, but it is no solution for supported release branches.
And you didn't answer the question whether my assumptions are correct ;-)
Hooray, we found the root cause which leads to the bugs described above: if an autoupdater mirror is not reachable and the next one is tried, the config backup fails. This happens especially if DNS resolution fails.
Also, I created a patch/PR for sysupgrade; see the comment above: https://github.com/freifunk-gluon/gluon/issues/1496#issuecomment-410398547
@rotanid I'd not go as far as calling that the root cause. It's more like the scenario where the bug really surfaces. The root cause is still buggy libuclient closing STDIN which is now fixed in https://git.openwrt.org/?p=project/uclient.git;a=commit;h=ae1c656ff041c6f1ccb37b070fa261e0d71f2b12 .
@Adorfer @TobleMiner I just pushed an update to the v2018.1.x branch to update to current upstream lede-17.01, which contains the updated/fixed uclient.
I think we should leave this issue open at least until we/I/someone has also updated the master branch to current openwrt-18.06.
As far as I understood, libuclient is used 1) directly by the autoupdater and 2) by busybox.
Does this LEDE bump cover both of them?
Additionally, I would opt to have error handling in the autoupdater in case sysupgrade returns with an error code. (I assume that this is not the case currently -> freeze of the device, needs to be power-cycled.
Please correct me if I am mistaken.)
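A rough sketch of such error handling (a hypothetical wrapper, not the real autoupdater code; run_sysupgrade stands in for invoking /sbin/sysupgrade):

```shell
#!/bin/sh
# Sketch of the error handling proposed above: inspect sysupgrade's exit
# status instead of assuming the call never returns.
run_sysupgrade() { return 1; }   # hypothetical stand-in; forced failure

if run_sysupgrade; then
    echo "ok" > /tmp/au_status               # normally the device reboots here
else
    # a real implementation would release the lock and schedule a retry
    echo "failed, keeping current firmware" > /tmp/au_status
fi
cat /tmp/au_status
```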
Is this bug also present in 2018.1? I got a report regarding this and there is some uncertainty. We plan to roll out 2018.1 shortly; this bug would mean an immediate stop of the rollout for us and would certainly be relevant to communicate.
I asked around, and the reported case occurred while using the web interface updater. But the person is not totally sure whether the checkbox was activated or not. (And it would have been a case of 2017.7/8 to 2018.1.)
If I interpret the bug correctly, it is only triggered since the newer firmware (2018.1) and depends on the old firmware (where the autoupdater is run).
"Is 2018.1 also affected by this bug?"
Please have a look at the initial post of this issue. "2018.1.x" is meant as "Gluon branch 'v2018.1.x'". A description in German can be found at https://forum.freifunk.net/t/gluon-2018-1-x-autoupdater-configverlust-gefahr/19322
Hi,
sorry for my misunderstanding. I was just unsure whether 2018.1.x includes 2018.1 as well; there was a discussion about it. I did not see the topic on the Freifunk board before, as I was only following this thread for a couple of days, which resulted in internal discussions.
Your post there is quite clear, thank you. I just wanted to eliminate any misunderstanding. Sorry for the noise ;-). If you prefer, I can delete those messages to keep the thread clean.
@Adorfer no, the busybox tar update has to make it into OpenWrt first; it is not covered yet. But with a fixed uclient, the busybox tar mini-flaw is no problem anymore, as far as I understand - my testing with updated uclient didn't suffer from the issue anymore. So every other change apart from the uclient fix (busybox tar, sysupgrade, autoupdater handling the sysupgrade exit code) is NOT necessary to fix the issue we're commenting on at the moment (but will be done eventually).
fixed via 0cb98882d05924878775196fcdfa600c3d846215 15 days ago.
Just for confirmation: is it now considered safe to update to 2018.1? I'm not sure I understand correctly whether the issue only occurs when updating from the broken version onwards. Should I rather stay on 2017.1.x and update to the next Gluon release, or is it safe enough to go to 2018.1 as released before?
1) Do not upgrade to 2018.1.x before this fix. 2) If you have units with early 2018.1.x or late 2017.x (not sure where this started a few months ago), you may already be affected by the aftermath. So consider the steps described in the linked forum post. (If you are not familiar with German, please ask for a translation.)
@Adorfer what makes you think v2017.1.x is affected? @wrosenauer if you don't know how to use the up-to-date branch (rather than the tag/release), please wait for the v2018.1.1 tag/release.
~20 of our nodes are already running on the v2018.1.x branch. Until recently they ran builds based on the v2018.1 release tag. When updating to ea9a69f7da94ca6c8b2f4a653c807c78f17e23ec, 4 of those nodes lost their configuration during the update (so they start into config mode).
These were different devices (Futro S550, Archer C7, UAP AC Mesh, Archer C59) from two different targets (ar71xx, x86). I was able to reproduce the issue using a Futro, and more easily using a virtual machine (x86-64): I downgrade to the v2018.1 build via sysupgrade, and after that I upgrade via autoupdater to the build based on ea9a69f7da94ca6c8b2f4a653c807c78f17e23ec. The configuration loss never happens with the downgrade, only when doing the autoupdater upgrade, and even then not every time.
here's the log before the node reboots into the updated firmware:
here's the boot log after this: