rotanid closed this issue 6 years ago
Does this happen as well if you autoupdate to the same version of 2018.x again and again?
E.g. by backdating /lib/gluon/release and then running autoupdater -f. In other words: is it linked to
Confirmed for "gluon-v2018.1-7-gea9a69f" after the 8th or 9th run of the SSH script: the node did not rejoin the network, instead it is in config mode:
wget http://192.168.1.1/cgi-bin/config [..]
In other words: it lost its node name (and is in config mode).
x86-64 is also affected: after the third(!) update the node name is gone and the node is in config mode.
to be more precise:
I just automated the testing and did not use the autoupdater this time, only "sysupgrade". I tested this with qemu/x86. After 28 runs there is still no problem, so somehow the issue only happens when updating via the autoupdater.
I am unable to reproduce this on ea9a69f7da94ca6c8b2f4a653c807c78f17e23ec when using
- sysupgrade (~10 times in a row)
- autoupdater -f --force-version with a mirror on the WAN interface (~100 times in a row)
EDIT: reproduced again after 96 attempts
and no, I am not sure how to rate that result.
My current guess(!) is that it may happen only if mesh-on-lan and vpn-on-wan are activated.
But why only after 96 tries? Maybe it only happens if the connection to the download server is bad.
Should not happen: if the entire file is not downloaded -> no valid signature -> no sysupgrade should start.
In other words: if that were an issue, we would really have a problem. (And no, I do not think so.)
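The logic being asserted here can be sketched as a runnable outline (all helper names are hypothetical stand-ins, not the actual autoupdater code):

```shell
#!/bin/sh
# Conceptual sketch of the safeguard described above -- NOT the actual
# autoupdater code; download_image/verify_signature/flash are hypothetical
# stand-ins for the HTTP fetch, the ECDSA check, and sysupgrade.
download_image()   { echo "firmware-bytes" > "$1"; }
verify_signature() { [ -s "$1" ]; }
flash()            { echo "flashing $1"; }

IMG=/tmp/fw.img
if download_image "$IMG" && verify_signature "$IMG"; then
    flash "$IMG"    # only reached with a complete, signed image
else
    echo "aborting: incomplete download or bad signature" >&2
fi
```

With an incomplete download the signature check fails, so the flash stage is never reached.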
Not able to reproduce on an 841v9 after 357 runs of autoupdater -f. I tried different scenarios (with/without lanmesh connected, with wan-vpn active/not connected). No "luck"... I am just running a new build to double-check; eventually we are hunting "defective SPI flash" on routers (but this would not explain why the vmdk versions failed as well).
I rebuilt my x86 image today on the "original" build machine (but manually, instead of via Jenkins - shouldn't matter, though?) and I can't reproduce the issue anymore. If anyone knows a way to diff the old build of the same code with the newer build, please contact me via IRC channel #gluon.
Despite the fact that I cannot reproduce it in a lab environment:
last night I rebuilt gluon-v2018.1-7-gea9a69f and deployed it via autoupdater webserver to 15 nodes (gluon-v2018.1-7-gea9a69f to gluon-v2018.1-7-gea9a69f, just with some modified packages/i18n stuff).
I lost the configs on more than 50% of the nodes (and I really do not know why, since the exact same image, for example for the 841v9, did the job on a different machine >100 times flawlessly).
All those devices had done the update from 2016.1.x to 2018.1.7 without problems.
my hypothesis for the moment is:
reproduced "by accident" on going autoupdater 2018.1 to 2018.1 (idem version) full log https://paste.debian.net/1035900/
Switching to ramdisk...
mount: mounting /dev/mtdblock3 on /overlay failed: Resource busy
[ 115.879170] BUG: Dentry 814de5d8{i=75,n=rfkill} still in use (1) [unmount of jffs2 jffs2]
[ 115.887785] ------------[ cut here ]------------
[ 115.892584] WARNING: CPU: 0 PID: 3479 at fs/dcache.c:1464 umount_check+0x80/0x90()
[ 115.900398] Modules linked in: ath9k ath9k_common xt_CT mac80211 ath9k_hw ath batman_adv cfg80211 compat gpio_button_hotplug
[ 115.912091] CPU: 0 PID: 3479 Comm: busybox Not tainted 4.4.139 #0
[ 115.918389] Stack : 80425a1c 00000000 00000001 80480000 81a1f52c 804787a3 80406e14 00000d97
804f378c 00000001 8143ec38 00000000 80520000 800a7f98 8040c53c 80470000
00000003 00000001 8040a924 8095bd34 80520000 800a5f14 0000004e 00000000
00000001 801fd600 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
...
[ 115.955420] Call Trace:
[ 115.957965] [<80072254>] show_stack+0x54/0x88
[ 115.962497] [<80082054>] warn_slowpath_common+0xa0/0xd0
[ 115.967901] [<8008210c>] warn_slowpath_null+0x18/0x24
[ 115.973134] [<8011d03c>] umount_check+0x80/0x90
[ 115.977817] [<8011b824>] d_walk+0xc4/0x1d4
[ 115.982056] [<8011c2bc>] do_one_tree+0x2c/0x48
[ 115.986664] [<8011d1c8>] shrink_dcache_for_umount+0x48/0x80
[ 115.992452] [<80109d84>] generic_shutdown_super+0x24/0xd8
[ 115.998065] [<80217488>] kill_mtd_super+0x14/0x30
[ 116.002944] [<801730d0>] jffs2_kill_sb+0x3c/0x54
[ 116.007721] [<8010a1b4>] deactivate_locked_super+0x50/0x98
[ 116.013424] [<8012404c>] cleanup_mnt+0x54/0x78
[ 116.018028] [<80097698>] task_work_run+0x74/0xa0
[ 116.022810] [<800609ac>] work_notifysig+0x10/0x18
[ 116.027665]
[ 116.029209] ---[ end trace 4ce681ca04f7ca0b ]---
[ 116.042368] VFS: Busy inodes after unmount of jffs2. Self-destruct in 5 seconds. Have a nice day...
Performing system upgrade...
Unlocking firmware ...
Writing from <stdin> to firmware ...
Appending jffs2 data from /tmp/sysupgrade.tgz to firmware..
Upgrade completed
Rebooting system...
[..]
[ 8.189182] jffs2_scan_eraseblock(): End of filesystem marker found at 0x10000
[ 8.196682] jffs2_build_filesystem(): unlocking the mtd device... done.
[ 8.203574] jffs2_build_filesystem(): erasing all blocks after the end marker ... done.
[ 9.695366] jffs2: notice: (389) jffs2_build_xattr_subsystem: complete building xattr subsystem, 0 of xdatum (0 unchecked, 0 orphan) and 0 of xref (0 dead, 0 orphan) found.
[ 9.713162] mount_root: overlay filesystem has not been fully initialized yet
[ 9.721477] mount_root: switching to jffs2 overlay
I've a very similar problem with my mesh-independent autoupdater. At first I assumed there was a bug in my code, but closer inspection revealed what's going on:
When the autoupdater invokes the sysupgrade script, the script creates a tgz archive of all files that should be preserved and stores it in ramfs. But under certain circumstances this fails: the busybox tar fails with the error message "gzip: short write". After patching busybox to print the actual errno in the error message, I found that the first write gzip does fails with EBADF. This results in a subsequent failure of tar with EPIPE, since busybox tar cz is implemented as tar c | gzip. Taking a look at the sysupgrade code, it seems that

tar c${TAR_V}zf "$conf_tar" -T "$CONFFILES" 2>/dev/null

is the culprit. I'm not really sure why it fails with EBADF though. I just didn't have the necessary patience to compile busybox with debug symbols and throw it at gdb yet.
Anyway, the failure of busybox tar results in an empty sysupgrade.tgz; mtd -j can't handle empty files, and thus on reboot the sysupgrade.tgz file is not found.
This seems to happen only with busybox tar. Including a GNU tar in my build "fixed" the issue for me.
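The "tar cz = tar c | gzip" equivalence above can be illustrated directly (paths are throwaway examples): both commands must yield the same uncompressed archive, which is also why a failing gzip stage leaves only an empty output file behind.

```shell
#!/bin/sh
# Demo of the equivalence discussed above: "tar czf OUT ..." behaves like
# "tar cf - ... | gzip > OUT". The example paths below are made up.
mkdir -p /tmp/tardemo && echo hello > /tmp/tardemo/f
tar czf /tmp/one.tgz -C /tmp tardemo/f
tar cf - -C /tmp tardemo/f | gzip > /tmp/two.tgz
# compare the *uncompressed* payloads (gzip headers embed timestamps)
a=$(gzip -dc /tmp/one.tgz | cksum)
b=$(gzip -dc /tmp/two.tgz | cksum)
[ "$a" = "$b" ] && echo "identical tar payload"
```

If the gzip stage of the pipeline dies early, /tmp/two.tgz stays empty even though the pipeline "ran", which is exactly the empty sysupgrade.tgz symptom.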
@Adorfer I'm seeing this error a lot on devices even running simple sysupgrades. I don't think it's related to the current problem. My devices always boot fine even with this error message.
Currently I am looking at https://git.busybox.net/busybox/commit/shell/ash.c?id=32fdf2f9fc9a617918672d71579f4ad42eb9bde9 - not sure if that commit is related.
@TobleMiner have you experienced your problem with multiple Gluon versions, or with which one in particular? We didn't have a single issue with the updates of v2016.2.7 or v2017.1.8 - only the latest update in the v2018.1.x branch led me to create this issue.
also, which of the error messages do you see a lot?
@rotanid I'm not quite sure if older gluon versions are affected as well. I'm developing using 2018.1.x only. Thus I've seen this issue upgrading from v2018.1.x to v2018.1.x across a bunch of revisions.
I can reproduce the problem pretty reliably with my mesh-independent autoupdater. Every single time the router updates without mesh connectivity, it loses its config. It doesn't when updating with mesh connectivity, though.
I'm not seeing any error messages in particular. All interesting output is usually suppressed by 2>/dev/null. After removing that I see 'gzip: short write' and 'tar: short write'. Both of those are generic error messages produced as soon as a full_write inside tar/gzip terminates prematurely with an error.
@TobleMiner ok, so at least in your case the problem definitely is that errors are suppressed instead of aborting or retrying... I guess this should be reported & fixed in OpenWrt upstream, but even then @NeoRaider is the right person to do it, I guess.
Do we see this in master as well? I did not manage to reproduce it there. There the busybox (which contains the gzip) is 1.27 afaik. How could one build a 2018.1.x with a newer busybox version for a test? Any suggestions?
@Adorfer if your theory about busybox were correct, Gluon v2017.1.x would also be affected, as there has been no change to busybox in the lede-17.01 branch since December 2017: https://git.openwrt.org/?p=openwrt/openwrt.git;a=history;f=package/utils/busybox;hb=refs/heads/lede-17.01 - both v2017.1.x and v2018.1.x are based on this branch.
@rotanid The problem is not that there is no "retrying". There is no valid reason for a tar czf to fail in that way under the mentioned circumstances. I'm pretty confident it's a severe bug, especially since it seems to depend on the environment tar is invoked in.
Is it possible that only a few communities did really hard tests of 2017?
@Adorfer there are thousands of nodes running v2017.1.x, including our ~550 which had no update problems.
OpenWrt updated busybox from 1.28.4 to 1.29.2 yesterday: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=12fb4bb834dc715ce43bf428c35b489e89e776a1
And the changelog of busybox mentions some changes regarding gzip and tar: https://www.busybox.net/news.html
What would be the cleanest way to get the current busybox into 2018.1.x for a test run? After all, the openwrt/lede busybox carries a large number (a dozen?) of patches.
I just checked whether I could see some error messages. I removed the 2>/dev/null from the tar line in /sbin/sysupgrade; this build has busybox 1.28.3.
At least the kernel messages still occur:
Watchdog handover: fd=3
- watchdog -
killall: telnetd: no process killed
Sending TERM to remaining processes ... dnsmasq tunneldigger dnsmasq ubusd logd gluon-arp-limit
Sending KILL to remaining processes ...
Switching to ramdisk...
mount: mounting /dev/mtdblock3 on /overlay failed: Resource busy
[ 218.371736] BUG: Dentry 814d8a18{i=65,n=rfkill} still in use (1) [unmount of jffs2 jffs2]
[ 218.380364] ------------[ cut here ]------------
[ 218.385163] WARNING: CPU: 0 PID: 3825 at fs/dcache.c:1464 umount_check+0x80/0x90()
[ 218.392981] Modules linked in: ath9k ath9k_common xt_CT mac80211 ath9k_hw ath batman_adv cfg80211 compat gpio_button_hotplug
[ 218.404675] CPU: 0 PID: 3825 Comm: busybox Not tainted 4.4.139 #0
[ 218.410965] Stack : 80425a1c 00000000 00000001 80480000 817e6b0c 804787a3 80406e14 00000ef1
804f378c 00000001 8143ec38 00000000 80520000 800a7f98 8040c53c 80470000
00000003 00000001 8040a924 80909d34 80520000 800a5f14 0000004e 00000000
00000001 801fd600 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
...
[ 218.448014] Call Trace:
[ 218.450559] [<80072254>] show_stack+0x54/0x88
[ 218.455096] [<80082054>] warn_slowpath_common+0xa0/0xd0
[ 218.460495] [<8008210c>] warn_slowpath_null+0x18/0x24
[ 218.465728] [<8011d03c>] umount_check+0x80/0x90
[ 218.470411] [<8011b824>] d_walk+0xc4/0x1d4
[ 218.474660] [<8011c2bc>] do_one_tree+0x2c/0x48
[ 218.479255] [<8011d1c8>] shrink_dcache_for_umount+0x48/0x80
[ 218.485049] [<80109d84>] generic_shutdown_super+0x24/0xd8
[ 218.490659] [<80217488>] kill_mtd_super+0x14/0x30
[ 218.495538] [<801730d0>] jffs2_kill_sb+0x3c/0x54
[ 218.500315] [<8010a1b4>] deactivate_locked_super+0x50/0x98
[ 218.506017] [<8012404c>] cleanup_mnt+0x54/0x78
[ 218.510621] [<80097698>] task_work_run+0x74/0xa0
[ 218.515404] [<800609ac>] work_notifysig+0x10/0x18
[ 218.520258]
[ 218.521801] ---[ end trace f6ce98a49a4028e2 ]---
[ 218.534892] VFS: Busy inodes after unmount of jffs2. Self-destruct in 5 seconds. Have a nice day...
Performing system upgrade...
Hey all, today I've had quite a long debugging session diving deep into the issue.
The trigger for our problem is a bug in libuclient. Under certain circumstances, calling uclient_free closes the STDIN file descriptor. Since the new autoupdater uses libuclient, the /sbin/sysupgrade script is sometimes called with fd 0 closed:
root@Flitter:~# ls -la /proc/22307/fd
dr-x------ 2 root root 0 Aug 3 20:10 .
dr-xr-xr-x 8 root root 0 Aug 3 20:10 ..
l-wx------ 1 root root 64 Aug 3 20:10 1 -> /dev/ttyS0
l-wx------ 1 root root 64 Aug 3 20:10 2 -> /dev/ttyS0
lr-x------ 1 root root 64 Aug 3 20:10 3 -> /tmp/lock/autoupdater.lock
As soon as tar is called by the sysupgrade script, fd 0 is reused as the file descriptor for the tar.gz file:
root@Flitter:~# ls -la /proc/22307/fd
dr-x------ 2 root root 0 Aug 3 20:10 .
dr-xr-xr-x 8 root root 0 Aug 3 20:10 ..
l-wx------ 1 root root 64 Aug 3 20:10 0 -> /tmp/sysupgrade.tgz
l-wx------ 1 root root 64 Aug 3 20:10 1 -> /dev/ttyS0
l-wx------ 1 root root 64 Aug 3 20:10 2 -> /dev/ttyS0
lr-x------ 1 root root 64 Aug 3 20:10 3 -> /tmp/lock/autoupdater.lock
Since gzip compression was requested, the tar process forks and execs gzip:
root@Flitter:~# ls -la /proc/22672/fd
dr-x------ 2 root root 0 Aug 3 20:12 .
dr-xr-xr-x 8 root root 0 Aug 3 20:12 ..
l-wx------ 1 root root 64 Aug 3 20:12 0 -> /tmp/sysupgrade.tgz
l-wx------ 1 root root 64 Aug 3 20:12 1 -> /dev/ttyS0
l-wx------ 1 root root 64 Aug 3 20:12 2 -> /dev/ttyS0
lr-x------ 1 root root 64 Aug 3 20:12 3 -> /tmp/lock/autoupdater.lock
lr-x------ 1 root root 64 Aug 3 20:12 4 -> pipe:[139603]
l-wx------ 1 root root 64 Aug 3 20:12 5 -> pipe:[139603]
But the code responsible for passing the output of tar to gzip assumes that the fd of the tar.gz file will not be zero and overwrites fd 0 with the pipe to tar. This results in gzip trying to write to the read end of the unidirectional pipe to tar, producing an EBADF.
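The precondition for this failure can be demonstrated with a plain shell snippet (a sketch assuming Linux /proc is available):

```shell
#!/bin/sh
# Sketch of the precondition: a child process started with fd 0 closed,
# just like the buggy libuclient left sysupgrade. POSIX guarantees that
# open() returns the lowest unused descriptor, so the first file such a
# child opens (here: tar's output archive) lands on fd 0.
sh -c '
  if [ -e /proc/self/fd/0 ]; then
    echo "fd 0 is open"
  else
    echo "fd 0 is closed, the next open() will reuse it"
  fi
' 0<&-
```

In the broken case, that "next open()" is the sysupgrade.tgz file, which the later pipe setup then clobbers.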
So what can we do about that, except an upstream fix in libuclient? Any chance for a hotfix?
@Adorfer Fixing libuclient looks like the only option to me. I'll write a patchset for it
I've just sent a patch for uclient on the openwrt-devel mailing list and a patch for busybox on the busybox mailing list. With a little luck we will have those two bugs fixed very soon.
EDIT: The fix for uclient has just been merged
Wow, great work @TobleMiner! To sum up, we had three issues:
Two of these have been fixed; the uclient fix is merged into openwrt-18.06 and also lede-17.01. The sysupgrade change is merged into OpenWrt master, but not into 18.06 or 17.01 yet.
About sysupgrade: I'm referring to line numbers in the following file: https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/base-files/files/sbin/sysupgrade;h=a378b029500ac5981d504889b0b3e6af3cf92a0d;hb=HEAD
Please check whether my assumptions are correct: the function do_save_conffiles is called in L180 and L249. It creates a gzipped tar file from a list of files and silences errors in L167. If the tar in L167 results in an error, the function doesn't return an error but just proceeds to delete the file CONFFILES, which succeeds, and therefore the function returns 0 in L169. Of course there is a check for the return code in L180, but a failing tar doesn't lead to a different return code, as already stated. In L249 there isn't even any return code check.
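If those assumptions are correct, a minimal hardening would look roughly like this (a sketch that reuses the naming of the sysupgrade script, not the actual upstream patch; the self-test paths are made up):

```shell
#!/bin/sh
# Sketch of a hardened do_save_conffiles: stop silencing tar and
# propagate its exit status to the caller.
do_save_conffiles() {
    conf_tar="$1"; conffiles_list="$2"
    if ! tar czf "$conf_tar" -T "$conffiles_list"; then
        rm -f "$conf_tar"        # never leave an empty/truncated archive behind
        echo "sysupgrade: config backup failed" >&2
        return 1
    fi
    rm -f "$conffiles_list"
}

# tiny self-test with throwaway paths
echo demo > /tmp/cfg.txt
echo /tmp/cfg.txt > /tmp/conffiles.lst
do_save_conffiles /tmp/backup.tgz /tmp/conffiles.lst && echo "backup ok"
```

With this shape, a failing tar makes the function return non-zero, so the existing return-code check at the first call site would actually trigger.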
I guess the whole sysupgrade code could use some refactoring. Especially set -e -o pipefail may be pretty useful to enforce exit on error. The return code checks seem to be pretty broken and do not make a lot of sense.
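For illustration of the pipefail part of that suggestion (shown with bash; busybox ash also accepts the option in newer versions):

```shell
#!/bin/bash
# Why `set -o pipefail` matters here: without it, a pipeline's exit
# status is that of the LAST command, so a failing first stage (think
# of the tar in `tar c | gzip`) is silently hidden.
false | cat
echo "without pipefail: $?"   # 0: cat succeeded, the failure is hidden

set -o pipefail
false | cat
echo "with pipefail: $?"      # 1: the failing stage now surfaces
```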
Hm, although I don't disagree, that's like saying "uh, those diesel cars have a bug in their exhaust system, let's replace them with new cars". In technical terms: a refactoring may be good, but it is no solution for supported release branches.
And you didn't answer the question whether my assumptions are correct ;-)
Hooray, we found the root cause which leads to the bugs described above: if an autoupdater mirror is not reachable and the next one is tried, the config backup fails. This happens especially if DNS resolution fails.
Also, I created a patch/PR for sysupgrade; see the comment above: https://github.com/freifunk-gluon/gluon/issues/1496#issuecomment-410398547
@rotanid I'd not go as far as calling that the root cause. It's more like the scenario where the bug really surfaces. The root cause is still buggy libuclient closing STDIN which is now fixed in https://git.openwrt.org/?p=project/uclient.git;a=commit;h=ae1c656ff041c6f1ccb37b070fa261e0d71f2b12 .
@Adorfer @TobleMiner I just pushed an update to the v2018.1.x branch to update to current upstream lede-17.01, which contains the updated/fixed uclient.
I think we should leave this issue open at least until we/I/someone has also updated the master branch to current openwrt-18.06.
As far as I understood, libuclient is used 1) directly by the autoupdater and 2) by busybox.
Does this LEDE bump cover both of them?
Additionally, I would opt to have error handling in the autoupdater in case sysupgrade returns with an error code. (I assume that this is not the case currently -> freeze of the device, needs to be power-cycled.
Please correct me if I am mistaken.)
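A rough sketch of such error handling (a hypothetical wrapper, not the real autoupdater code; run_sysupgrade stands in for invoking /sbin/sysupgrade):

```shell
#!/bin/sh
# Sketch of the error handling proposed above: inspect sysupgrade's exit
# status instead of assuming the call never returns.
run_sysupgrade() { return 1; }   # hypothetical stand-in; forced failure

if run_sysupgrade; then
    echo "ok" > /tmp/au_status               # normally the device reboots here
else
    # a real implementation would release the lock and schedule a retry
    echo "failed, keeping current firmware" > /tmp/au_status
fi
cat /tmp/au_status
```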
Is this bug also present in 2018.1? I got a report regarding this and there is some uncertainty. We plan to roll out 2018.1 shortly; this bug would mean an immediate stop of the rollout for us and would certainly be relevant to communicate.
I asked around, and the reported case occurred while using the web interface updater. But the person is not totally sure whether the checkbox was activated or not. (And it would have been a case of 2017.7/8 to 2018.1.)
If I interpret the bug correctly, it is only triggered since the newer firmware (2018.1) and depends on the old firmware (where the autoupdater is run).
"Is 2018.1 also affected by this bug?"
Please have a look at the initial post of this issue. "2018.1.x" is meant as "Gluon branch 'v2018.1.x'". A description in German can be found at https://forum.freifunk.net/t/gluon-2018-1-x-autoupdater-configverlust-gefahr/19322
Hi,
sorry for my misunderstanding. I was just unsure whether 2018.1.x includes 2018.1 as well; there was a discussion about it. I did not see the topic on the Freifunk board before, as I was only following this thread for a couple of days, which resulted in internal discussions.
Your post there is quite clear, thank you. I just wanted to eliminate any misunderstanding. Sorry for the noise ;-). If you prefer, I can delete those messages to keep the thread clean.
@Adorfer no, the busybox tar update has to make it into OpenWrt first; it is not covered yet. But with a fixed uclient, the busybox tar mini-flaw is no problem anymore, as far as I understand - my testing with updated uclient didn't suffer from the issue anymore. So every other change apart from the uclient fix (busybox tar, sysupgrade, autoupdater handling the sysupgrade exit code) is NOT necessary to fix the issue we're commenting on at the moment (but will be done eventually).
fixed via 0cb98882d05924878775196fcdfa600c3d846215 15 days ago.
Just for confirmation: is it now considered safe to update to 2018.1? I'm not sure I understand correctly whether the issue only occurs when updating from the broken version onwards. Should I rather stay on 2017.1.x and update to the next Gluon release, or is it safe enough to go to 2018.1 as released before?
1) Do not upgrade to 2018.1.x before this fix. 2) If you have units with early 2018.1.x or late 2017.x (not sure where this started a few months ago), you may already be affected by the aftermath. So consider the steps described in the linked forum post. (If you are not familiar with German, please ask for a translation.)
@Adorfer what makes you think v2017.1.x is affected? @wrosenauer if you don't know how to use the up-to-date branch (rather than the tag/release), please wait for the v2018.1.1 tag/release.
~20 of our nodes are already running on the v2018.1.x branch. Until recently they ran builds based on the v2018.1 release tag. When updating to ea9a69f7da94ca6c8b2f4a653c807c78f17e23ec, 4 of those nodes lost their configuration during the update (so they start into config mode).
These were different devices (Futro S550, Archer C7, UAP AC Mesh, Archer C59) from two different targets (ar71xx, x86). I was able to reproduce the issue using a Futro, and more easily using a virtual machine (x86-64): I downgrade to the v2018.1 build via sysupgrade, and after that I upgrade via autoupdater to the build based on ea9a69f7da94ca6c8b2f4a653c807c78f17e23ec. The configuration loss never happens with the downgrade, only when doing the autoupdater upgrade, and even then not every time.
here's the log before the node reboots into the updated firmware:
here's the boot log after this: