Closed bhush9 closed 6 years ago
CC: @shadeslayer @llelectronics
any idea about this live-config weirdness?
How does this compare to the xenial img?
Other than the long delayed live.config entry, is the rest about the same?
I think there is extra delay in bionic image after cursor is visible... but I want to sort out live-config issue first and then compare with xenial image.
Not solved yet, but adding my notes..
modify kernel command line: console=ttyS0,115200n8 no_console_suspend console=tty0 mac_addr= root=/dev/mmcblk0p1 rw rootwait boot=live components hostname=live-pc username=live cma=384M live-config.debug systemd.log_level=debug
modify the sytemd journald config to have storage=persistent.
Full journal at https://bshah.in/journalctl-delay
Reading journal;
May 17 10:02:34 live-pc gpasswd[712]: user live added by root to group netdev
It tells following.
May 17 10:02:55 live-pc systemd-timesyncd[497]: Resolving ntp.ubuntu.com...
May 17 10:02:55 live-pc systemd-journald[159]: Forwarding to syslog missed 3314 messages.
May 17 10:02:55 live-pc systemd[1]: systemd-timesyncd.service: Got notification message from PID 497 (STATUS=Idle.)
right after that following lines repeat for 2 minutes and eventually ntp service gives up
May 17 10:02:55 live-pc systemd-resolved[492]: Got DNS stub UDP query packet for id 57735
May 17 10:02:55 live-pc systemd-resolved[492]: Looking up RR for ntp.ubuntu.com.localdomain IN AAAA.
May 17 10:02:55 live-pc systemd-resolved[492]: Sending response packet with id 57735 on interface 1/AF_INET.
May 17 10:02:55 live-pc systemd-resolved[492]: Processing query...
May 17 10:02:55 live-pc systemd-timesyncd[497]: Failed to resolve ntp.ubuntu.com: Temporary failure in name resolution
Flash image
edit /etc/live/config.conf.d/neon-users.conf
-LIVE_USER_DEFAULT_GROUPS="audio cdrom dip floppy video plugdev netdev powerdev scanner"
+LIVE_USER_DEFAULT_GROUPS="audio cdrom dip floppy video plugdev powerdev scanner"
This can workaround the issue, maybe
Tested:
boot1 - pass boot2 - fail (different SD card) [This waits forever (about 5 minutes)] boot3 - pass [check if it got tripped over by systemd-timesyncd running forever? - no] boot4 - pass boot5 - pass
[ pass being : it boots within normal time-frame and doesn't wait forever ]
Image: https://files.kde.org/neon/images/pinebook-remix-nonfree/bionic/20180521-0817
The image above contains the workaround mentioned above but first boot out of that image seems to fail for me..
boot1 - fail
So now investigating again with debug log and journal in hand.
Adding information about how to get logs,
set_cmdline=setenv bootargs console=${console} console=tty0 ${optargs} mac_addr=${ethaddr} root=${root} rw rootwait boot=live components hostname=live-pc username=live cma=384M systemd.log_level=debug live-config.debug=true
sudo journalctl
CC: @adriaandegroot
Notes from me:
delay seems to be in particular in /var/lib/live/0030-user-setup
This user setup script is an absolute load of Debian. The script to set a password involves randomly dd'ing a drive :/
Now I can add timestamps all over this file and bisect down till we hit the exact spot with the block, and come up with a fix from there.
Supercool you spotted this, if turns out the culprit.
They delay happens when we call
"adduser live powerdev"
in this for loop
for group in $RET; do
$log $chroot $ROOT adduser "$USER" $group >/dev/null 2>&1 || true
done
Others are instant, this one that doesn't exist takes exactly 90s
My lazy workaround was to replace
a few lines above. I'm sure you can do an alternative better fix.
There's a bit to finish working out. It doesn't entirely make sense as "adduser david asdfasdfasdf" does exit immediately. I have a working possible theory on why it blocks, but not proven yet.
They delay happens when we call
"adduser live powerdev"
I think this is false alarm, see my first comment where I had concrete data suggesting adding user to netdev causes the delay, and before that it was bluetooth user
That said I can give this try locally and see where it takes me.
I had concrete data suggesting adding user to netdev causes the delay
In your first comment we have
a timestamp of when it finishes successfully adding netdev a timestamp of when it finishes successfully adding scanner
If anything it proves it's not adding netdev, but whatever happens between netdev and scanner
Which assuming it's alphabetical could still match what I said
Your patch still has the user being added to the powerdev group
Continued investigation on exactly why we block
strace finally gave me something useful
connect(fd =4, "/run/dbus/system_bus_socket") fstat(fd=4) recvmsg(4) returns EAGAIN ppoll(fd=4) timesout after 90s
FYI, my patch didn't actually made it to last image.. sorry
This DBus call comes from nss-systemd which is linked to by the adduser perl script.
From a glance of the code it seems when a group doesn't exist it tries checking dynamically to get the most up to date information.
So to summarise, I'm pretty sure I've found everything and it all makes sense.
There are two minor bugs that would be fine on their own but together cause an issue.
Our install scripts try to add a user to a group that doesn't exist. This is obviously wrong and can be fixed by not adding a user to powerdev
When a group is not found, adduser queries systemd-nss to see if it's available dynamically at runtime. Depending on how fast the rest of the system started (our race) the system dbus we're querying on might not be set up properly and will hang.
This can be fixed by exporting SYSTEMD_NSS_BYPASS_BUS=1 in the script or just not installing libnss-systemd. Bshah implied it was a Neon specific patch adding it.
On my local system either mod works.
And this fixes the 2 minute delay?
Issue is locally fixed but needs to be packaged in image.. I will roll image later with fix (later today evening or tomorrow).
-- Bhushan Shah http://blog.bshah.in IRC Nick : bshah on Freenode GPG key fingerprint : 0AAC 775B B643 7A8D 9AF7 A3AC FE07 8411 7FBC E11D
https://files.kde.org/neon/images/pinebook-remix-nonfree/bionic/20180531-1230/ should be fixed in this image.
(Current fix is to not add user in powerdev group, will remove systemd-nss package in next image tomorrow).
When booting image
20180507-1433
there is mysterious delay of 2 minutes in live-config service which can extend upto 3+ mnutes in some cases.live-config service takes aroud 3 minutes which is not normal at all.. in journalctl one can see it "skips" two minutes..
Which I had earlier thought to be addition of live user in bluetooth group and tried following which seemed to work but just when I started writing this report it turns out it happens even before the bluetooth adduser call, so that theory is busted.
More investigation pending.