MichaIng / DietPi

Lightweight justice for your single-board computer!
https://dietpi.com/
GNU General Public License v2.0
4.76k stars 495 forks source link

System | Service start/boot timeouts due to low "entropy" #2806

Closed PeterLacknase closed 5 years ago

PeterLacknase commented 5 years ago

ADMIN EDIT

To check if your system is affected

dmesg | grep 'crng init'

G_DIETPI_VERSION_CORE=6 G_DIETPI_VERSION_SUB=24 G_DIETPI_VERSION_RC=0 G_GITBRANCH='master' G_GITOWNER='MichaIng'

stretch Linux dietpi 4.14.98-v7+ #1200 SMP Tue Feb 12 20:27:48 GMT 2019 armv7l GNU/Linux RPi 3 Model B+ (armv7l) 5V 2.5 A (CE) SanDisk Ultra 64GB

Additional Information (if applicable)

php7.3-fpm

updated from php7.0-fpm (during diet-update)

can be replicated on fresh install

Steps to reproduce

  1. install/update
  2. reboot

Expected behaviour

php7.3-fpm should start within some ms (not minutes)

Actual behaviour

systemctl start php7.3-fpm

Extra details

seems to be an entropy-problem

see:

solution

sudo apt install haveged

see http://www.issihosts.com/haveged/

(the one and only alternative afaics would be downgrading kernels at the moment)

extra comment

probably affects some more services ...

diveyez commented 5 years ago

haveged is required btw read th4e documents for php-fpm 7+

PeterLacknase commented 5 years ago

It‘s no dependency, should be then.

MichaIng commented 5 years ago

@PeterLacknase @diveyez Hmm yeah a similar issue was reported already, although not during service starts but on earlier boot stage: https://github.com/MichaIng/DietPi/issues/2760

Not sure why this issues arrive now. Seems to be some recent patch of Debian that created additional issues with this and v4.19 will ship a solution (expected soon for RPi).

However this is just relevant for headless systems with low user input. php-fpm itself does not use haveged directly (thus no dependency) but it seems to require a relatively large amount of entropy for it's startup. And on headless systems the entropy pool seems to be too low or does not fill fast enough. So haveged assures it is always filled.

I am glad we found this as a reason for the service restarts as well, at least we know a solution now. Since there are not yet many reports, I think we will not implement this by default into DietPi. But e.g. we can add an info via MOTD (in DietPi login banner) to try G_AGI haveged when facing boot hangs in combination with random: ... messages.

@PeterLacknase Btw does this indeed also solve the long lasting systemctl restart php7.3-fpm?

PeterLacknase commented 5 years ago

Yes. It solves that problem as well. You are right: Apparently headless systems are more prone to that lack of entropy. On all my SBCs hdmi, peripherals, gettys etc. are completely disabled. (Apart from one Ubiquity-machine with serial console.) So a hint for those kind of use cases would probably suffice.

MichaIng commented 5 years ago

Did some search:

One guy stated indeed to make haveged an essential packages (being installed on all Debian systems by default and cannot be removed without forcing it interactively) and should be even loaded into the initramfs (where used).

I didn't read every follow up completely, but indeed the issue seems to be present on Raspberry Pis especially, since that board ships not much hardware sources for entropy: https://lists.debian.org/debian-devel/2018/12/msg00293.html

I am wondering why I never faced this on my RPi2 🤔.

@PeterLacknase @diveyez If you are in mood, could you send a dietpi-bugreport and paste the ID here? So I can inspect the setups (installed software, startup services and such) to see if there are more similarities than just the board. Some more users ran into the same issue (in other GitHub issues), so I hope I can get at least a small pool (probably growing) of systems to compare.

On the development board, discussion stopped at end of 2018 without result... RPi kernel 4.19 seems to not have solved it, at I have another report where the kernel was already upgraded.

So far it seems that especially PHP7.3-FPM is affected, so probably, as long as Debian does not come up with a long term solution, we need to install haveged by default together with PHP, although I am not happy about an additional process and ~20M disk space 😒.

PeterLacknase commented 5 years ago

arm7 (rpi 3b+): 2d9a4ee6-fe58-495e-bb38-d5fd750fd204

arm6 (rpi zero w): 6dd50815-7b78-423b-90e6-a3e7e24a4854

MichaIng commented 5 years ago

@PeterLacknase Many thanks. haveged was required on both systems to solve service start hanging?

PeterLacknase commented 5 years ago

Yes. (PHP7.3-fpm, that is.)

diveyez commented 5 years ago

I will do the rpi zero w again this week to get you what you need. =)

Im getting undervolts and post hangs. I cannot even get to terminal to pull logs. Something is definitely wrong with preboot tasks, and its not just dietpi, its debian itself I feel. I am gonna need to reformat and go at it again. Will update tomorrow.


dietpi armv7 rpi3b+ was actually using php-fpm7.0 in my case, it appears debian took a long vacation with apt and confused her. The fix is going to be avoiding using php until that arm repo is patched (months)


Two things, the packages were likely compiled with haveged and apparmor listed as dependencies by accident, and pool.ntp.org is a liar. I noticed my clock was sent back into the past.

Is there really a way to fix this without building php ourselves?

diveyez commented 5 years ago

Images URL: https://github.com/diveyez/build.deb/tree/master/debug I really dont want to keep bothering you Michal but, you should have this photo, I also do not see a need to put this on another issue, I had the same issue as Peter did on this build/board, however after installing haveged and the dependencies with it, I am stuck here unable to escape to anywhere else but tty3 or tty4

PeterLacknase commented 5 years ago

Just for comparison: on my systems serial gettys are completely disabled. I‘m using screen sessions via ssh.

(Should work with serials nontheless.)

MichaIng commented 5 years ago

@diveyez Okay, so:

Please paste:

And I see something about PHP7.0 there. Your system should have been migrated to PHP7.3, Did you install PHP outside of DietPi-Software?

And as a last suggestion:

diveyez commented 5 years ago

Crypto was removed from PhP in .3 The processes I run see this as an insecure boot, and hang, everything up. A friend found the issue this morning and pointed it out. I didnt even notice that they did that. But it makes sense as to why my system rolled back when 7.3 was dropped in.

Will post lines when home. Thanks!

MichaIng commented 5 years ago

@diveyez I am still wondering what you mean with "Crypto", the following cryptography modules are available for PHP: https://www.php.net/manual/en/refs.crypto.php mcrypt was dropped (marked as deprecated) with PHP7.1 already.

diveyez commented 5 years ago

mcrypt is a dependency for my sanity check processes. This is the cause. I am left with the option to rebuild to fix this. No worries.

MichaIng commented 5 years ago

@diveyez Ah jep that's true. If possible it is recommended to migrate to use the openssl module.

diveyez commented 5 years ago

In the works, but first I have to rebuild unfortunately because my secops code literally kills ssh if php is vulnerable and I cannot migrate cleanly anymore after all the keybashing.

MichaIng commented 5 years ago

Related: https://github.com/MichaIng/DietPi/issues/2871 And: https://github.com/MichaIng/DietPi/issues/2870#issuecomment-497154708

An alternative to haveged: rng-tools

maartenlangeveld commented 5 years ago

@MichaIng

Experienced the same problems (extremely slow start-up, followed by a very slow start of all services) on my RPi Zero W after upgrade from v6.22, have therefore also installed haveged and since all seems to have returned to normal.

Since I read in another (closed) issue that you want to collect info on this issue I also sent a bugreport, hope it helps. Here's my dietpi-bugreport ID:

root@DietPi:~# dietpi-bugreport [ OK ] DietPi-Bugreport | Root access verified. [ OK ] DietPi-Bugreport | RootFS R/W access verified. [ INFO ] DietPi-Bugreport | Packing upload archive, please wait... [ OK ] DietPi-Bugreport | Connection test: ssh.dietpi.com [ OK ] DietPi-Bugreport | Bug report sent, reference code: e817deae-ad83-4663-85d4-32e559223eef

Same applies also to my NanoPi NEO2 SBC, let me know if you would like a bugreport from this device also.

MichaIng commented 5 years ago

@maartenlangeveld Many thanks.

Same applies also to my NanoPi NEO2 SBC, let me know if you would like a bugreport from this device also.

Jep that would be great, if you don't mind.

maartenlangeveld commented 5 years ago

I don't mind :-) Hereby (NanoPi NEO2) bugreport ID

root@DietPi-DNS:~# dietpi-bugreport [ OK ] DietPi-Bugreport | Root access verified. [ OK ] DietPi-Bugreport | RootFS R/W access verified. [ INFO ] DietPi-Bugreport | Packing upload archive, please wait... [ OK ] DietPi-Bugreport | Connection test: ssh.dietpi.com [ OK ] DietPi-Bugreport | Bug report sent, reference code: 1651785e-3477-409e-9388-3a79306c80a6

diveyez commented 5 years ago

I am getting a massive issue relating to NTP desync due to Softclock emulating Hardwareclock with the Rpi3bv and DietPI. Is there a way around this?, because I cannot even post boot at this point.

MichaIng commented 5 years ago

@diveyez

NTP desync

NB: Sorry fore being most probably a bid more verbose below than required for you, as I think you already know most of this.

You mean NTP or RTC? Because these are two different things:

Since RPi does not have a real RTC, it has the fake-hwclock package pre-installed. It just saves the time on shutdown to a file and loads this timestamp as system time on boot. So early boot messages will still have wrong timestamps (no change to get around this!), the ones from last shutdown, whenever this was, but at least they are not dated to 01/01/1970 😉.

After network is up, the system time will be updated of course, but no change to control the timestamps before that.

So could you specify which of the above did not work as expected?

MichaIng commented 5 years ago

Further reading about the topic: https://unix.stackexchange.com/questions/442698/when-i-log-in-it-hangs-until-crng-init-done#442744

On RPi:

On VM Jessie:

root@VM-Jessie:~# dmesg | grep random
[   70.189616] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available)
[   71.338422] random: systemd: uninitialized urandom read (16 bytes read, 30 bits of entropy available)
[   71.547181] random: systemd-sysv-ge: uninitialized urandom read (16 bytes read, 39 bits of entropy available)
[   72.085517] random: systemd: uninitialized urandom read (16 bytes read, 68 bits of entropy available)
[   72.086023] random: systemd: uninitialized urandom read (16 bytes read, 68 bits of entropy available)
[   72.086609] random: systemd: uninitialized urandom read (16 bytes read, 68 bits of entropy available)
[   72.107337] random: systemd: uninitialized urandom read (16 bytes read, 68 bits of entropy available)
[   72.109459] random: systemd: uninitialized urandom read (16 bytes read, 68 bits of entropy available)
[   72.109521] random: systemd: uninitialized urandom read (16 bytes read, 68 bits of entropy available)
[   72.141338] random: systemd: uninitialized urandom read (16 bytes read, 69 bits of entropy available)
[   73.472029] random: nonblocking pool is initialized

On VM Buster:

MichaIng commented 5 years ago

Bug report on Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=898814

So as can be derived above, the random seeding is done muuch faster at boot with haveged. However on my test systems this has no effect on boot or service starts, even with php7.3-fpm installed. So the real question is in which cases boot or services wait for crng init done and why nothing waits for it on my testing systems.


BOOM: I was always wondering why my Buster VM after most boots at first has no network access. It has but it waits very long on the first connection attempt. Afterwards everything works fine. Now I checked and indeed the download starts right after random: crng init done! So on Buster network connections seem to wait for this. This error was already annoying me for a long while. Service starts worked very well, but only this network hang happens.

MichaIng commented 5 years ago

Checking influence of attached virtual hardware on VM (VirtualBox)

No display, No USB, no audio, no serial, 1 drive

root@VM-Stretch:~# dmesg | grep random
[    1.606379] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[    1.606712] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[    1.606768] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[    2.678331] random: fast init done
[ 1024.510594] random: crng init done
[ 1024.510796] random: 7 urandom warning(s) missed due to ratelimiting

No display, No USB, no audio, no serial, 2 drives

[  997.893151] random: crng init done
[  997.893177] random: 7 urandom warning(s) missed due to ratelimiting

Display, USB, audio, 1 serial, 2 drives + 1 ROM

[  209.549809] random: crng init done
[  209.549830] random: 7 urandom warning(s) missed due to ratelimiting
MichaIng commented 5 years ago

Issues seems to be especially present on Raspberry Pies and virtual machines: https://lists.debian.org/debian-devel/2018/12/msg00293.html EDIT: Ah nope, Odroid mentioned above as well and my Notebook also has very late random init without haveged.

So if we add this by default, then only for those two deices for now.

What I aim to add as well is a log for the crng init done event, although don't know yet how to track, perhaps udev can track it. Then adding this (simple timestamp in seconds after boot) to dietpi-survey and calculate average values for each device on report page. So we get an overview. Also DietPi-Banner MOTD can show a warning/hint when the value is e.g. > 10 seconds.

It is quite a pain that related issues are that drastic, hanging boot, failing services, and no real logs that allow to track down/debug the issue to entropy. This is nothing that I want to ship to our users, regardless of Debian arguments/security concerns.

MichaIng commented 5 years ago

Checked the bug reports. Many thanks again for all that send one after entropy being identified as issue!

All systemd have PHP7.3 installed, so indeed this seems to have a high demand of entropy. But not all PHP7.3 instances are affected by service (re)start hangs. Half of reported systems have OpenVPN installed, half unbound, half netfilter-persistant (overlapping in cases). It seems that those as well have a high entropy demand, in case of OpenVPN expectedly. So these combinations seem to trigger the significant PHP7.3 hang on boot.

At least one system is not headless, has xserver + desktop installed. So we cannot assume that only headless systems are affected, even that as tested above keyboard input seems to full the entropy pool. We want the system + services to start without hanging until login prompt, so before any keyboard input.

Another issue is plymouth. It is installed on half affected systems as well and seems to cause the boot hanging, probably outside of PHP7.3. It contains of 13 static systemd units, being called on boot and shutdown/reboot/halt, invoking each other.

What especially took my attention is:

cat /lib/systemd/system/multi-user.target.wants/plymouth-quit-wait.service
[Unit]
Description=Hold until boot process finishes up
After=rc-local.service plymouth-start.service systemd-user-sessions.service

[Service]
ExecStart=-/bin/plymouth --wait
Type=oneshot
TimeoutSec=0

And finally when searching the forum for related posts: https://dietpi.com/phpbb/viewtopic.php?p=7598#p7598

WARN: Low entropy detected. We recommend you to install `haveged'
wlan1: interface state UNINITIALIZED->HT_SCAN
20/40 MHz operation not permitted on channel pri=1 sec=5 based on overlapping BSSes
Using interface wlan1 with hwaddr XX:XX:XX:XX:XX:XX and ssid "DIETPI"
random: Cannot read from /dev/random: Resource temporarily unavailable
random: Only 17/20 bytes of strong random data available from /dev/random
random: Not enough entropy pool available for secure operations
WPA: Not enough entropy in random pool for secure operations - update keys later when the first station connects

And not to forget general network hang on Buster until random crng init done.


Putting all together the causing and affected services are wide and related issues often critical from end user perspective. Although headless RPi systems are mostly affected, also Odroids, VMs and non-headless systems can be. And there is no reliable way to detect if the hardware is able to produce enough entropy and the kernel even does it, e.g. Debian default kernel does not trust CPU as entropy source (see above links). Issues are very hard to debug as mostly no related logs are produced besides the late random crng init that one must associate with the issues by hand.

=> We will install haveged on all DietPi systems from now on. The process takes less than 1M memory (the disk usage is again much smaller) and the benefit of assuring that none of the above issues occur simply outweighs it. => As said somewhere above, rng-tools is an alternative but it depends on the hardware ("some Intel/AMD/VIA chipsets") which we cannot rely on.


DietPi-PREP: https://github.com/MichaIng/DietPi/commit/66c32d39e5eb80fe0b22b22c20fa2e710ac1af44 DietPi-Patch: https://github.com/MichaIng/DietPi/commit/ed7221d4fe606a915a60ad0ae07abf5d1601d611 Changelog: https://github.com/MichaIng/DietPi/commit/333a3d823837a3f7d1f54228d94c8b3650c41b92

MichaIng commented 4 years ago

Just as I stumbled above:

Good to know its well known on Raspbian as well, leading to the fact that rng-tools is pre-installed on Raspbian builds since April.

diveyez commented 4 years ago

MicroSD card reader broke, I am probably not going to be working with Pi's and SBC's for a few weeks. Will check back in a few weeks...