RobertCNelson / omap-image-builder

omap image builder
http://elinux.org/BeagleBoardUbuntu
MIT License
152 stars 201 forks source link

AM335x Debian Bullseye image boot is extremely slow compared to Stretch #138

Closed willeccles closed 3 years ago

willeccles commented 3 years ago

(Apologies if this is not the proper place for this issue, but I am not really sure where it belongs.)

Previously, we were using the pre-built eewiki Debian Stretch image (I believe 9.8 or so). After modifying the eewiki Debian Buster build config to support Bullseye, I am getting the following boot times (abbreviated):

1min 10.947s ifupdown-pre.service                
     57.756s dev-mmcblk0p1.device                
     16.160s generic-board-startup.service       
      4.910s nginx.service                       
      4.534s user@1000.service                   
      3.955s dnsmasq.service                     
      2.672s systemd-udev-trigger.service        
.....

From what I can tell, it looks like ifupdown-pre.service is waiting for can0 to come online, which takes a very long time (not sure if related, but it is significantly longer than on Stretch). Also of note: haveged fails to start properly, as it apparently uses the uname(2) syscall, which is filtered by its systemd unit's seccomp filter. For the time being I have removed the filter, as in the future it's likely I will remove haveged unless there is a good reason to keep it around (I see no difference in the system with or without it). Also, this boot time does not depend on the SD card used. Whether I boot from MMC or eMMC it takes just the same amount of time.

Again, I'm not sure exactly where the right place for this issue would be, but unfortunately I need to update this system to Bullseye and am unable to make it boot in an acceptable amount of time. Any thoughts?

Additional Details

The system is based on an OSD3358 on custom hardware. The kernel is 4.19.180-bone-rt (built with bb-kernel). We are using the same u-boot image and kernel image as before (when using Stretch), which is version 2019.04 plus various patches (our own + eewiki).

willeccles commented 3 years ago

I'm going to try upgrading to kernel 5.10-bone-rt, as it seems upgrading the kernel has helped such situations in the past (https://github.com/RobertCNelson/linux-dev/issues/45). I know this is rather bleeding edge for such a system, so I was expecting some growing pains, but we'd really like to be able to ship this system with Bullseye on it, so I will keep trying.

RobertCNelson commented 3 years ago

HI @willeccles , sorry i haven't really started generating Bullseye images yet.

Now that we are in the hard freeze, https://wiki.debian.org/DebianBullseye i'll start looking it building it..

i've got a couple devices running it, have you tried v5.4.x-rt or v5.10.x-rt ?

willeccles commented 3 years ago

I have just gotten 5.10.21-bone-rt-r28 built and flashed. It appears that one GPIO chip was not successfully added, so it's likely our device tree requires updates from kernel 4.19 to 5.10 (I already had to change one thing to get it to successfully build the dtb, so it's not out of the question). For the moment, our image-builder project configuration is almost entirely identical to the eewiki-minfs-buster one, except for one additional package.

Also of note, regarding the original post: haveged seems to work out of the box on kernel 5.10, so it's likely that the seccomp issue I encountered with it previously had to do with using an older kernel than it was built for.

willeccles commented 3 years ago

Additionally, our boot is still extremely slow. It appears also that ifup@eth0 is failing: RTNETLINK answers: file exists. This might have something to do with connmand. It also appears to be waiting quite some time for serial-getty@ttyGS0, since /dev/ttyGS0 does not exist.

RobertCNelson commented 3 years ago

yeah, libcomposit is slow on loading.. back in 4.19.x i had tried building it as built-in, to try and speed that up, but it was even more broken..

willeccles commented 3 years ago

Alright, on 5.10 I am unable to get GPIOs 0-31 to register. The kernel repeatedly posts the same log message PM: Cannot get wkup_m3_ipc handle. This happens a few times, and then after the GPIOs have failed and many other devices have been brought up, it then says wkup_m3_ipc 44e11324.wkup_m3_ipc: CM3 Firmware Version = 0x192. I've included an image showing the relevant sections of the boot log. Any idea what could be causing this? I am not sure if these two things are related, but they seem quite coincidental if not. image

willeccles commented 3 years ago

Ah, never mind. That's -EPROBE_DEFER, and you can see it is deferring until later and successfully setting up. Very strange that our hardware isn't working then.

willeccles commented 3 years ago

@RobertCNelson I have spent the last few work days poking around trying to figure out why our boot was so incredibly slow. First, I removed most of the generic-board-startup stuff (not because it was inherently extremely slow, but because we don't need it for our purposes). Then, I messed with our network and TTY config. I got the boot down to about a minute and a half, which was still wildly unacceptable, especially compared to our old image using Stretch. However, after trimming down the packages and testing, it turns out that the rules /etc/udev/rules.d/*noroot.rules are extremely detrimental to performance. After removing those, our boot is down to 20 seconds, which is much closer to where we want to be. After checking with udev's debug logs (which are needlessly verbose and spammy), it turned out it was doing so much chmod-ing and chown-ing that it was fighting the rest of the system for resources, and as a result it was taking ages (up to a minute and three-quarters or so) to bring all devices online, resulting in everything else waiting for devices to come online. The biggest ones were network devices (eth0 and can0) and ttyS0, which took ages to come online, and sometimes the services waiting for them would time out before the devices were brought up.

I am not sure where the repository for these rules is, but if you can point me to it I would gladly open an issue or something like that if necessary. At the very least, this would be good to have documented somewhere, as the boot times are unacceptable for many (I have seen this issue countless times on the internet, but I think I am the first person to find this solution).

I think it's entirely possible that these rules were updated since the Stretch ee-wiki image, or something may have updated with udev to make this process so much slower. It does seem that it was doing a lot more work than it had to, which may be a side effect of the kernel upgrade.

Side note: I did not ever fix the gpiochip situation, but I did figure out that it is in the wrong order consistently, so I adjusted our pin macro to accommodate (instead of c * 32 + p it's something like ((c + 3) % 4) * 32 + p). This is not related to this issue, but if someone finds this issue in the future while searching for a gpiochip ordering issue, there's my band-aid for that problem pending a proper driver patch.

willeccles commented 3 years ago

Looks like disabling the repo_rcnee option fixed this, so I am assuming the rules are from bb-customizations, though I did not check with apt to figure it out, so YMMV.

willeccles commented 3 years ago

Okay, so to summarize my findings for the future if people are looking to use Bullseye:

  1. At the moment, kernel 5.10 is not supported by TI, so if you have any issues, do not expect them to help. When asking about the GPIO chip issue, I received a response that effectively said that they don't support anything there is no TI SDK for and if I couldn't reproduce the issue on 4.19 they would not help me. Upgrade with caution, this is unsupported territory for TI EVMs, at least.
  2. The udev rules added by bb-customizations are abysmal for boot time and system performance as a whole, so it's likely that they are somehow incompatible with a newer udev. When I enabled debug logging in /etd/udev/udev.conf, it was overflowing my kernel log every 2 seconds with how often it was calling chown, ghgrp, etc.
  3. Bullseye with 5.10 is functional on an OSD3358 in general. Note that some things in configurations will have to change: between 4.19 and 5.10, the default device trees for the AM33xx and relevant boards were completely overhauled, so if you include them you will likely have to adapt your device tree a tad. I believe the way the GPIO is defined in arch/arm/boot/dts/am33xx-l4.dtsi is responsible for making the GPIO come online incorrectly and be improperly ordered, though I have not tried modifying this file to fix it, so that may be wrong.

Otherwise, I have had no issues and have simply had the usual growing pains that come with upgrading Linux/Debian versions. The build scripts still work well, though there is one issue, which I will open a separate issue about.