Openvario / meta-openvario

Official OpenEmbedded layer for Openvario flight computer.
http://www.openvario.org

Device freezes #304

Closed hph304 closed 1 year ago

hph304 commented 2 years ago

I have a 7" OpenVario with sensorboard. After between 5 seconds and 20 minutes of operation, the screen goes black (or white, or green, or blue...) and the whole device stops responding. I am running image 21118 for the CH070 screen.

After installing 17119, the issue seems to be gone, so there must be something wrong on the software side. I can pull some logs if that helps, but I will need some guidance on where to find them on the device.

hph304 commented 2 years ago

I tried v22086, which produces the same issue. The only stable version for me is 17119.

OBrown92 commented 2 years ago

We have the same issue in our club glider. We powered the device with a separate battery (new) but it crashes after about 2 hours. 17119 works fine.

linuxianer99 commented 2 years ago

@OBrown92: Which type of system do you have? Also CH070?

@OBrown92 @hph304: I assume XCSoar is running during these 2 hours? Can you try just leaving the system in the start menu? (I want to eliminate the main issues ...)

mihu-ov commented 2 years ago

We should also eliminate any potential hardware issues.

@hph304 @OBrown92 Can you please elaborate on your OV hardware? DIY soldered or SteFly? Which type of DC/DC converters? Old images with 3.4 kernel had lower power consumption which may make a difference on some systems.

OBrown92 commented 2 years ago

We got the same issues in two gliders with SteFly OV. We haven't modified the DC/DC converter yet, but it seems to happen with a modified one as well (reported in the XCSoar forum). At first we thought it was a power issue because it seems to happen when the radio transmits or receives, but we power everything from one battery, so this shouldn't be the issue. We are at a competition next week; if the weather is bad I can test some things, like keeping it in the start menu.

DanD222 commented 2 years ago

Do we know at what current the resettable fuse kicks in on the SteFly OV?

OBrown92 commented 2 years ago

Good question, I don't know exactly if and where it is. We have a spare SteFly OV, so I can try to figure it out.

hph304 commented 2 years ago

I can leave it running tomorrow for 2 hours or longer. Which start menu do you mean, the OV menu? Mine is a DIY, not sure about the converter but will open it up and have a look if I can find a part number.

My device is not yet built in. I have connected it to 2 different PSUs (Basetech and Delta). The device uses about 0.58 A during operation with the current image. Once the device freezes, there is a spike to about 0.7 A and it stays there.

With Image 17119 it uses 0.45A.

tb59427 commented 2 years ago

I seem to have the same issue with my vanilla Stefly 57. Should I piggy back on this issue or do you want me to open an additional one?

linuxianer99 commented 2 years ago

I think we should take a systematic approach... Maybe create a table where everyone can add the affected configuration, so we can eliminate some things.

Data we would need in the table:

  • Hardware variant (CH070, PQ070, etc.)
  • Is the device DIY or bought ready-built?
  • Image version used (21xxx, 22xxx, etc.)
  • In which state does the freeze happen (XCSoar running, or just in the text console (ov-menu))?
  • Are there any external effects (e.g. radio TX (EMC))?

Advanced debug:
  • Is the console still accessible (only a graphics/XCSoar freeze), or are network/serial port frozen too?
  • Is there any kernel panic in the logs? (They can be read out at the next reboot.)

tb59427 commented 2 years ago

Fair point, linuxianer99, so here we go

Data we would need in the table:
  • Hardware variant (CH070, PQ070, etc.): SteFly 5.7 inch, no sensor board.
  • Is the device DIY or bought ready-built? Kind of half: pre-built by SteFly and completed by me (didn't change anything), plus a rotary encoder from SteFly.
  • Image version used (21xxx, 22xxx, etc.): 22086 as linked on Stefan's web page; variod disabled.
  • In which state does the freeze happen (XCSoar running, or just in the text console (ov-menu))? So far I never tried staying in the menu or console but always started xcsoar, which eventually froze after anything from a few minutes to a couple of hours. Will check with just the console or menu running over the weekend.
  • Are there any external effects (radio TX (EMC))? No. The system is not in the glider yet. It sits on my desk and is powered via a 12 V transformer.

Advanced debug:
  • Is the console still accessible (only a graphics/XCSoar freeze), or are network/serial port frozen too? The system is completely frozen.
  • Is there any kernel panic in the logs (can be read out at the next reboot)? Will check after the next crash. Should be in /var/log, no?

OV so far has been connected via /dev/ttyS1 to an XCVario (w/ connected Flarm). I have now changed this to WiFi with a USB WiFi dongle.

Will test / have tested the following cases (may take some time, will update accordingly):
  (a) connected via /dev/ttyS1 and xcsoar running: system freezes [x]
  (b) connected via /dev/ttyS1 and menu running: system freezes [to be checked]
  (c) connected via /dev/ttyS1 and shell running: system freezes [to be checked]
  (d) connected via WiFi and xcsoar running: system freezes [to be checked]
  (e) connected via WiFi and menu running: system freezes [to be checked]
  (f) connected via WiFi and shell running: system freezes [to be checked]

tb59427 commented 1 year ago

I have created a little spreadsheet on google docs (not sure everyone here likes google - but it was the easiest for the moment) where people can document their freezes along the lines of @linuxianer99's request. Feel free to add yourself. Also: feel free to spread the word to those OV-users not reading this here.

lordfolken commented 1 year ago

Please start at the minimum for testing. So no devices connected and menu, then work up from there. Also reenable the serial console and see the last messages.

tb59427 commented 1 year ago

How do I enable serial console?

linuxianer99 commented 1 year ago

> How do I enable serial console?

It's always enabled ... just connect to Cubieboard port Serial 0

tb59427 commented 1 year ago

Preliminary results are in the sheet now. Seems my USB-serial adaptor is broken. Ordered a new one and will retest with console open.

mihu-ov commented 1 year ago

Before you trash your USB-serial adaptor, note that only error messages show up on the serial console after https://github.com/Openvario/meta-openvario/pull/265. The constant stream of error messages is no more.

tb59427 commented 1 year ago

Thanks for the hint @mihu-ov - but once I had it connected to OV I remembered that it wasn't working properly already a year ago, when I was debugging a pfsense router on a firebox.

115200 baud, 8N1 is the correct setting, though, isn't it?

mihu-ov commented 1 year ago

https://www.openvario.org/doku.php?id=projects:series_00:electrical_tests:serial_console_boot_log has some instructions that may be helpful.

lordfolken commented 1 year ago

Do we have a prompt on ttyS0? If so, you can log in and run journalctl -f to get all the log messages. Then let that run until it freezes and look at the last output.

tb59427 commented 1 year ago

Based on @mihu-ov's statement re: #265 it doesn't look like there's a login on that port...provided the necessary packages are built into the OV kernel and distro we could, however, possibly enable it and do as you suggested @lordfolken

tb59427 commented 1 year ago

Still preliminary - however: it appears 22050 is stable as well as 22028 (freevario version). 22086 is crashing. So most likely something must have happened between 22050 and 22086. Not familiar enough with github to find the associated changes in between these two releases. Maybe someone better at mastering github than me could do that.

mihu-ov commented 1 year ago

22050 is day 50 in year 2022 or February 19th and 22086 is March 27th. https://github.com/Openvario/meta-openvario/commits/master shows a kernel update from 5.15.24 -> 5.15.27 and "sensord: use systemd socket activation" (but you have sensord disabled according to your spreadsheet).
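
The image numbers used in this thread appear to follow a YYDDD pattern (two-digit year, then day of the year), as the decoding of 22050 and 22086 above suggests. A minimal sketch for converting such a number into a calendar date, assuming GNU date is available on the machine where it runs (the function name ov_build_date is made up for illustration):

```sh
#!/bin/sh
# Convert an image number of the form YYDDD into an ISO date.
# Assumes GNU date (for the "-d" relative-date syntax) and years 20YY.
ov_build_date() {
    build=$1
    yy=${build%???}            # everything except the last three digits
    ddd=${build#??}            # the last three digits: day of the year
    # Strip leading zeros so the shell does not treat the value as octal.
    ddd=${ddd#0}
    ddd=${ddd#0}
    date -u -d "20${yy}-01-01 +$(( $ddd - 1 )) days" +%Y-%m-%d
}

ov_build_date 22050    # prints 2022-02-19
```

This can help when matching an image to the commits listed on the master branch, since the commit list is ordered by date rather than by image number.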

tb59427 commented 1 year ago

So since sensord is unlikely to be the culprit this leaves us with the kernel update as a possible cause. How easy is it to revert the build system back to 5.15.24 but retain all other changes? If easy (and someone explains this to me in simple language) I could build an up-to-date image with the old kernel and run that as attempt for positive proof....

tb59427 commented 1 year ago

Ok, I figured the build process and built an image w/ kernel 5.15.24....let's see what happens...

Turns out that my ov image 22028 (from the freevario git repo) was built using kernel 5.10.2. So chances are that kernel 5.15.24 may not be the right version for a stable image.

Scumi commented 1 year ago

I filled my information into the spreadsheet. I flew 6.5h at the weekend (and the device was powered ~8h in total), no freezes anymore after disabling sensord, variod and pulseaudio. The Openvario was also more responsive. I am not at the latest master branch commit with my image, so I could try that (after our competition) to see if the Kernel as implied has something to do with it.

tb59427 commented 1 year ago

Added "pulseaudio" to the "running demons" section.

tb59427 commented 1 year ago

Quick request (to whoever entered line 13 in my table): could you please check which kernel version your OV is using (hook a keyboard to the OV, go to the OV menu, exit to the shell and type "uname -a", without the quotes)? Or anybody else using 22050... be sure to check on the system (and not based on what should be in the image :-) ). I was using an image called ..22028.. which in reality was 21350 with a kernel version of 5.10.2 (a relatively old kernel).

tb59427 commented 1 year ago

> I filled my information into the spreadsheet. I flew 6.5h at the weekend (and the device was powered ~8h in total), no freezes anymore after disabling sensord, variod and pulseaudio. The Openvario was also more responsive. I am not at the latest master branch commit with my image, so I could try that (after our competition) to see if the Kernel as implied has something to do with it.

@Scumi could you please check the exact version of your kernel on the OV where everything runs fine?

tb59427 commented 1 year ago

Quick wrap up after 5 days of testing various images and some contributions from other affected OV users (see the table for details):

Tests on my Stefly OV 5.7 (no sensor board)

  1. it appears newer kernels create the freeze. I have tested w/ 5.15.{24,27}
  2. it appears kernel version 5.10.2 runs very stable (for days). I have used a build from @August2111 (22028) to verify this as well as my own builds (with kernel 5.10.2)
  3. I then checked whether newer versions of 5.10 (it's an LTS kernel) would work, too. Tested with the latest 5.10.117 and it created the freeze after a fairly long run - but still there was a freeze.
  4. I checked menu-only, shell-only usages, too. They seem to run stable w/ no freezes.
  5. Enabling/disabling variod, sensord or pulseaudio didn't seem to make a difference

Tests from other contributors

  1. @August2111 has one 7" OV from a Czech source that runs fine w/ 5.15.27 kernel
  2. @Scumi has a 7" OV that also runs fine with a 22xxx image (and presumably a newer kernel)
  3. On the other hand @Scumi also has a 7" OV (however w/ different attached devices) that freezes

So what's the conclusion up until now? For my SteFly 5.7" I can say that building OV with a 5.10.2 kernel fixes all freeze problems. Interestingly when using a Wifi connection xcsoar connects in a split second using this kernel. Also it seems that the combination of xcsoar and the "wrong" kernel creates the issue. Not sure what that may mean for isolating the issue.
Also, with newer kernels Wifi connections tend to take unusually long - minutes (not exaggerating here). Maybe this is an indication towards the core of the issue.

For the 7" display OVs the conclusion unfortunately is rather obfuscated. I don't see the slightest pattern here.

Maybe the display size (or more correctly the way it is connected to the Cubieboard) is a factor here, too.

Next Steps

I am willing to take proposals from the group here. But I believe I will try and experiment a little with the various 5.10.x kernels (with x > 2 and < 117) to see whether it's possible to identify the build "where it all happened". A tedious and time-consuming process. If there are volunteers to support with testing builds w/ different kernels, let me know and we can split the chores.

That's it in a nutshell :-)

hph304 commented 1 year ago

I'm willing to help out as well with testing kernels once I get back from work. Should be after the weekend. I do need a guide, though, on how to build it on Windows 10.

tb59427 commented 1 year ago

Hi @hph304, which device are you using?

hph304 commented 1 year ago

It's a home-built OV with a 7" Texim and a sensorboard

kedder commented 1 year ago

Are the symptoms similar to #71? Do you see something like this in the logs?

lima 1c40000.gpu: gp error irq state=4
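
A small sketch of how one might search kernel log text for this kind of line. It assumes the pattern "lima <address>.gpu: gp error irq" taken from the example above; the function name has_lima_gp_error is made up for illustration. Reading the previous boot with journalctl -k -b -1 additionally assumes that the journal is stored persistently on the device, which has not been confirmed in this thread.

```sh
#!/bin/sh
# Read kernel log text on stdin and succeed (exit status 0) if it contains
# a lima GPU "gp error irq" line like the one quoted above.
has_lima_gp_error() {
    grep -Eq 'lima [0-9a-f]+\.gpu: gp error irq'
}

# Possible usage on the device after a freeze and reboot
# (only works if the journal of the previous boot was kept):
#   journalctl -k -b -1 | has_lima_gp_error && echo "lima GPU error found"
```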

tb59427 commented 1 year ago

@kedder it didn't strike me. I will check the next time I run into a freeze. You could also try it the other way around, though: build an ov image with kernel 5.10.2 and check for freezes. If my hypothesis is correct, it should work well. If not we might be back to square 1 :-)

tb59427 commented 1 year ago

> I'm willing to help out as well with testing kernels once I get back from work. Should be after the weekend. I do need a guide, though, on how to build it on Windows 10.

I'm afraid I am no help with Windows. Have been using Macs since 1984 and Unix boxes since 1984 starting with System 7. I guess you need a VM on Windows, install Ubuntu on it and then follow the guide here on git - at least that's what I did on my mac.

hph304 commented 1 year ago

I can run Linux on my chromebook so I'll give it a go on there

lordfolken commented 1 year ago

Some background on the mentioned error:
https://github.com/yuq/mesa-lima/issues/36
https://www.spinics.net/lists/kernel/msg3408560.html

tb59427 commented 1 year ago

@lordfolken - ok, will have to dig deeper into this to fully understand what these guys are talking about :-)

In the meantime I have added a second worksheet to my sheet to document our testing of the 5.10.y strain of releases. When I tested, 5.10.117 was the latest release (118 now) and it failed. 5.10.2 worked. So let's start in the middle (roughly) to see whether that works or not. Means, I suggest starting at 5.10.58 (I just built the image for my SteFly 5.7 - no gliding weather here today). And see. If it fails, we'll go down halfway to 5.10.30 and check that. If it works we'll go up halfway to 5.10.84 and so on.

@hph304 would be great if you could follow that logic, too and check on your 7" OV. If others can join in, highly appreciated. Just add a column for your tests on the second worksheet.
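
The halving strategy described above is a plain binary search over the 5.10.y patch level. A minimal sketch, under the assumption that every release below some patch level is stable and every release from that level on freezes; the function names and the simulated test are hypothetical, and in practice the "test" is building an image, flashing it and waiting for a freeze:

```sh
#!/bin/sh
# Binary search for the first freezing 5.10.y patch level.
#   $1: known-good patch level (e.g. 2 for 5.10.2)
#   $2: known-bad patch level  (e.g. 117 for 5.10.117)
#   $3: name of a command that exits 0 if the given patch level is stable
bisect_patch_level() {
    good=$1
    bad=$2
    is_stable=$3
    while [ $(( bad - good )) -gt 1 ]; do
        mid=$(( (good + bad) / 2 ))
        if "$is_stable" "$mid"; then
            good=$mid      # no freeze: the culprit is in a later release
        else
            bad=$mid       # freeze: the culprit is this release or earlier
        fi
    done
    echo "$bad"
}

# Hypothetical stand-in for a real test run: pretend 5.10.40 is the first
# release that freezes.
simulated_is_stable() { [ "$1" -lt 40 ]; }

bisect_patch_level 2 117 simulated_is_stable
```

With 115 candidate releases between 5.10.2 and 5.10.117, this needs about seven test images instead of one per release. The exact midpoints differ slightly from the rounded ones suggested above (59 instead of 58 on the first step), which does not affect the result.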

Scumi commented 1 year ago

> • @Scumi has a 7" OV that also runs fine with a 22xxx image (and presumably a newer kernel)
> • On the other hand @Scumi also has a 7" OV (however w/ different attached devices) that freezes

This is actually one device. Freezes with the services enabled, doesn't when disabled. Kernel is 5.15.27.

tb59427 commented 1 year ago

Ah, got it. Apologies for the mistake...

realtimepeople commented 1 year ago

Hi, hopefully one of the experts manages to track down this lousy problem! In the meantime I suggest using the watchdog to reset any hanging OpenVario system:

Do the following:

Create file /lib/systemd/system/watchdog.service

Content of watchdog.service:
+++++++++++++++++++++++++++++++++++++++++++++++++
[Unit]
Description=Watchdog reset daemon

[Service]
ExecStart=/opt/bin/start_watchdog.sh
Restart=on-abort
CPUSchedulingPolicy=fifo
CPUSchedulingPriority=20

[Install]
WantedBy=multi-user.target
+++++++++++++++++++++++++++++++++++++++++++++++++

Create file /opt/bin/start_watchdog.sh

Content:
+++++++++++++++++++++++++++++++++++++++++++++++++
#!/bin/sh
# Run the reset script every 10 seconds
watch -n 10 /opt/bin/reset_watchdog.sh
+++++++++++++++++++++++++++++++++++++++++++++++++

Create file /opt/bin/reset_watchdog.sh

Content:
+++++++++++++++++++++++++++++++++++++++++++++++++
#!/bin/sh
# Any write to the watchdog device restarts its timeout
echo "1" > /dev/watchdog
+++++++++++++++++++++++++++++++++++++++++++++++++

Make both files executable:
chmod +x /opt/bin/start_watchdog.sh
chmod +x /opt/bin/reset_watchdog.sh

Enable the service so that it starts at boot:
systemctl enable watchdog.service

If you want to do this on your development computer, not within the running Cubieboard system, run:

ln -s /lib/systemd/system/watchdog.service /MOUNT/etc/systemd/system/multi-user.target.wants/watchdog.service

(/MOUNT/ must be replaced with the mount point of the uSD root file system. The link is created inside the mounted card and points to the path of the unit file as the Cubieboard will see it.)

This shall patch the problem for a while ... Regards Klaus
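
As a variation on the two scripts above, the periodic write can also be done with a plain shell loop instead of watch, which may behave differently when started without a terminal depending on which implementation is installed. This is only a sketch: the function name feed_watchdog and its optional arguments are made up for illustration, and the default device path and 10 second interval are simply taken from the scripts above.

```sh
#!/bin/sh
# Write to the watchdog device at a fixed interval.
#   $1: device path        (default /dev/watchdog)
#   $2: interval in seconds (default 10)
#   $3: number of writes    (default 0 = run forever; a count is handy for testing)
feed_watchdog() {
    dev=${1:-/dev/watchdog}
    interval=${2:-10}
    count=${3:-0}
    n=0
    while [ "$count" -eq 0 ] || [ "$n" -lt "$count" ]; do
        echo 1 > "$dev"        # any write restarts the watchdog timeout
        n=$(( n + 1 ))
        sleep "$interval"
    done
}

# In /opt/bin/start_watchdog.sh this would be called without a count:
#   feed_watchdog /dev/watchdog 10
```

If the whole system hangs, the loop stops running as well, the writes stop, and the hardware watchdog resets the board once its timeout expires, which is the same effect the original scripts aim for.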

tb59427 commented 1 year ago

For 5.7 Stefly systems another option is to build OV with kernel 5.10.2. This kernel runs for days on my system. For 7" systems it's less clear. Switching off variod, pulseaudio and sensord seems to be an option.

hph304 commented 1 year ago

Ok, so I got Ubuntu up and running and I managed to build an image according to the readme.md file. How do I build OV with kernel 5.10.2? If I manage that, I'll try it on my device

tb59427 commented 1 year ago

Hi, here's how I do that. First, find the commit ID for the kernel you want to build:

  1. for the 5.10.y line of kernels go here
  2. Search for the kernel you want to build (e.g. 5.10.2) in the search field top right
  3. click on the name of the kernel (e.g. 5.10.2)
  4. Copy the commit ID for the kernel (e.g. for 5.10.2 it's d1988041d19dc8b532579bdbb7c4a978391c0011)

(in docker shell):

  1. cd /workdir/meta-ov/recipes-kernel/linux
  2. rename linux-openvario_5.17.5.bb to the kernel version you want to use (e.g. linux-openvario_5.10.2.bb)
  3. edit the renamed file with your favourite editor (being an old unix fart I use vi - your mileage may vary)
  4. in the line starting with KBRANCH = change the branch to "linux-5.10.y" (edit: forgot this in my original post)
  5. in the line starting with SRCREV = change the id to the commit ID you copied above
  6. save file
  7. cd /workdir
  8. bitbake openvario-image
  9. Done.

The generated version will be numbered according to the naming standard (i.e. it will say something like ...22148... for today). Be sure to note somewhere (in the name or wherever you like) that this is not a "proper" build but one with an altered kernel, to avoid confusing yourself :-) (happened to me, hence the warning).

Generate SD card and use it.
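
The rename and the two edits from the docker-shell steps above can be scripted roughly as follows. This is a sketch under a few assumptions: the recipe contains lines starting with exactly "KBRANCH = " and "SRCREV = " as described, sed supports in-place editing with -i (as GNU sed in the Ubuntu-based build environment does), and the function name retarget_kernel_recipe is made up for illustration.

```sh
#!/bin/sh
# Rename the kernel recipe and point it at another branch and commit.
#   $1: recipe directory
#   $2: version in the current file name (e.g. 5.17.5)
#   $3: version for the new file name    (e.g. 5.10.2)
#   $4: kernel branch                    (e.g. linux-5.10.y)
#   $5: commit ID of the kernel release
retarget_kernel_recipe() {
    dir=$1
    old=$2
    new=$3
    branch=$4
    srcrev=$5
    recipe="$dir/linux-openvario_${new}.bb"
    mv "$dir/linux-openvario_${old}.bb" "$recipe" || return 1
    # Rewrite only the lines that start with "KBRANCH = " and "SRCREV = ".
    sed -i \
        -e "s|^KBRANCH = .*|KBRANCH = \"${branch}\"|" \
        -e "s|^SRCREV = .*|SRCREV = \"${srcrev}\"|" \
        "$recipe"
}

# Usage in the docker shell, with the values from the steps above:
#   retarget_kernel_recipe /workdir/meta-ov/recipes-kernel/linux \
#       5.17.5 5.10.2 linux-5.10.y d1988041d19dc8b532579bdbb7c4a978391c0011
#   cd /workdir && bitbake openvario-image
```

Scripting the change makes it less likely to forget the KBRANCH line when building several kernel versions in a row.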

tb59427 commented 1 year ago

So, I have tested my SteFly 5.7 with a number of 5.10.y kernels. Details, see here. I have seen freezes with all kernel versions > 5.10.2. So, something must have happened from 5.10.2 to 5.10.3 that creates those freezes on 5.7" OVs. As indicated earlier, the picture for 7" displays is not so clear (yet).

I have started to look at the changes that happened from 5.10.2 to 5.10.3. The last kernel I looked at was 4.3BSD in the late 80's - so I am not particularly familiar with the Linux kernel. It would be a great help if someone with much more Linux kernel know-how than me looked at those changes, in particular with respect to display and possibly wifi (all kernels > 5.10.2 show this strange slow connect behaviour in xcsoar).

lordfolken commented 1 year ago

https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.3 There is this in regard to wifi:
05725b40b9455d1bb36dd24a1f4b5d85e20d6c98 nl80211: validate key indexes for cfg80211_registered_device

There is also a change to the serial driver:
commit e6160ad6e7294968ac446315a936477f081b09c3
Author: Alexey Kardashevskiy aik@ozlabs.ru
Date: Thu Dec 3 16:58:34 2020 +1100

serial_core: Check for port state when tty is in error state

I'd start with those two.

tb59427 commented 1 year ago

Just in case anyone wants to test: I have built the latest version of OV (with the updated openembedded-core) for all supported platforms. Can be downloaded from here. There are two subfolders:

In case you want to test it, feel free to download and report back in the testing report sheet.

The vanilla build has been running on my system currently for about 4:30h - so keeping my fingers crossed. I had trouble mounting my standard USB stick with that image (mount would just hang). Worked with a different stick. Keep an eye on that one and please report that back, too. I also saw the same very slow connect from XCSoar to my XCVario via Wifi (USB dongle). Again, those of you using Wifi as well, check this and report back please.

hph304 commented 1 year ago

@tb59427 When I click here it sends me to the spreadsheet. Can you update the link so I can download the file?

MaxKellermann commented 1 year ago

You could bisect the commits between 5.10.2 and 5.10.3.
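
In a checkout of the stable kernel tree, such a bisection would typically start with git bisect start v5.10.3 v5.10.2 (bad revision first, then good), followed by git bisect good or git bisect bad after each test image; for every step the SRCREV of the kernel recipe would have to be set to the commit that git checks out. The sketch below only demonstrates the mechanics on a throwaway repository, so that it can be run anywhere git is installed. The eight commits, the file names and the "regression" in commit 5 are all invented for the demonstration.

```sh
#!/bin/sh
# Demonstrate "git bisect run" on a temporary repository in which the
# fifth of eight commits introduces a simulated regression.
demo_bisect() {
    repo=$(mktemp -d)
    (
        cd "$repo" || exit 1
        git init -q .
        git config user.name demo
        git config user.email demo@example.invalid
        for i in 1 2 3 4 5 6 7 8; do
            echo "$i" > patch.txt
            if [ "$i" -ge 5 ]; then echo freeze > state.txt; else echo ok > state.txt; fi
            git add -A
            git commit -q -m "patch $i"
            git tag "p$i"
        done
        # Bad revision first, then the known-good one.
        git bisect start p8 p1 > /dev/null
        # The test command exits 0 for a good commit and non-zero for a bad one.
        git bisect run sh -c 'grep -q ok state.txt' > /dev/null
        # refs/bisect/bad now points at the first bad commit.
        git log -1 --format=%s refs/bisect/bad
    )
    rm -rf "$repo"
}

demo_bisect
```

For the real kernel the test cannot be automated this easily, because each step means building an image and waiting for a freeze, so the good/bad verdicts would be entered by hand instead of using git bisect run.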