OpenVario is in an instable state again

August2111 commented 7 months ago

With PR #334 from Torsten (tb59427), the OV returned to a very stable running state, like the old releases (e.g. 17119). With PR #362 the unstable state is back. From my point of view this is because in #362 the min and max kernel frequencies are not the same, so the kernel switches between different frequencies - and we have a similar problem to before... I think there is no need to increase the frequency, and so I propose to revert PR #362 urgently! I have been doing this in my private repository since February - and haven't had any problems with it!

mihu-ov commented 7 months ago

I am not sure I understand that correctly. Did you experience any actual stability issues with #362 that were fixed with reverting #362? Or do you assume there could be stability issues with #362?

August2111 commented 7 months ago

Yes, you understand this correctly: I picked up #362 around November 2023 - and with it I had nearly the same OpenVario freezes as in 2021/2022, on at least 4 - 5 devices. In February I reverted this PR - and the problems are gone. What I'm missing for #362 is a wide test base for this stability issue, like the one Torsten Beyer provided for #334 - and I cannot believe that the comment on #362 is really true! To me it looks more or less like an estimate without a test basis! And so for me the best way is to go back to the previous situation!

mihu-ov commented 7 months ago

@bomilkar @tb59427 @linuxianer99
What is your opinion on this issue / reverting #362?

tb59427 commented 7 months ago

I think it would be worth the while....I still think it's the speed-cycling AND the voltage that causes the issue. And realistically openvario doesn't need much cpu power anyway....

bomilkar commented 7 months ago

I'm surprised to see the issue reappears after it seemed quiet for years. That looks strange. Before we change anything we should analyze if the frequency and voltage limitations are actually implemented. That can be done by (periodically) doing something like this:

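#!/bin/bash

# Print CPU frequency, CPU voltage and temperature every 10 seconds.
while true
do
    # cpuinfo_cur_freq reports kHz, the regulator reports microvolts
    cpuF0=$(cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq)
    cpuF1=$(cat /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq)
    cpuV=$(cat /sys/devices/system/cpu/cpu0/supplier:regulator:regulator.9/supplier/microvolts)
    # thermal_zone0/temp reports millidegrees Celsius
    tmpZ0=$(cat /sys/class/thermal/thermal_zone0/temp)

    echo -n "Time: "
    date
    printf "CPU freq: 0: %d MHz, 1: %d MHz\n" $((cpuF0/1000)) $((cpuF1/1000))
    # The e-3 suffix scales millivolts to volts and millidegrees to degrees
    printf "CPU0 voltage: %.2f V\n" "$((cpuV/1000))e-3"
    printf "Zone0 temp: %.1f C\n" "$((tmpZ0))e-3"
    sleep 10
done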

Watch it to see if:

the voltage is what it's supposed to be and never changes!
the frequency never exceeds the limit.
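
The two checks in this list could also be automated instead of watched by eye. A minimal sketch, assuming the limits are supplied by hand: `check_limits` is a hypothetical helper, and the 960000 kHz and 1400000 microvolt values in the example call are placeholders, not values confirmed in this thread.

```shell
#!/bin/sh
# check_limits FREQ_KHZ MAX_KHZ VOLT_UV EXPECTED_UV   (hypothetical helper)
# Prints one line per violated constraint.
# Returns 0 if both constraints hold, 1 otherwise.
check_limits() {
    rc=0
    if [ "$1" -gt "$2" ]; then
        echo "frequency $1 kHz exceeds limit $2 kHz"
        rc=1
    fi
    if [ "$3" -ne "$4" ]; then
        echo "voltage $3 uV differs from expected $4 uV"
        rc=1
    fi
    return $rc
}

# Example call with placeholder limits, reading the same sysfs files as the
# monitoring script:
#   check_limits \
#     "$(cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq)" 960000 \
#     "$(cat /sys/devices/system/cpu/cpu0/supplier:regulator:regulator.9/supplier/microvolts)" 1400000
```

Called inside the monitoring loop, this would leave a log line only when a limit is violated, which is easier to spot after a long run.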

I don't quite agree with "openvario doesn't need much cpu power anyway". It does when it loads a different map file and, more commonly, when you zoom out on a busy map. You can get it to freeze at 100% CPU if you zoom out too far. (That's why I apply a patch to limit the zoom to less than 250km.)

tb59427 commented 7 months ago

I can't confirm the need for CPU power (but this "need" is probably driven a lot by personal expectations) I have been flying with OV (but without #362 applied) for over a year now, and found it to be absolutely snappy and in no way sluggish or slow. When I fly in France, I tend to have a fairly busy map (I just can't remember all those cols and montagnes :-P) and have no trouble with it at all.

I remember that I experimented with just fixing voltage or just fixing cpu speed. And while the stability increased, there still were crashes (I can't provide details as I don't have my logs available right now - I am in Serres and only have my iPad with me). The only way I could get to a really stable OV was fixing both: cpu and voltage.

August2111 commented 7 months ago

In principle I can only agree with Torsten; thank you again for his extensive analyses, tests and their communication! In my opinion, he has ensured that the conventional OpenVario neither has to be thrown on the scrap heap nor can only be operated with the 2017 version :)

I'm surprised to see the issue reappears after it seemed quiet for years.

PR #362 isn't that old: it was only merged on September 19th, so it was basically not available in the last (European) season. Even in the season in the southern hemisphere - with probably fewer devices than here - the finished images from last year were certainly used: the last 'official' release image from OpenVario is from December 2022, plus the extended releases from Blaubart and me. This patch is not included in those (i.e. I had it in, but removed it again at the end of January due to these problems).

My observation is basically limited to 2-3 devices that I use for development in the winter - and to a device from a pilot friend who has subjected the whole thing to a long-term test for himself: OV powered up and fed with some Flarm data. We discovered together that the frequent freezes were gone again after PR #362 was removed. As I said, just an unbiased observation! We've been trying this out for about 2 months now, and since then my pilot friend's vulnerable device has not shown a single freeze!

In general, it should be noted that this effect occurred very differently in the past: with some devices it stopped after 1 - 2 hours at the latest, with some devices it happened relatively rarely - and in my opinion, with some devices not at all! The manufacturing quality of the Cubieboards also seems to have deteriorated significantly: Stefan Langer, for example, reports that he was no longer able to get many of the boards from the last delivery to work. A counter-test with 4 of my boards showed: These ran for hours without any problems with the internal Android, but not at all or only for a very short time with the OV SD card... In this respect, the following applies to me: If this works on a device with PR #362, that doesn't mean that all devices can tolerate it...

In addition, Bomilkar's test doesn't help me: As soon as the board freezes, its debug information also goes blank!

For this reason I decided to reopen this issue here! In my opinion, another solution should (and must) be found for the other problems that Bomilkar addressed here - which I can well understand, by the way! In any case, changes in the most critical infrastructure - the kernel - always require an extensive(!) field test in my opinion - and should not be incorporated into the system straight away based on assumptions and individual tests!

By the way, complaints about something not working properly often end up with Blaubart or me because we provide the fully compiled images...

bomilkar commented 7 months ago

In addition, Bomilkar's test doesn't help me: As soon as the board freezes, its debug information also goes blank!

I didn't mean to catch the reason for the freeze. I suggested the script to see if the constraints on clock and voltage are really doing what they should.

The Allwinner A20 was never one of the cleanest designs. And I'd be surprised if the A20 is still in production. There are much better designs these days. Most likely that's the reason why Cubie2s are hard to get. Who knows, the few which are still being sold might have failed some production test(s) and are now being sold from China in the hope that they will work or that the customers won't notice and complain.

If one board fails a different board may fail for an entirely different reason or run well (with a particular application). Unless a large number of boards fail I wouldn't spend much time on fixing the S/W. I would supply workarounds for those units that fail. As D-2402 suggested: it works if clock and voltage are fixed, at least for the issues he had seen so far.

bomilkar commented 7 months ago

Just to be clear, when I suggested a workaround I meant a script like this which can run at system start:

#!/bin/sh

# 144000 312000 528000 720000 864000 912000 960000
# max_clk=960000
# max_clk=912000
# max_clk=864000
# max_clk=720000
max_clk=528000
echo ${max_clk} > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
echo ${max_clk} > /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq
echo ${max_clk} > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
echo ${max_clk} > /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq

This would fix the clock frequency at 528 MHz.
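
The same pinning can be written as a loop that also reads the values back, so a rejected write does not go unnoticed. This is only a sketch under assumptions: `pin_cpu_freq` and its optional root-directory argument are hypothetical names, and the write order (max first, then min) simply mirrors the script above.

```shell
#!/bin/sh
# pin_cpu_freq KHZ [CPU_ROOT]   (hypothetical helper)
# Sets scaling_max_freq and scaling_min_freq to KHZ for every cpuN directory
# under CPU_ROOT, then reads both back. Returns 1 if any value did not stick.
pin_cpu_freq() {
    khz=$1
    root=${2:-/sys/devices/system/cpu}
    status=0
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || continue
        echo "$khz" > "$dir/scaling_max_freq"
        echo "$khz" > "$dir/scaling_min_freq"
        max=$(cat "$dir/scaling_max_freq")
        min=$(cat "$dir/scaling_min_freq")
        if [ "$max" != "$khz" ] || [ "$min" != "$khz" ]; then
            echo "$dir: wanted $khz, got min=$min max=$max" >&2
            status=1
        fi
    done
    return $status
}

# Example: pin all cores to 528 MHz (needs root on the device)
#   pin_cpu_freq 528000
```

The optional second argument exists only so the function can be tried against a scratch directory before running it on the board.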

bomilkar commented 7 months ago

In any case, changes in the most critical infrastructure - the kernel - always require an extensive(!) field test in my opinion - and should not be incorporated into the system straight away based on assumptions and individual tests!

That's why I suggested a pre-release in September 2023. "Pre-release" (as opposed to "release") means it's a test version, or when you're REALLY desperate. Don't use it if you're happy with what you have. It won't be tested enough if we don't release it to the public. However, it is important to set expectations.

tb59427 commented 7 months ago

Morning chaps,

so what's the consensus now and who takes action?

cheers -tb

August2111 commented 7 months ago

so what's the consensus now and who takes action?

Hi Torsten, this is a very good question :) As I said: in my private workflow I removed/reverted PR #362 - and then I saw no more freezes. I don't understand the last proposal from Bomilkar, because it will make the cubieboard a little bit slower... I'm very happy with the current state with #334! The only problem I see is the occasional crashes - this is not the freeze issue: in this case the display shows colored areas and stripes and is in a completely broken state... But this happens on a few boards only, most of the cubieboards don't have this issue... And it is independent of 17119 or newer versions - so lowering the kernel frequency will not help. Maybe there is a chance to reinvestigate the error search on these special boards with the debug image from Torsten... Regards August

DanD222 commented 7 months ago

So, @bomilkar - would you be OK with reverting c01a30c to get the frequency out of the equation?

tb59427 commented 7 months ago

so what's the consensus now and who takes action?

Maybe there is a chance to reinvestigate the error search on these special boards with the debug image from Torsten... Regards August

I can dig up the settings I used to enable the appropriate kernel messaging for debugging (dunno off the top of my head). It would require access to such a board though. Mine doesn't seem to be affected. Just spent 11hrs in the air in France with 0 hiccups....

cheers -tb

bomilkar commented 7 months ago

So, @bomilkar - would you be OK with reverting c01a30c to get the frequency out of the equation?

I'm OK with everything that fixes the issue, of course. But the same effect as reverting that PR can be achieved by fixing the CPU clock with a script:

max_clk=720000
echo ${max_clk} > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
echo ${max_clk} > /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq
echo ${max_clk} > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
echo ${max_clk} > /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq

That's a 5 minute thing. If the affected board works with that script, then reverting the PR will most likely also fix it. But if you build a fresh image other things may change and it may work for other reasons.

August2111 commented 7 months ago

From my point of view reverting is the better solution: it's cleaner from the workflow side, it also leaves the kernel voltage at the original value from Torsten and does the same thing to the frequencies as Bomilkar's last suggestion... and it is well proven!

bomilkar commented 7 months ago

The script is not meant as the final solution. It's just to test the hypothesis on the affected board only. If the script doesn't fix the issue, then reverting the PR will most probably not fix it either. Do you understand the script?

August2111 commented 7 months ago

Do you understand the script?

Yes, of course... The script does the same to the frequencies as the bugfix from Torsten - but nothing more.. On affected boards I have no issue after reverting #362 - I think we don't need additional effort with this 'new' proposal...

tb59427 commented 7 months ago

Maybe there is a chance to reinvestigate the error search on these special boards with the debug image from Torsten... Regards August

Just looked my old stuff up: try these DEBUG options for the kernel (possibly add more):

CONFIG_DEBUG_KERNEL
CONFIG_STACKTRACE
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT

then add them as .cfg snippets to meta-ov/recipes-kernel/linux/files/ and don't forget to add them to SRC_URI

enable console output on the onboard serial port and watch :-)
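
As a rough illustration of what such a `.cfg` snippet might look like, here is a sketch that writes the three options named above into a fragment file. The helper `write_debug_fragment` and the file name `debug.cfg` are assumptions for illustration; only the option names and the target directory come from this comment.

```shell
#!/bin/sh
# write_debug_fragment DIR [NAME]   (hypothetical helper)
# Writes a kernel config fragment enabling the debug options listed above.
# NAME defaults to debug.cfg, which is a placeholder file name.
write_debug_fragment() {
    dir=$1
    name=${2:-debug.cfg}
    mkdir -p "$dir" || return 1
    cat > "$dir/$name" <<'EOF'
CONFIG_DEBUG_KERNEL=y
CONFIG_STACKTRACE=y
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
EOF
}

# Example, using the directory named in this comment:
#   write_debug_fragment meta-ov/recipes-kernel/linux/files
# The fragment then still has to be listed in the kernel recipe (or its
# .bbappend), for instance with a line like:
#   SRC_URI += "file://debug.cfg"
```

Whether the fragment is actually merged into the kernel configuration depends on how the kernel recipe handles `.cfg` files, so it is worth checking the resulting `.config` after the build.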

August2111 commented 7 months ago

Many thanks, Torsten! Unfortunately I am a little bit overloaded with my projects - but as soon as I find some time, I'll tackle it... I got a few discarded circuit boards from Stefan Langer that show such a crash pretty quickly - and are therefore basically unusable at the moment! But with the Android (Cubier) from the internal flash memory they run perfectly ;-(

tb59427 commented 7 months ago

.... think we don't need additional effort with this 'new' proposal...

So, do we have consensus on "reverting"? Who can do the reversion?

linuxianer99 commented 7 months ago

ok. i will revert in the evening ..

mihu-ov commented 7 months ago

I don't think we will ever be able to understand all the details of frequency / voltage switching on the Allwinner processor. Maybe there's a reason why Allwinner used fixed frequency and voltage in their kernels, we will never know.

I am also for reverting. Not because we know that #362 causes the instabilities but just to save everybody the headache of investigating deeper.

linuxianer99 commented 7 months ago

I tried to revert, but it looks like github cannot do it because of other changes ...

@tb59427 : Could you please provide another pull request to set the voltage and frequency back to the working values? Thanks

tb59427 commented 7 months ago

Oh my god....you are mistaking me for someone who understands git :-) I managed to screw things up considerably during my initial attempts to create a PR for fixing this. Is there not a simpler way (for me that is :-P) of doing this?

August2111 commented 7 months ago

Maybe I can do that, because I have already done a 'reverting' PR in my fork repo - with all the needed changes inside - so I only have to copy it...

tb59427 commented 7 months ago

thank you guys - I would have never managed to get github to do this for me....:-)

DanD222 commented 7 months ago

thank you guys - I would have never managed to get github to do this for me....:-)

It’s pretty easy once you understand the separation into:

the main openvario repo
your online fork of the main openvario repo
your local machine copy of your online fork

You

Fork the main repo, to an online copy/fork under your Github account
Clone your online fork to your local machine,
make the changes there,
Push the changes to your online fork,
make a Pull Request to the main repo, and that’s it.

If you need to make changes to an already existing Pull Request you Force Push to your online fork instead of Push, which overwrites whatever was in your Pull Request.
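
On the command line, the clone, push and force-push parts of this workflow might look like the sketch below. The function names, `YOUR_USER` and the branch name `my-fix` are placeholders; forking and opening the Pull Request are done on the GitHub website. Note that the sketch uses `--force-with-lease`, a more cautious variant of a plain force push, rather than `--force` itself.

```shell
#!/bin/sh
# Hypothetical helpers wrapping standard git commands.

# Clone your online fork into DIR and create a work branch there.
start_work() {    # start_work FORK_URL DIR BRANCH
    git clone "$1" "$2" && git -C "$2" checkout -b "$3"
}

# First push of the branch to your online fork (the remote named "origin").
publish() {       # publish DIR BRANCH
    git -C "$1" push -u origin "$2"
}

# Update an existing Pull Request after rewriting commits (amend, rebase).
# --force-with-lease refuses to overwrite remote work you have not fetched.
republish() {     # republish DIR BRANCH
    git -C "$1" push --force-with-lease origin "$2"
}

# Example, with YOUR_USER and my-fix as placeholders:
#   start_work https://github.com/YOUR_USER/meta-openvario.git meta-openvario my-fix
#   (edit files, then: git add ... && git commit)
#   publish meta-openvario my-fix
#   (open the Pull Request on GitHub; after amending a commit:)
#   republish meta-openvario my-fix
```

For the revert discussed in this thread, the "make the changes" step might be `git revert c01a30c` inside the clone. That is only a guess at the procedure, since conflicts with later changes may have to be resolved by hand.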

Openvario / meta-openvario

OpenVario is in an instable state again #372