anholt / linux

Other
134 stars 24 forks source link

bcm2835-power: Timeout waiting for grafx power OK since kernel 5.1 #153

Closed jianhuilee closed 4 years ago

jianhuilee commented 5 years ago

Hi, After upgrading my rpi2 to kernel 5.1.9 from 5.0.9, vc4 is not working and dmesg shows "bcm2835-power bcm2835-power: Timeout waiting for grafx power OK" while loading. Kernel is built by multi_v7_defconfig configuration.

config.txt dmesg.txt

lategoodbye commented 5 years ago

I know these timeouts, but i wasn't able to reproduce them for your case.

Could please add initcall_debug to the cmdline and attach the new dmesg output? https://elinux.org/Initcall_Debug

Edit: Do you use Rasbian as a rootfs? Do you use a specific bootloader? Did you modified cmdline.txt?

jianhuilee commented 5 years ago

Hi, There you go. Thanks.

Do you use Rasbian as a rootfs? => Rasbian and nfs mount for rootfs. Do you use a specific bootloader? => No. Only update through Rasbian. Did you modified cmdline.txt? => Yes for nfs mount. Attached my cmdline.txt for your reference.

dmesg.txt cmdline.txt

lategoodbye commented 5 years ago

Does the issue still occur, if you mount the rootfs via SD card instead of nfs?

jianhuilee commented 5 years ago

The issue is still there via SD card. I also attached dmesg log under 5.0.9, but not sure if it will help to clarify because it uses different power domain driver after all. dmesg-5.0.9.txt dmesg-5.1.9.txt

I will try a fresh rasbian system later to see if there is something missing. Update: After a fresh install SD card and follows by dist-upgrade, the issue still happens after changing the kernel to 5.1.9.

lategoodbye commented 5 years ago

I don't have a clue what causes the different behavior. I also tested Raspbian and Raspbian Lite without any issue.

Could you please dump the content of the following debugfs entries: /sys/kernel/debug/devices_deferred /sys/kernel/debug/clk/clk_summary /sys/kernel/debug/pm_genpd/pm_genpd_summary

jianhuilee commented 5 years ago

Here is the log for the debugfs entries: log.txt

lategoodbye commented 5 years ago

Thanks. In order to exclude any conflicts between the old and new power driver, could you please test the following patch:

https://gist.github.com/lategoodbye/a76912da2aaea889a1b5c041109b7ab1

You only need to rebuild and replace the dtb.

BE AWARE: this is only for testing and breaks USB

jianhuilee commented 5 years ago

log for the entries and dmesg: log-patch1.txt dmesg.txt

lategoodbye commented 5 years ago

Okay, could you please undo the last patch and apply this one to get more debug information:

https://gist.github.com/lategoodbye/bbd8165be409b62aec28f9f43f20b5ec

This requires a new kernel and restore the old dtb.

The dmesg output without initcall_debug would be interesting.

jianhuilee commented 5 years ago

Please check: dmesg.txt

lategoodbye commented 5 years ago

Unfortunately i didn't see any problem. I added another patch to the last gist, which adds more debug to the ramp-up code and increases the timeout. But it's a shot in the dark,

jianhuilee commented 5 years ago

Looks like the increased timeout is not working :( dmesg.txt

I will find another pi2 to do a cross testing.

lategoodbye commented 5 years ago

@anholt It seems that even the initial level of 3.5 mA grafx doesn't stabilize. Is there any register we should dump in this case? Is there another dependency we should check?

@jianhuilee That would be great. Another idea would be to cross check the kernel binaries to exclude any toolchain issues.

anholt commented 5 years ago

@lategoodbye I think the theory is that you're supposed to inrush as little as possible, but may need to inrush more to succeed. But if we're failing at all levels, that makes me think we're missing something else.

lategoodbye commented 5 years ago

Yes, the inrush failed on all levels:

jianhuilee's RPI 2 (bad case):
[    8.963502] bcm2835-power bcm2835-power: bcm2835_power_pd_power_on: 0
[    8.963529] bcm2835-power bcm2835-power: Before: GRAFX 00001000 V3D_S_CTRL 00000007 V3D_M_CTRL 00000007
[    8.963547] bcm2835-power bcm2835-power: Timeout #1 waiting for grafx inrush 0
[    8.963562] bcm2835-power bcm2835-power: Timeout #2 waiting for grafx inrush 0
[    8.963578] bcm2835-power bcm2835-power: Timeout #1 waiting for grafx inrush 1
[    8.963595] bcm2835-power bcm2835-power: Timeout #2 waiting for grafx inrush 1
[    8.963609] bcm2835-power bcm2835-power: Timeout #1 waiting for grafx inrush 2
[    8.963626] bcm2835-power bcm2835-power: Timeout #2 waiting for grafx inrush 2
[    8.963640] bcm2835-power bcm2835-power: Timeout #1 waiting for grafx inrush 3
[    8.963657] bcm2835-power bcm2835-power: Timeout #2 waiting for grafx inrush 3

-> never powok

lategoodbye's RPI 2 (good case):
[   10.011450] bcm2835-power bcm2835-power: bcm2835_power_pd_power_on: 0
[   10.011470] bcm2835-power bcm2835-power: Before: GRAFX 00001000 V3D_S_CTRL 00000007 V3D_M_CTRL 00000007
[   10.011488] bcm2835-power bcm2835-power: For grafx inrush 0 finished after 1615 ns
[   10.011502] bcm2835-power bcm2835-power: For grafx inrush 1 finished after 937 ns
[   10.011514] bcm2835-power bcm2835-power: For grafx inrush 2 finished after 938 ns
[   10.011527] bcm2835-power bcm2835-power: For grafx inrush 3 finished after 885 ns

-> powok for every inrush level

I added an additional timeout of 20 us, which also failed. Since i saw this issue before i don't believe it's an hardware issue.

anholt commented 5 years ago

Wait a second. I'm not breaking out of the inrush loop on powok?

lategoodbye commented 5 years ago

I updated my trace with a good / bad case comparison (both boards use the same kernel version and the same config). So i don't know the reason why powok never came on @jianhuilee board. I could extend my dumps, but i don't know which registers are relevant here.

Edit: To answer your question: the implementation of the inrush loop looks correct to me.

jianhuilee commented 5 years ago

Just used my another rpi2, and it is working on the kernel 5.1.9! dmesg-ok.txt

But the failed one is working when it is running kernel 5.0.9. Not sure what is the difference between them.

lategoodbye commented 5 years ago

@jianhuilee I assume, you used the same SD card for both RPi 2?

jianhuilee commented 5 years ago

@lategoodbye Yes. Everything conditions are the same. The only difference is the rpi2 board.

lategoodbye commented 5 years ago

I'm not sure this is related but on the bad case RPi 2, the runtime status of the firmware controlled power domains VEC and HDMI were "unsupported", but not on my RPi 2.

jianhuilee commented 5 years ago

After some tracing, I guess the runtime status of both VEC and HDMI which is "unsupported" is because pm_runtime_enable() is called in vc4 driver. But since vc4 is not being enabled, the status keeps unsupported. Please correct me if I'm wrong. If you have any other things which need to be checked, please tell me.

lategoodbye commented 5 years ago

I've added another patch to my last gist, which try to dump all relevant registers before grafx power is enabled. Maybe we see any difference between good and bad case.

jianhuilee commented 5 years ago

Please check: dmesg.txt

lategoodbye commented 5 years ago

Sorry, i couldn't find any difference between them.

lategoodbye commented 5 years ago

Here is another user with the same timeout (Rpi 3 using bcm2835_defconfig): https://github.com/raspberrypi/linux/issues/2971

jianhuilee commented 5 years ago

Hi @lategoodbye Could you suggest a way to dump the registers but using raspberrypi-power driver? I planned to see the difference in the registers by using raspberrypi-power driver on kernel 5.1.9, but really could not find a easy way to dump the registers like you did between this two power drivers.

lategoodbye commented 5 years ago

The raspberrypi-power driver doesn't have access to this memory region. But i don't understand what you are trying to achieve. raspberrypi-power and bcm2835-power have completely different behavior. And we noticed that the register values before the issue are exactly the same in bad and good case.

jianhuilee commented 5 years ago

My thought was since firmware could successfully power up vc4 on the bad case pi, maybe there is worthy to see the register values? Or firmware does not use the same registers?

lategoodbye commented 5 years ago

At least two other boards (3B+) are affected by this issue: https://lists.fedoraproject.org/archives/list/arm@lists.fedoraproject.org/thread/5XUHBKJ72VMVRR646ZUXVAIVPS42MBWD/

lategoodbye commented 5 years ago

@jianhuilee I think, i understand what you are trying to achieve. You want to compare the registers for the cases: good case (only raspberrypi-power) vs good case (raspberrypi-power + bcm2835-power). In this case we better use some kind of low-level memory dump tool from userspace.

This approach assumes that we missed a register setting (and this is a software issue). Sorry, i don't have any documentation about this part of the BCM2835.

ozbenh commented 5 years ago

Same problem here with my Rpi2 running Fedora 30. Working with 5.0.9, broken by the update to 5.1.12.

ozbenh commented 5 years ago

Hrm... I'm observing the red LED blinking a few times during the kernel boot, I wonder if there's just not enough juice coming on USB. Though I don't know why it would work with the older kernel and the same power source.

I do see:

[ 25.829169] hwmon hwmon1: Undervoltage detected!

About 1s after the power driver timeouts. I see a bit later that is goes back to normal:

[ 27.877181] hwmon hwmon1: Voltage normalised

I tried another USB source with the same results, I'll try a few more tomorrow as I round up some >2A adapters (if I can find some).

lategoodbye commented 5 years ago

@jianhuilee Did you test the working RPi2 with the same PSU and micro-USB cable from the affected RPi2?

jianhuilee commented 5 years ago

@lategoodbye Yes, same PSU (Max: 2A) and same micro-USB cable. Under the testing, my rpi2 is not connected to any peripherals, hdmi or ethernet. The boot-up and stand-by current are below 0.5A. On the bad case rpi2 using raspberrypi-power, the vc4 could be powered up successfully, so I presume the PSU is fine here.

ozbenh commented 5 years ago

So I got a power adapter sold as "2.4A", it's fun, that one has the RED light going off a few seconds into the boot process and stays off, the voltage doesn't keep up :-) The Rpi2 otherwise seems to function normally, and the exact same issue occurs with graphics.

I got another 2A one, that one has the red light steady except with a small glitch during the boot process which may be when the power driver kicks in, it's hard to tell at this point, I'll get a serial adapter to try to check, but otherwise same behaviour.

lategoodbye commented 5 years ago

Could anyone who is able to reproduce this issue, please try the 5.1 branch of the downstream tree? https://github.com/raspberrypi/linux/tree/rpi-5.1.y

jianhuilee commented 5 years ago

dmesg-5.1.y.txt I cannot see any log from bcm2835-power to enable vc4, but it looks vc4 is probed with other error. Both good case and bad case pi are the same. I built with the bcm2709_defconfig kernel configuration.

warpme commented 5 years ago

Just another data-point: I built https://github.com/raspberrypi/linux/tree/rpi-5.2.y and I'm getting bcm2835-power: Timeout waiting for grafx power OK on my rpi2-b.

lategoodbye commented 5 years ago

@warpme Please report this here https://github.com/raspberrypi/linux/issues/3046

tomty89 commented 4 years ago

Wait a second. I'm not breaking out of the inrush loop on powok?

Edit: To answer your question: the implementation of the inrush loop looks correct to me.

@anholt @lategoodbye I was looking at bcm2835_power_power_on, and it seems to me that the for/outer loop would not be broken out of anyway. (There should be another break that checks whether powok is true after the while/inner loop, otherwise PM_WRITE will be done again and again with even higher inrush until all of the possible values are looped through, even if we get a true powok.)

tomty89 commented 4 years ago

Btw, if we have the above fixed, is it even expected that we could see more than one bcm2835-power: Timeout waiting for grafx power OK? The difference here intrigues me: https://github.com/anholt/linux/issues/153#issuecomment-502802752 with that in the bad case each inrush is tried twice. (Some sort of race/collision?)

EDIT: Actually, shouldn't we always see at most one instance of it at a boot anyway?

lategoodbye commented 4 years ago

(There should be another break that checks whether powok is true after the while/inner loop, otherwise PM_WRITE will be done again and again with even higher inrush until all of the possible values are looped through, even if we get a true powok.)

That's the whole point behind a rampup. So this behavior is intended. The only issue about the for loop is that we don't break the in case powok is false. But this isn't the problem here.

Edit: The goal is powok for the maximum.

tomty89 commented 4 years ago

Is it though? The comment stated that:

    /* Enable power.  Allowing too much current at once may result
     * in POWOK never getting set, so start low and ramp it up as
     * necessary to succeed.
     */

Sounds to me that we only need to PM_WRITE with a higher inrush if we do not get powok (within a timeout) with a lower one.

And either way, it still doesn't explain why the timeout error would appear more than once. Shouldn't bcm2835_power_power_on be called only once (for PM_GRAFX, that is)? Doesn't multiple instances of that error indicates something not being right?

lategoodbye commented 4 years ago

My understanding that POWOK is a flag that confirms that we finally reached the inrush. But we always need the maximum level.

The multiple occurences of the timeout comes from the deferred probe of the VC4 driver.

Sorry, i don't have any documentation of this block.

tomty89 commented 4 years ago

The multiple occurences of the timeout comes from the deferred probe of the VC4 driver.

But why is the number of occurrence indeterministic (yet finite)? What exactly causes it to start attempting and what makes it stop?

tomty89 commented 4 years ago

So I just realized the vc4 driver was built as module here. If I load it manually after boot with modprobe, the error still always show up twice. Why twice?

lategoodbye commented 4 years ago

In the upstream kernel the suggested patch to revert has been applied. https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20200402&id=e7b7daeb48e0bf5d8412d77f11069750ee7032bb

I will close this, because i don't have any idea how to fix this issue.