StarLabsLtd / firmware

[StarBook Mk VI - AMD] S3 suspend: failure to load GPU firmware #75

Closed tsmithe closed 8 months ago

tsmithe commented 1 year ago

On resuming from S3 suspend, the platform does not seem to load the GPU firmware correctly. See https://gitlab.freedesktop.org/drm/amd/-/issues/2403#note_1774334 for more information.

I am on the AMI firmware (since coreboot doesn't yet seem to be available, following #65), but if (when it becomes available) switching to coreboot would resolve this bug, that would be fine by me.

tsmithe commented 1 year ago

I spent some more time investigating this over the weekend, and now suspect that there's something going on in the timing of the resume process that means that this bug is sometimes present: with a "slower" kernel, things seem more stable. Much more detail reported at https://gitlab.freedesktop.org/drm/amd/-/issues/2403#note_1782282

tsmithe commented 1 year ago

Comment from @superm1 at AMD (https://gitlab.freedesktop.org/drm/amd/-/issues/2403#note_1785030):

Something I want to point out is that with Windows nearly all the Cezanne/Barcelo APUs ship with Modern Standby. S3 is relatively only utilized in Linux, and so it's the less common path from a firmware perspective. I'm not saying it's not an AGESA bug, but once the problem is outside of the kernel the OEM needs to debug it and reach that conclusion. If it's an AGESA bug after all, the OEM can work with AMD on getting a solution. Does your BIOS advertise anything about the AGESA version? I can ask an internal team to try to set up some reference hardware with the same AGESA version, BIOS configured for S3 and latest 6.1.y to see if we can reproduce.

What do you think?

Sean-StarLabs commented 1 year ago

I'll reach out to Mario :)

tsmithe commented 1 year ago

Thanks, Sean!

tsmithe commented 1 year ago

I'm sorry to say I don't think the recent firmware 1.4.0 has fixed this. I installed it this afternoon, and on my first reboot (with btusb enabled and Linux 6.2), the machine failed to resume from S3. (I also couldn't see an option in the BIOS settings to enable S2idle, so I haven't tried that; Mario suggested to me that it was necessary to change a BIOS setting in order to enable it.)

Sean-StarLabs commented 1 year ago

Ideally, it would be shut down first, but (this might sound dumb) how are you suspending? I had 20 cycles (systemctl suspend) in 24 hours with 5.19.

> change a BIOS setting

That's just generic advice - it's there if the OS wants to use it
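For anyone wanting to count suspend/resume cycles the way they are quoted in this thread, here is a minimal sketch of a loop harness. It is not what anyone in the thread ran: it swaps `systemctl suspend` for `rtcwake -m mem -s N` from util-linux so that the machine wakes itself, and the function names, the progress file and the injectable `runner` argument are all made up for illustration. A failed resume hangs the machine, so the cycle number is written and synced to disk before each suspend; after a hard reset the file shows which cycle never came back.

```python
import os
import subprocess
from pathlib import Path


def rtcwake_cmd(wake_after_s: int) -> list:
    """Command that suspends to RAM and arms an RTC alarm to wake up again."""
    return ["rtcwake", "-m", "mem", "-s", str(wake_after_s)]


def record_progress(path: Path, cycle: int) -> None:
    """Write the cycle number and force it to disk before suspending."""
    with open(path, "w") as f:
        f.write(f"{cycle}\n")
        f.flush()
        os.fsync(f.fileno())


def run_cycles(cycles: int, wake_after_s: int, progress_file: Path,
               runner=subprocess.run) -> int:
    """Run up to `cycles` suspend/resume cycles; return how many completed."""
    completed = 0
    for i in range(1, cycles + 1):
        record_progress(progress_file, i)
        result = runner(rtcwake_cmd(wake_after_s))
        if result.returncode != 0:  # rtcwake itself failed, e.g. not run as root
            break
        completed = i
    return completed
```

Calling `run_cycles(20, 30, Path("suspend-progress.txt"))` as root would attempt 20 cycles of 30 seconds each; both numbers are arbitrary.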

tsmithe commented 1 year ago

Ah, OK (re s2idle). I'll try that.

Regarding suspend, I use the same command (or the KDE session manager). 5.19 seems more stable than 6.2. I built mainline to make sure it wasn't anything distribution-specific. I can share my kconfig if you like. (Edit to clarify: I did do shutdown first, then I booted into 6.2, then I tried to suspend, which then failed to resume.)

Here's the dmesg: dmesg-0321.log

Note that the errors are seemingly a bit different from my earlier logs (posted to the drm/amd issue tracker).

superm1 commented 1 year ago

> > change a BIOS setting
>
> That's just generic advice - it's there if the OS wants to use it

So is there now a uPEP ACPI device? Does the FADT advertise low-power idle support? Does NVMe set simple suspend?

tsmithe commented 1 year ago

The amd_s2idle.py script from drm/amd reports:

❌ ACPI FADT doesn't support Low-power S0 idle
❌ PMC driver `amd_pmc` not loaded

tsmithe commented 1 year ago

I wasn't able to test Mario's ASPM suggestion (on the drm tracker), seemingly because

[    0.367914] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it

(But I'm sure you knew about that already!)

I do think this issue might still want to be marked as open, though...
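Both observations above, the `amd_s2idle.py` result and the ASPM kernel message, come down to bits in the ACPI FADT. The sketch below decodes those two bits from a raw table dump; the offsets and bit positions follow the FADT layout in the ACPI specification, while the function name and the returned dictionary keys are made up for illustration. It is an assumption about what the two checks amount to, not a copy of the script's or the kernel's code.

```python
import struct

# Field offsets and bit positions from the FADT layout in the ACPI specification.
IAPC_BOOT_ARCH_OFFSET = 109          # 16-bit IA-PC Boot Architecture Flags
FLAGS_OFFSET = 112                   # 32-bit Fixed Feature Flags
NO_ASPM = 1 << 4                     # IAPC_BOOT_ARCH: OS must not enable PCIe ASPM
LOW_POWER_S0_IDLE_CAPABLE = 1 << 21  # Flags: platform supports low-power S0 idle


def decode_fadt(table: bytes) -> dict:
    """Report the ASPM and low-power S0 idle bits of a raw FADT dump."""
    if table[:4] != b"FACP":
        raise ValueError("not a FADT: signature should be 'FACP'")
    (boot_arch,) = struct.unpack_from("<H", table, IAPC_BOOT_ARCH_OFFSET)
    (flags,) = struct.unpack_from("<I", table, FLAGS_OFFSET)
    return {
        "aspm_forbidden": bool(boot_arch & NO_ASPM),
        "low_power_s0_idle": bool(flags & LOW_POWER_S0_IDLE_CAPABLE),
    }
```

On Linux the table can usually be read as root from `/sys/firmware/acpi/tables/FACP` and passed in as bytes. For the firmware described here one would expect `aspm_forbidden` to be true and `low_power_s0_idle` to be false.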

Sean-StarLabs commented 1 year ago

I think I'm missing your point - why would it need to be opened?

tsmithe commented 1 year ago

Uhh, I'm still suffering from it... but now I see I seem still to be on 1.4.0... I shall investigate!

Sean-StarLabs commented 1 year ago

Oh gotcha. If you can reproduce with the latest one, just use #76 - we only need one issue for S3 not working, regardless of the cause

tsmithe commented 1 year ago

Unfortunately, the latest firmware (1.5.0) hasn't resolved the issue: I had a failure to resume on the 3rd attempt. Here's a dmesg: dmesg-0410.log

Isn't #76 about s2idle (S0ix), not S3? (Mario suggested that S0ix was preferable these days on AMD platforms anyway...)

Sean-StarLabs commented 1 year ago

#76 was s2idle support being advertised - that's done, it's just S3 now. Can you remind me what distro you are using? Need to try and replicate this
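Since s2idle and S3 are easy to mix up in this discussion, a quick way to see which one a given boot will actually use is `/sys/power/mem_sleep`: the kernel lists the supported modes there and puts the active one in square brackets, with `deep` meaning S3 and `s2idle` meaning suspend-to-idle. A small sketch of parsing that file follows; the function name is made up for illustration.

```python
from typing import List, Optional, Tuple


def parse_mem_sleep(text: str) -> Tuple[Optional[str], List[str]]:
    """Split the contents of /sys/power/mem_sleep into (active, supported)."""
    active = None
    modes = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # the bracketed entry is the active mode
            active = token
        modes.append(token)
    return active, modes
```

For example, the string `s2idle [deep]` parses to an active mode of `deep` with both modes supported, which is what an S3 configuration would be expected to show.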

tsmithe commented 1 year ago

Ah, OK. Do you plan to support s2idle in the future? (Just curious; working S3 would be fine for now!)

I'm on Debian testing. I can mostly avoid the failures to resume by running Ubuntu's kernel 5.15.0-66-generic (from jammy and backported to focal); I haven't tried newer versions of the Ubuntu kernel.

To try to make sure that the failures aren't a distro-specific thing (and to make it easier to bisect if that became necessary), I find it easiest to reproduce the issue using my build of the 6.2 mainline kernel (torvalds tag v6.2); I just took the Ubuntu config and stripped out as many of the irrelevant modules as I could (to make it faster to build). Here's the kconfig I used: config-6.2.0-tsmithe

(I'm pretty sure though the failure to resume is nothing to do with my config, because I can also reproduce it on the standard Debian testing kernel.)

Sean-StarLabs commented 1 year ago

Ideally, we'd use s0ix as a last resort - lots of distros don't work with it at all.

I've just been testing with lunar - as I couldn't reproduce it with kinetic. I'll try mainline

tsmithe commented 1 year ago

It is mysterious that you struggle to reproduce it, because it is quite easy to reproduce here!

Sean-StarLabs commented 1 year ago

I can replicate it on Debian with that kernel... wonder if it's just a driver bug. @superm1 Would you happen to know if any drivers, microcode or anything like that could cause something like this?

superm1 commented 1 year ago

> @superm1 Would you happen to know if any drivers, microcode or anything like that could cause something like this?

Not that I'm aware of. It is communicating with firmware, and getting bad results. That's why it looks like a firmware issue to me.

tsmithe commented 1 year ago

According to this upstream comment of Mario's, AMD tried and failed to reproduce the failure over 500 suspend-resume cycles on reference hardware.

The fact that the bug is still present given the apparent ASPM change in the latest firmware suggests that that can't be the whole story... (Likewise, I do still occasionally encounter the bug on new kernels when btusb is blacklisted.)

I think this issue should be re-opened!

Sean-StarLabs commented 1 year ago

£5 says they didn't try with Debian and a custom kernel ;)

tsmithe commented 1 year ago

It's just mainline v6.2! You can build it with the standard/Ubuntu/Debian config, but that takes ages (even on a 5800U)!

Sean-StarLabs commented 1 year ago

Right, but 6.2 works fine on Ubuntu. All I was saying was that I doubt AMD test with the same set up; most use the latest tag for the distro and don't change anything - probably testing for Debian.

superm1 commented 1 year ago

I double-checked the logs; it was 6.1.12 that was tested at that time.

tsmithe commented 1 year ago

I have also had problems on 6.1. Strange that Ubuntu kernels seem to work better.

Sean-StarLabs commented 1 year ago

Strange indeed - whilst writing these messages, I've had one on lunar with 6.2.0-20 - now on the 49th cycle, and still happy.

superm1 commented 1 year ago

> £5 says they didn't try with Debian and a custom kernel ;)

I just had an internal team test this. They ran 100 cycles on Debian 11 w/ a hand-compiled 6.2 with the kernel config linked above and didn't hit the failure on the reference hardware running a current BIOS.
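To put clean runs like this in perspective, here is a rough sketch of how much a failure-free run says about the per-cycle failure rate. It assumes every cycle fails independently with the same probability, which may well not hold for this bug given the timing and power-state hints elsewhere in the thread; the function name is made up for illustration.

```python
def failure_rate_upper_bound(clean_cycles: int, confidence: float = 0.95) -> float:
    """Largest per-cycle failure probability still consistent, at the given
    confidence level, with seeing `clean_cycles` cycles and zero failures."""
    if clean_cycles <= 0:
        raise ValueError("need at least one observed cycle")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - p) ** clean_cycles == 1 - confidence for p.
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_cycles)
```

Under that assumption, 100 clean cycles bound the failure rate at roughly 3% per cycle and 500 clean cycles at roughly 0.6%. A machine that fails within a handful of attempts is hard to square with either figure, which points at a difference between the test environments more than at bad luck.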

superm1 commented 1 year ago

Trying to hypothesize a difference in the test environment (aside from the WLAN card) - can you do an experiment to ONLY test suspending and resuming with power adapter connected? Can you only trigger the issue on battery perhaps?
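For an experiment like this it helps to log the adapter state alongside each suspend attempt. Below is a minimal sketch that reads it from the Linux power-supply sysfs class; it assumes the adapter shows up as a supply whose `type` is `Mains`, which is an assumption about this machine (USB-C supplies are sometimes reported with a different type), and the function name is made up for illustration.

```python
from pathlib import Path
from typing import Optional


def mains_online(base: Path = Path("/sys/class/power_supply")) -> Optional[bool]:
    """Return True if any 'Mains' supply is online, False if all are offline,
    or None if no 'Mains' supply could be read at all."""
    state = None
    for supply in sorted(base.iterdir()):
        try:
            if (supply / "type").read_text().strip() != "Mains":
                continue
            online = (supply / "online").read_text().strip() == "1"
        except OSError:
            continue  # supply without readable 'type' or 'online' attributes
        if online:
            return True
        state = False
    return state
```

Recording this value just before each suspend makes it possible to split failed and successful resumes by power source afterwards.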

Sean-StarLabs commented 1 year ago

> Trying to hypothesize a difference in the test environment (aside from the WLAN card) - can you do an experiment to ONLY test suspending and resuming with power adapter connected? Can you only trigger the issue on battery perhaps?

Aren't all the AMD reference boards memory down?

I've been starting S3 loops with a test EC that doesn't report the battery to ACPI but the TSI/RMI changes based on charge level - I'll experiment with that.

superm1 commented 1 year ago

> Aren't all the AMD reference boards memory down?

No, some do accept socketed memory.

tsmithe commented 1 year ago

Hello again... I wonder if any progress might've been made on this? Particularly given the recent finding that the bug may be related to the state of the power adapter https://gitlab.freedesktop.org/drm/amd/-/issues/2403#note_1947732

Sean-StarLabs commented 1 year ago

Interesting

Some - all being good, should have something to test tail end of this week

pastyfiend commented 1 year ago

Hi Sean, I just wondered if any progress had been made on this?

Sean-StarLabs commented 1 year ago

Yes - this one is ready to go, but I wanted to roll in the latest AGESA as it fixed 3 other bugs (2 unreported by anyone, 1 only reported about a week ago) and quite nicely boosts performance. However, that caused a regression where they won't shut down.

My plan, unless anyone calls for otherwise, is to try and figure that one out before the next version. If I'm not there by the end of the month, I'll just release it without the AGESA update.

dthuerck commented 1 year ago

If it was up to me, I'd wait - performance boost sounds promising!

Jokes aside - I've noticed that after switching (Ubuntu) to Mainline Kernels > 6.4.1, my suspend works much more reliably. Haven't encountered a single failed wakeup in weeks.

tsmithe commented 1 year ago

(I've found 6.4 quite reliable, but not perfectly: so far, it's about as good as 5.15 for me, which is what I found works best. So I'm really looking forward to the fix for this!)

Sean-StarLabs commented 11 months ago

1.8.0 now in the testing remote

tsmithe commented 11 months ago

I haven't had long enough to confirm that the system now always resumes from suspend, but here are a couple of initial observations, bearing in mind that I had previously been on v1.5.0:

  1. When resuming, it is now no longer sufficient just to open the lid: I also need to press the power button.
  2. The 'quiet' fan setting is now a bit louder (at least it spins up [much?] louder when the CPU is loaded).
  3. The battery meter in KDE now just says "estimating" instead of a remaining time. (I've been running v1.8.0 since yesterday afternoon, but I've only booted into the desktop once since installing it; maybe this is relevant.) Update: the time estimation has now returned.

tsmithe commented 11 months ago

Hello again, an update to (3): I just resumed my laptop after a night on suspend but not on charge, and it is reporting "0%" battery (and tried unsuccessfully to hibernate itself).

It's clearly not actually at 0% battery, because I'm using it right now to type this... So maybe something is off with the "improved power reporting"!

Update: after a few minutes, the laptop did in the end power off; it took a few minutes more before it would start to charge again, too... Oops! Sorry for the noise...

tsmithe commented 11 months ago

Sorry to say that v1.8.0 hasn't (entirely?) fixed the resume bug. I just experienced it again; here's the extract from the corresponding dmesg. I am running Linux 6.4.0-4-amd64 from the Debian testing image (version 6.4.13-1).

resume-0916.log

pastyfiend commented 11 months ago

Afraid I have a problem with this too. Now seems unable to charge via USB-C... I don't think anything else has changed, so assume this is related to the firmware upgrade?

tsmithe commented 11 months ago

Another thing I have noticed is that power consumption seems to be greater on 1.8.0, including while the system is suspended. This morning the battery seemed to lose 10 percentage points of charge over a couple of hours of suspend... (I haven't done rigorous measurements though.)

Sean-StarLabs commented 11 months ago

@tsmithe It might have done - the EC code that polls the processor for various info is timing out, and that alternates with the battery and charger. Will track those in #126 and then come back here once fixed.

tsmithe commented 10 months ago

I have only encountered this bug once so far on v1.14 -- but I have still encountered it ... Here's a dmesg extract:
1031.log