Closed tsmithe closed 8 months ago
I spent some more time investigating this over the weekend, and now suspect that there's something going on in the timing of the resume process that means that this bug is sometimes present: with a "slower" kernel, things seem more stable. Much more detail reported at https://gitlab.freedesktop.org/drm/amd/-/issues/2403#note_1782282
Comment from @superm1 at AMD (https://gitlab.freedesktop.org/drm/amd/-/issues/2403#note_1785030):
Something I want to point out is that with Windows nearly all the Cezanne/Barcelo APUs ship with Modern Standby. S3 is relatively only utilized in Linux, and so it's the less common path from a firmware perspective. I'm not saying it's not an AGESA bug, but once the problem is outside of the kernel the OEM needs to debug it and reach that conclusion. If it's an AGESA bug after all, the OEM can work with AMD on getting a solution. Does your BIOS advertise anything about the AGESA version? I can ask an internal team to try to set up some reference hardware with the same AGESA version, BIOS configured for S3 and latest 6.1.y to see if we can reproduce.
What do you think?
I'll reach out to Mario :)
Thanks, Sean!
I'm sorry to say I don't think the recent firmware 1.4.0 has fixed this. I installed it this afternoon, and on my first reboot (with `btusb` enabled and Linux 6.2), the machine failed to resume from S3. (I also couldn't see an option in the BIOS settings to enable s2idle, so I haven't tried that; Mario suggested to me that it was necessary to change a BIOS setting in order to enable it.)
Ideally, it would be shut down first. This might sound dumb, but how are you suspending? I had 20 cycles (`systemctl suspend`) in 24 hours with 5.19.
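For anyone wanting to count suspend cycles in a repeatable way, here is a minimal sketch of an automated loop. It is an assumption-laden illustration, not what was used in this thread: it swaps `systemctl suspend` for `rtcwake -m mem -s N` (from util-linux, needs root) so the machine wakes itself, and the function names `rtcwake_command` and `run_cycles` are hypothetical.

```python
import subprocess
import time


def rtcwake_command(wake_after_s):
    """Build an rtcwake call that suspends to 'mem' and wakes after N seconds.

    'mem' uses whichever mode is selected in /sys/power/mem_sleep
    (for example 'deep' for S3).
    """
    return ["rtcwake", "-m", "mem", "-s", str(wake_after_s)]


def run_cycles(cycles, wake_after_s=30, settle_s=10,
               runner=subprocess.run, sleep=time.sleep):
    """Run up to `cycles` suspend/resume cycles; return how many completed.

    `runner` and `sleep` are injectable so the loop can be dry-run
    without actually suspending the machine.
    """
    completed = 0
    for i in range(1, cycles + 1):
        print(f"cycle {i}/{cycles}: suspending", flush=True)
        result = runner(rtcwake_command(wake_after_s))
        # Stop if the suspend command itself reported an error.
        if getattr(result, "returncode", 0) != 0:
            break
        completed += 1
        # Give devices a moment to settle before the next cycle.
        sleep(settle_s)
    return completed
```

If the machine fails to resume, the loop simply never gets past that cycle, so the last printed cycle number shows how far it got.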
change a BIOS setting
That's just generic advice - it's there if the OS wants to use it
Ah, OK (re s2idle). I'll try that.
Regarding suspend, I use the same command (or the KDE session manager). 5.19 seems more stable than 6.2. I built mainline to make sure it wasn't anything distribution-specific. I can share my kconfig if you like. (Edit to clarify: I did do shutdown first, then I booted into 6.2, then I tried to suspend, which then failed to resume.)
Here's the dmesg: dmesg-0321.log
Note that the errors are seemingly a bit different from my earlier logs (posted to the drm/amd issue tracker).
change a BIOS setting
That's just generic advice - it's there if the OS wants to use it
So is there now a uPEP ACPI device? Does the FADT advertise low-power idle support? Does NVMe set simple suspend?
The `amd_s2idle.py` script from drm/amd reports:
❌ ACPI FADT doesn't support Low-power S0 idle
❌ PMC driver `amd_pmc` not loaded
I wasn't able to test Mario's ASPM suggestion (on the drm tracker), seemingly because
[ 0.367914] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
(But I'm sure you knew about that already!)
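As a side note on the S3 versus s2idle question above: the kernel exposes the available suspend modes in `/sys/power/mem_sleep`, with the currently selected one in square brackets (for example `s2idle [deep]`). The following is a small sketch for reading that file; the helper names `parse_mem_sleep` and `read_mem_sleep` are made up for illustration, and this is not a substitute for the `amd_s2idle.py` checks.

```python
def parse_mem_sleep(text):
    """Parse the contents of /sys/power/mem_sleep.

    Returns (modes, active): the list of supported modes and the
    one currently selected (shown in [brackets] by the kernel).
    """
    modes = []
    active = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        modes.append(token)
    return modes, active


def read_mem_sleep(path="/sys/power/mem_sleep"):
    """Read and parse the sysfs file; only works on a Linux system."""
    with open(path) as f:
        return parse_mem_sleep(f.read())
```

On a machine using S3, `active` would be `deep`; if only `s2idle` is listed, the firmware is not offering S3 to the OS.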
I do think this issue might still want to be marked as open, though...
I think I'm missing your point - why would it need to be opened?
Uhh, I'm still suffering from it... but now I see I seem still to be on 1.4.0... I shall investigate!
Oh gotcha. If you can reproduce with the latest one, just use #76 - we only need one issue for S3 not working, regardless of the cause
Unfortunately, the latest firmware (1.5.0) hasn't resolved the issue: I had a failure to resume on the 3rd attempt. Here's a dmesg: dmesg-0410.log
Isn't #76 about s2idle (S0ix), not S3? (Mario suggested that S0ix was preferable these days on AMD platforms anyway...)
Ah, OK. Do you plan to support s2idle in the future? (Just curious; working S3 would be fine for now!)
I'm on Debian testing. I can mostly avoid the failures to resume by running Ubuntu's kernel `5.15.0-66-generic` (from jammy and backported to focal); I haven't tried newer versions of the Ubuntu kernel.
To try to make sure that the failures aren't a distro-specific thing (and to make it easier to bisect if that became necessary), I find it easiest to reproduce the issue using my build of the 6.2 mainline kernel (torvalds tag `v6.2`); I just took the Ubuntu config and stripped out as many of the irrelevant modules as I could (to make it faster to build). Here's the kconfig I used:
config-6.2.0-tsmithe
(I'm pretty sure though the failure to resume is nothing to do with my config, because I can also reproduce it on the standard Debian testing kernel.)
Ideally, we'd use s0ix as a last resort - lots of distros don't work with it at all.
I've just been testing with lunar - as I couldn't reproduce it with kinetic. I'll try mainline
It is mysterious that you struggle to reproduce it, because it is quite easy to reproduce here!
I can replicate it on Debian with that kernel... wonder if it's just a driver bug. @superm1 Would you happen to know if any drivers, microcode or anything like that could cause something like this?
@superm1 Would you happen to know if any drivers, microcode or anything like that could cause something like this?
Not that I'm aware of. It is communicating with firmware, and getting bad results. That's why it looks like a firmware issue to me.
According to this upstream comment of Mario's, AMD tried and failed to reproduce the failure over 500 suspend-resume cycles on reference hardware.
The fact that the bug is still present given the apparent ASPM change in the latest firmware suggests that that can't be the whole story... (Likewise, I do still occasionally encounter the bug on new kernels when `btusb` is blacklisted.)
I think this issue should be re-opened!
£5 says they didn't try with Debian and a custom kernel ;)
It's just mainline `v6.2`! You can build it with the standard/Ubuntu/Debian config, but that takes ages (even on a 5800U)!
Right, but `6.2` works fine on Ubuntu. All I was saying was that I doubt AMD test with the same setup; most use the latest tag for the distro and don't change anything - probably `testing` for Debian.
I double-checked the logs; it was 6.1.12 that was tested at that time.
I have also had problems on 6.1. Strange that Ubuntu kernels seem to work better.
Strange indeed - whilst writing these messages, I've had one on `lunar` with `6.2.0-20` - now on the 49th cycle, and still happy.
£5 says they didn't try with Debian and a custom kernel ;)
I just had an internal team test this. They ran 100 cycles on Debian 11 w/ a hand-compiled 6.2 with the kernel config linked above and didn't hit the failure on the reference hardware running a current BIOS.
Trying to hypothesize a difference in the test environment (aside from the WLAN card) - can you do an experiment to ONLY test suspending and resuming with power adapter connected? Can you only trigger the issue on battery perhaps?
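To put the "100 clean cycles" and "500 clean cycles" results in perspective, here is a rough back-of-the-envelope sketch. It assumes each cycle fails independently with the same probability, which may well not hold here (timing, power adapter state, and so on), and the 5% figure in the usage note is a made-up example, not a measurement from this thread.

```python
def prob_clean_run(p_fail, cycles):
    """Chance of seeing zero failures in `cycles` independent cycles."""
    return (1.0 - p_fail) ** cycles


def max_fail_rate(cycles, confidence=0.95):
    """Largest per-cycle failure rate still consistent with a clean run.

    Solves (1 - p) ** cycles == 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / cycles)
```

For example, `prob_clean_run(0.05, 100)` is roughly 0.6%, so a machine with a hypothetical 5% per-cycle failure rate would be very unlikely to pass 100 cycles. Conversely, `max_fail_rate(500)` comes out at about 0.6% per cycle, so a clean 500-cycle run suggests the reference hardware either does not have the bug or hits it far less often than the affected laptops do.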
Trying to hypothesize a difference in the test environment (aside from the WLAN card) - can you do an experiment to ONLY test suspending and resuming with power adapter connected? Can you only trigger the issue on battery perhaps?
Aren't all the AMD reference boards memory down?
I've been starting S3 loops with a test EC that doesn't report the battery to ACPI but the TSI/RMI changes based on charge level - I'll experiment with that.
Aren't all the AMD reference boards memory down?
No, some do accept socketed memory.
Hello again... I wonder if any progress might've been made on this? Particularly given the recent finding that the bug may be related to the state of the power adapter https://gitlab.freedesktop.org/drm/amd/-/issues/2403#note_1947732
Interesting
Some - all being good, should have something to test tail end of this week
Hi Sean, I just wondered if any progress had been made on this?
Yes - this one is ready to go, but I wanted to roll in the latest AGESA as it fixed 3 other bugs (2 unreported by anyone, 1 only reported about a week ago) and quite nicely boosts performance. However, that caused a regression where they won't shut down.
My plan, unless anyone calls for otherwise, is to try and figure that one out before the next version. If I'm not there by the end of the month, I'll just release it without the AGESA update.
If it was up to me, I'd wait - performance boost sounds promising!
Jokes aside - I've noticed that after switching (Ubuntu) to mainline kernels > 6.4.1, my suspend works much more reliably. Haven't encountered a single failed wakeup in weeks.
(I've found 6.4 quite reliable, but not perfectly so: so far, it's about as good as 5.15 for me, which is what I found works best. So I'm really looking forward to the fix for this!)
`1.8.0` now in the testing remote
I haven't had long enough to confirm that the system now always resumes from suspend, but here are a couple of initial observations, bearing in mind that I had previously been on v1.5.0:
Hello again, an update to (3): I just resumed my laptop after a night on suspend but not on charge, and it is reporting "0%" battery (and tried unsuccessfully to hibernate itself).
It's clearly not actually at 0% battery, because I'm using it right now to type this... So maybe something is off with the "improved power reporting"!
Update: after a few minutes, the laptop did in the end power off; it took a few minutes more before it would start to charge again, too... Oops! Sorry for the noise...
Sorry to say that v1.8.0 hasn't (entirely?) fixed the resume bug. I just experienced it again; here's the extract from the corresponding dmesg. I am running Linux 6.4.0-4-amd64 from the Debian testing image (version 6.4.13-1).
Afraid I have a problem with this too. Now seems unable to charge via USB-C... I don't think anything else has changed, so assume this is related to the firmware upgrade?
Another thing I have noticed is that power consumption seems to be greater on 1.8.0, including while the system is suspended. This morning the battery seemed to lose 10 percentage points of charge over a couple of hours of suspend... (I haven't done rigorous measurements though.)
@tsmithe It might have done - the EC code that polls the processor for various info is timing out, and that alternates with the battery and charger. Will track those in #126 and then come back here once fixed.
On resuming from S3 suspend, the platform does not seem to load the GPU firmware correctly. See https://gitlab.freedesktop.org/drm/amd/-/issues/2403#note_1774334 for more information.
I am on the AMI firmware (since coreboot doesn't yet seem to be available, following #65), but if (when it becomes available) switching to coreboot would resolve this bug, that would be fine by me.