genodelabs / genode

Genode OS Framework
https://genode.org/
Other
1.08k stars 254 forks source link

Sculpt/x86: integrate ACPI suspend/resume #5180

Closed alex-ab closed 6 months ago

alex-ab commented 7 months ago

With #4669, #5101 and #5081 a basic test scenario is working, e.g. X201, X260 and T490.

To use this feature in Sculpt, several further steps are necessary, which this issue is about to track/co-ordinate.

Currently following missing pieces are identified:

alex-ab commented 7 months ago

@jschlatow: I discussed about the topic with @nfeske in the morning and I'm going to integrate the assigned pieces to me into Sculpt. May you have a look regarding the enablement of the pc_platform_drv? I gave it a short test in run/acpi_suspend but without success, which is expected. The re-initialization at least of the IOMMU is required, I presume. The second TMP commit may help to test drive the feature w/o actually going to suspend.

alex-ab commented 7 months ago

In the first TMP commit I disabled the IOMMU support, which shows that the pc_platform_drv is in principle working. I tested it on a X260, since I got the info that you have such a test machine.

jschlatow commented 7 months ago

@alex-ab I'll have a look next week. I do have an X260 but without AMT, I can get my hands on an X230 though.

alex-ab commented 7 months ago

@nfeske: my issue_5180 contains the adjustments for Sculpt you requested

nfeske commented 7 months ago

@alex-ab Thanks a lot. That looks just excellent!

alex-ab commented 7 months ago

@alex-ab I'll have a look next week. I do have an X260 but without AMT, I can get my hands on an X230 though.

Just as note: the serial output does not ever work after resume (no one re-power the device), e.g. it can also hang the resume, therefore it is disabled in the acpi_suspend.run script by default when used with hardware.

jschlatow commented 7 months ago

@alex-ab I set up an x230 with iPXE and AMT/SOL for testing. With IOMMU disabled, suspend/resume is basically functional, however, I made two observations:

  1. When the intel_fb_drv is "shut down", the display shows a scrambled image. When it is re-started, everything operates normally. I enabled the IOMMU and saw a read fault on address 0. This might be unrelated, though. On my x260, I cannot reproduce this effect.
  2. Sometimes, I see a huge resource request from nitpicker (see below). When this occurs, the display remains black after resume. Looking at the power led, the remaining system continues normally with the suspend/resume cycle.
[init -> dynamic_rom] system: change (notify non-restartable drivers)
[init -> dynamic_rom] system: sleep 4000 milliseconds
[init -> dynamic_rom] system: change (notify non-restartable drivers)
[init -> dynamic_rom] system: sleep 4000 milliseconds
[init -> nitpicker] Warning: Gui (wm -> wm -> terminal -> ) not enough RAM to preserve buffer content during resize
[init] child "nitpicker" requests resources: ram_quota=18446743592673218668
[init -> dynamic_rom] system: change (notify non-restartable drivers)
[init -> dynamic_rom] system: sleep 4000 milliseconds
jschlatow commented 7 months ago

2. Sometimes, I see a huge resource request from nitpicker (see below). When this occurs, the display remains black after resume. Looking at the power led, the remaining system continues normally with the suspend/resume cycle.

This is only triggered when I maximize the terminal window before the suspend. I instrumented the corresponding method in nitpicker. Apparently, there is a crude mode change that's causing this:

[init -> drivers_init -> intel_fb_drv] framebuffer reconstructed - virtual=1366x768 physical=1376x768
[init -> drivers_init -> intel_fb_drv]     LVDS-1:  enable name=' 1366x768' id=1  mode=1366x 768@60  fb=1366x 768
[init -> nitpicker] next_buffer_size: 4037440
[init -> nitpicker] orig_buffer_size: 3744000
[init -> nitpicker] mode: 1364x740
[init -> nitpicker] next_buffer_size: 530688
[init -> nitpicker] orig_buffer_size: 467712
[init -> nitpicker] mode: 1382x64
[init -> dynamic_rom] system: change (notify non-restartable drivers)
[init -> dynamic_rom] system: sleep 4000 milliseconds
[init -> dynamic_rom] system: change (notify non-restartable drivers)
[init -> dynamic_rom] system: sleep 4000 milliseconds
[init -> nitpicker] next_buffer_size: 18446743592673214572
[init -> nitpicker] orig_buffer_size: 4037440
[init -> nitpicker] mode: 4294967295x4294967269
[init -> nitpicker] Warning: Gui (wm -> wm -> terminal -> ) not enough RAM to preserve buffer content during resize
[init] child "nitpicker" requests resources: ram_quota=18446743592673218668
jschlatow commented 7 months ago

I tried to fix this with 46904c0. It fixes the RAM-quota check in nitpicker but then I get a similar resource request from terminal:

[init] Warning: terminal: RAM upgrade of Gui failed
[init] child "terminal" requests resources: ram_quota=18446743592669177132

@nfeske Do you have any suggestions?

alex-ab commented 7 months ago
1. When the intel_fb_drv is "shut down", the display shows a scrambled image. When it is re-started, everything operates normally. 

Since the intel display driver is killed, not orderly shut down, it is kind of expected. I currently discussing with @cnuke, @ssumpf to reduce the visible noise in this case.

nfeske commented 7 months ago

Do you have any suggestions?

Not from the tip of my head. I'd investigate where the botched up mode info comes from. When intel_fb vanishes, I presume that its capture session is closed. Nitpicker determines the mode dimensions using the max of all capture sessions (Capture_root::bounding_box). Could the vanished capture session lead to a situation where this max computation results in a mode that yields a integer underflow down the road? When looking at the core, the result in case of no capture session is (0,0). In https://github.com/genodelabs/genode/blob/master/repos/os/src/server/nitpicker/gui_session.cc#L399 we enforce the invariant of the buffer size of GUI client to be at least (1,1). Should we apply the same rationale in Capture_root::bounding_box?

jschlatow commented 7 months ago

When looking at the core, the result in case of no capture session is (0,0). In https://github.com/genodelabs/genode/blob/master/repos/os/src/server/nitpicker/gui_session.cc#L399 we enforce the invariant of the buffer size of GUI client to be at least (1,1).

I instrumented the mode() method to see what it returns. It actually returns (1,1), yet, the client gets (4294967295,429496726) from the RPC. I'm puzzled.

jschlatow commented 7 months ago

@alex-ab Have you ever observed the effect that the power-button does not wake up the laptop anymore? I've set up IOMMU re-initialization, which is working when NOVA is booted without the iommu option. However, once I enable this option the device is stuck in suspend because the power button seems to be ineffective. Could you maybe elaborate on the basic resume procedure?

alex-ab commented 7 months ago

@alex-ab Have you ever observed the effect that the power-button does not wake up the laptop anymore? I've set up IOMMU re-initialization, which is working when NOVA is booted without the iommu option. However, once I enable this option the device is stuck in suspend because the power button seems to be ineffective. Could you maybe elaborate on the basic resume procedure?

With commit 782357e5b801dab485945b160b955763a02e5c99 the pc_platform_drv now issues the final suspend and therefore the test component should not do it anymore, e.g. see 655d12536223be59b16a41685b37fab2f6998f9c.

When the nova kernel resumes it retakes the iommu and reinitialize it, so maybe you have to do your procedure of re-taking again (or you did already?), which you applied already during pc_platform_drv bootup ? With 782357e5b801dab485945b160b955763a02e5c99 after _suspend() finished, you already resumed and can apply your iommu adjustments.

If you have a branch I may test, please give a hint. The X201 next to me with pcmcia serial card is quite chatty after resume.

alex-ab commented 7 months ago
1. When the intel_fb_drv is "shut down", the display shows a scrambled image. When it is re-started, everything operates normally. 

Since the intel display driver is killed, not orderly shut down, it is kind of expected. I currently discussing with @cnuke, @ssumpf to reduce the visible noise in this case.

With commit a903486a980852b84a66bc1becb5d3d549be546e the visual noise vanishes for me, @cnuke, @ssumpf, please have a look and thx for your input.

ssumpf commented 7 months ago

@alex-ab:https://github.com/genodelabs/genode/commit/a903486a980852b84a66bc1becb5d3d549be546e looks good to me :+1:

jschlatow commented 7 months ago

With commit a903486 the visual noise vanishes for me, @cnuke, @ssumpf, please have a look and thx for your input.

I can confirm that a903486 removes the noise but only with disabled IOMMU. When I enable the IOMMU, I still get visual noise because the DMA buffers are freed while there still seem to be ongoing DMA transactions. I'm seeing a bunch of IOMMU faults from the address ranges where the display driver's DMA buffers had been. Without a903486, the faults were on address 0.

jschlatow commented 7 months ago

When the nova kernel resumes it retakes the iommu and reinitialize it,

Thanks for the hint. I wasn't fully aware of that and therefore forgot to disable queued invalidation. Issuing invalidation via the register-based interface while the queued interface was still enabled apparently got the system stuck. Commits 1e9abfe and e74a6a3 implement the corresponding changes to the platform driver.

alex-ab commented 7 months ago

@jschlatow: very nice, I will give it a try on the machines I have here

nfeske commented 7 months ago

Commit https://github.com/nfeske/genode/commit/c4df9d8fa2223d8b215e791294da058ded01174f implements the sculpt-manager support for suspend/resume. Tested on my x260 using the TMP commits on my https://github.com/nfeske/genode/commits/sculpt_24_04/ branch.

nfeske commented 7 months ago

With @jschlatow's latest platform-driver changes, I no longer need the TMP commits and can leave the IOMMU enabled. Hence, I merged the current state to staging. So the interactive suspend option can be more broadly tested.

The noisy pattern on suspend is present on the x260. Maybe one possible remedy would be to disable the intel_fb connectors before killing intel_fb? But that requires additional sculpt-manager support (watching the connectors report, generating the intel-fb config) that is too elaborate at present.

nfeske commented 7 months ago

On the x250, the following messages occur on suspend:

Genode 24.02-184-g7570cc95539 <local changes>
7729 MiB RAM and 63249 caps assigned to init
[init] Warning: runtime: configured RAM exceeds available RAM, proceed with 7600413008
[ 0] IOMMU:0xffffffff817ea060 FRR:0 FR:0x1 BDF:0:2:0 FI:0x00000000 (0)
[ 0] IOMMU:0xffffffff817ea0b0 FRR:0 FR:0x5 BDF:0:1f:2 FI:0xca6ce000 (0)
[ 0] IOMMU:0xffffffff817ea060 FRR:0 FR:0x6 BDF:0:2:0 FI:0xd0940000 (0)
[ 0] IOMMU:0xffffffff817ea060 FRR:0 FR:0x6 BDF:0:2:0 FI:0xd09f6000 (0)
[ 0] IOMMU:0xffffffff817ea060 FRR:0 FR:0x6 BDF:0:2:0 FI:0xd0b07000 (0)
[ 0] IOMMU:0xffffffff817ea060 FRR:0 FR:0x6 BDF:0:2:0 FI:0xd0410000 (0)
[ 0] IOMMU:0xffffffff817ea060 FRR:0 FR:0x6 BDF:0:2:0 FI:0xd0410000 (1)
[ 0] IOMMU:0xffffffff817ea060 FRR:0 FR:0x6 BDF:0:2:0 FI:0xd061a000 (0)
[ 0] IOMMU:0xffffffff817ea060 FRR:0 FR:0x6 BDF:0:2:0 FI:0xd0713000 (0)
[ 0] IOMMU:0xffffffff817ea060 FRR:0 FR:0x6 BDF:0:2:0 FI:0xd082b000 (0)
[ 0] IOMMU:0xffffffff817ea060 - disabling fault reporting
Error: illegal WRITE at address 0x14040000 by pager_object: pd='init -> runtime -> intel_fb' thread='ep' ip=0x11d2df3

After a while:

The system is powered off.

I'm unable to wake up the x250.

jschlatow commented 7 months ago

I'm unable to wake up the x250.

Do you at least see the power led blinking when trying to wake up the x250?

jschlatow commented 7 months ago

[ 0] IOMMU:0xffffffff817ea0b0 FRR:0 FR:0x5 BDF:0:1f:2 FI:0xca6ce000 (0)

This is the same fault as you reported in #5066. I'm wondering what the AHCI controller is expecting at 0xca6ce000.

jschlatow commented 7 months ago

@nfeske Can you cherry pick 9f2187e and run the test on the x250 again? The commit adds debug output so we can see when the devices are enabled/disabled (for the IOMMU) and when the DMA buffers are unmapped.

nfeske commented 7 months ago

Do you at least see the power led blinking when trying to wake up the x250?

The led keeps pulsing slowly like undisturbed sleep.

jschlatow commented 7 months ago

Do you at least see the power led blinking when trying to wake up the x250?

The led keeps pulsing slowly like undisturbed sleep.

@alex-ab Could this be related to what you mentioned earlier with the resume hanging when serial output is used?

@nfeske Could you share your /report/drivers/iommu? I'd like to make sure that the x250 has no weird capability flags set.

nfeske commented 7 months ago

I gave the suspend feature a try on an x201 and a Framework Gen13 booted off a USB stick (using staging). For those two machines, I was unable to trigger the resume. The suspend-resume cycle on the x260, however, works just fine with the same USB boot stick.

nfeske commented 7 months ago

Could you share your /report/drivers/iommu?

Here you go:

<iommu>
    <intel name="drhd1" dma_remapping="true" msi_remapping="true" irq_remapping="true" version="1.0">
        <register name="Capability" value="0xd2008c20660462" esrtps="false" esirtps="false" rwbf="false" nfr="0" domains="2" caching="false"/>
        <register name="Extended Capability" value="0xf010da" interrupt_remapping="true" page_walk_coherency="false"/>
        <register name="Global Status" value="0xc3000000" qies="false" ires="true" rtps="true" irtps="true" cfis="false" enabled="true"/>
    </intel>
    <intel name="drhd0" dma_remapping="true" msi_remapping="true" irq_remapping="true" version="1.0">
        <register name="Capability" value="0x1c0000c40660462" esrtps="false" esirtps="false" rwbf="false" nfr="0" domains="2" caching="false"/>
        <register name="Extended Capability" value="0x7e1ff0505e" interrupt_remapping="true" page_walk_coherency="false"/>
        <register name="Global Status" value="0xc3000000" qies="false" ires="true" rtps="true" irtps="true" cfis="false" enabled="true"/>
    </intel>
</iommu>
jschlatow commented 7 months ago

@nfeske There are three scenarios you could try with the x201 and Framework:

  1. Remove the iommu option from NOVA's command line.
  2. Disable the IOMMU for the platform driver (17d219d).
  3. Combine 1+2
nfeske commented 7 months ago

I just gave the combination of 1+2 a try. Same result on the Framework and x201. On the x250, this version does not manage to enter the suspend state (screen freezes - presumably because intel-fb is removed - but it never gets dark).

alex-ab commented 6 months ago

I gave the test X201 a suspend/resume test. It works for me if serial is off and the depot is not on the USB stick (USB booted -> ram_fs as depot source).

alex-ab commented 6 months ago

My T460p works, but I just used ahci as boot medium and a bit ethernet. Resume works, internal screen flickers all the time afterwards, but external is stable. When using the audio driver, however wakeup does not work. So, it is mainly the state I expect atm.

nfeske commented 6 months ago

With commit https://github.com/alex-ab/genode/commit/f450cdd02824f29655d5126e19bad9cffea11e7a suspend-resume works on the Gen13 Framework.

nfeske commented 6 months ago

I gave the test X201 a suspend/resume test.

Can confirm. The suspend-resume cycle completes now on my x201 as well. After resume, the wifi driver stumbles (https://github.com/genodelabs/genode/issues/5185). Manual restarting the wifi driver does not help.

nfeske commented 6 months ago

The suspend-resume cycle completed successfully on my Gen12 Framework laptop.

alex-ab commented 6 months ago

During resume the intel display driver races with the resume of the intel gpu driver, and may fail with a write fault because of not available mmio:

[init -> drivers_init -> intel_fb_drv] time-clocksource: Switched to clocksource dde_counter
[init -> drivers_init -> intel_fb_drv] disabling PPGTT to avoid GPU code paths
Error: illegal WRITE at address 0x1100a278 by pager_object: pd='init -> drivers_init -> intel_fb_drv' thread='ep' ip=0x115b144
[init -> dynamic_rom] system: change (system normal)
[init -> dynamic_rom] system: sleep 30000 milliseconds
[init -> gpu_drv] driver resumed
...

Thanks to the heartbeat+restart mechanism in Sculpt, this may not be fatal (but not desired). With acpi_suspend.run it fails however reliable/reproducible.

alex-ab commented 6 months ago

With the 3 commits, the reported race on resume is avoided.

nfeske commented 6 months ago

Thanks. I merged the 3 commits to staging.

alex-ab commented 6 months ago

@nfeske: the issue_5180 contains the blanking state change for intel/display and intel/gpu and the acpi_suspend test case. Do you want to do the sculpt integration or should I ?

alex-ab commented 6 months ago

@nfeske: I updated issue_5180, which now contains also the integration into Sculpt. With the commits the graphical noise is gone for me, on a T490.

nfeske commented 6 months ago

Thank you for the commit series. Sorry about the accidental gpu-driver restart. Bummer. I merged the commit series to staging and will immediately the new version a try.

nfeske commented 6 months ago

@alex-ab I perceive the immediate blanking as a massive improvement (tested on the x260). The mechanism feels so much more solid this way.

nfeske commented 6 months ago

@alex-ab fixup https://github.com/genodelabs/genode/commit/dc17c5f89dcc11b251de1cffb6c474c83f461b26 implements the idea we came up with @chelmuth this morning. Could you give it a review and try?

alex-ab commented 6 months ago

@alex-ab fixup dc17c5f implements the idea we came up with @chelmuth this morning. Could you give it a review and try?

Looks good and works, tested on two machines :+1:

nfeske commented 6 months ago

@alex-ab unfortunately, commit https://github.com/genodelabs/genode/commit/f7e11f303db71256fc3614b0ad3b9afbe121d429 (avoid restarting intel_gpu) prevents successful resume on my Gen12 Framework (it works with the commit reverted).

nfeske commented 6 months ago

It works on the Gen13 Framework though.

nfeske commented 6 months ago

The commits entered the master branch. Let us open new issues for subsequent steps.

chelmuth commented 3 months ago

For the record, I just successfully tested two consecutive suspend-resume cycles on T14 Gen1 with Sculpt 24.04.1.