Suspend/Resume working + patch

yanko-yankulov commented 7 years ago

Hi guys,

Suspend / Resume worked out for me on Skylake GPU almost out of the box (drm-next-4.7 branch). The only thing I had to do is add calls to the suspend_late & resume_early calls.

Simple patch below.

Thanks for all the work that made this possible

--- a/sys/compat/linuxkpi/common/src/linux_pci.c
+++ b/sys/compat/linuxkpi/common/src/linux_pci.c
@@ -315,7 +315,14 @@ int pci_default_suspend(struct pci_dev *dev,
         int err;

         if(dev->pdrv->driver.pm->suspend != NULL)
+        {
                 err = -dev->pdrv->driver.pm->suspend(&(dev->dev));
+                if( err == 0 )
+                {
+                    if( dev->pdrv->driver.pm->suspend_late != NULL )
+                        err = -dev->pdrv->driver.pm->suspend_late(&(dev->dev));
+                }
+        }
         else
                 err = 0;

@@ -326,6 +333,13 @@ int pci_default_resume(struct pci_dev *dev)
 {
         int err;

+        if(dev->pdrv->driver.pm->resume_early != NULL )
+        {
+            err = -dev->pdrv->driver.pm->resume_early(&(dev->dev));
+            if( err )
+                return err;
+        }
+
         if(dev->pdrv->driver.pm->resume != NULL)
                 err = -dev->pdrv->driver.pm->resume(&(dev->dev));
         else
--

problame commented 7 years ago

Did suspend & resume work with Xorg running and resuming correctly? Would certainly motivate me to try drm-next-4.7 on my t460s...

yanko-yankulov commented 7 years ago

Yep,

Suspend & resume from Xorg works as well.

abishai commented 7 years ago

@yanko-yankulov Have you experienced this problem without patch ? https://github.com/FreeBSDDesktop/freebsd-base-graphics/issues/68 This one keeps me away from drm-next branch mostly.

mattmacy commented 7 years ago

@abishai that's whole point of the issue - he needed to add the patch in order for it to work.

abishai commented 7 years ago

Maybe there are several issues with resume, not only pipe A faults. Probably, it's time to test my laptop again.

mattmacy commented 7 years ago

@abishai have you tried the patch? Without it suspend/resume doesn't work for anyone on skylake.

yanko-yankulov commented 7 years ago

Without the patch I got a psychadelic display with all the colours of the rainbow but no actual image (after the first resume, that is). With it everything is ok. I have not experienced any other major issue and everything seems to work just fine. I guess I just might have been lucky and there are other hardware/configuration issues that can cause trouble. I dont have different hardware at the moment to test it.

However a patch like that seems to be necessary as part of the low-level suspend/resume code is happening in the late/early callbacks. The comments in the code explain that the split is needed to solve races with the HDMI audio output device so the quick hack I did in the patch seems as a goog enough initial solution.

So give it a try - it really 'just worked' for me.

yanko-yankulov commented 7 years ago

Just one more thing - I havent checked the dmesg without the patch but the description of #68 seems to be exactly what I observed. (Althoug I believe my lines have been horisontal. Didn't really pay much attention) My bad that I havent noticed it before opening this issue.

mattmacy commented 7 years ago

FYI, I've merged this in to drm-next and I can now resume on my Skylake (Gen4 X1). However, I'm getting a blank screen and shortly after resume the hangcheck timer causes a lock inversion panic. The cores I'm getting are usually corrupted so it's hard to see what locks the thread is holding.

mattmacy commented 7 years ago

I've fixed the panic (the third one so far testing suspend/resume!), so I can still work remotely on the laptop after resume. However, the display still doesn't come back. I'm also getting a panic on shutdown

abishai commented 7 years ago

I think I have some time today to put my laptop in jeopardy. Should I use drm-next for testing?

mattmacy commented 7 years ago

You should try it yes, but I don't think suspend resume really works there yet, otherwise I don't think it has any regressions with respect to drm-next-4.7

abishai commented 7 years ago

My laptop died during llvm compilation:) That's all my long tongue about laptop in jeopardy, but power supply is really burned out. I hope Dell can replace it fast enough.

yanko-yankulov commented 7 years ago

A quick update - just tried fresh drm-next. Suspend/Resume works, but the screen is blank - the same as @mattmacy said. So there seems to be something missing in the drm-next branch. I would like to poke the code a bit to see if something comes out, but not quite sure when I will get to it :).

grahamperrin commented 7 years ago

Anecdotally: post 5 and others under Screen doesn’t turn on when waking from sleep – TrueOS Community

mattmacy commented 7 years ago

@grahamperrin point others here. I don't have the bandwidth to go on forums.

grahamperrin commented 7 years ago

@mattmacy thanks for the direction; done – graphics-oriented post 11 under Resume from suspend (wake from sleep) and from hibernation.

(I hesitated because I wasn't sure about the scope of this issue.)

dlocklear01 commented 7 years ago

I am a computer novice using TrueOS ( FreeBSD 12.0 ) I have Intel i7. Suspend feature locks up computer. I do not know how to cut and paste the patch shown above.

yanko-yankulov commented 7 years ago

@mattmacy A patch for drm-next. It took me like three times going through the diffs to finally see it. And I had to start inserting printfs to realize that they will not get called, but yeah its working :).

diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 2e30cff0c57..267501f0a31 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -474,6 +474,8 @@ static struct pci_driver i915_pci_driver = {
        .probe = i915_pci_probe,
        .remove = i915_pci_remove,
        .driver.pm = &i915_pm_ops,
+       .suspend = pci_default_suspend,
+       .resume = pci_default_resume
 };

 static int __init i915_init(void)

trombonehero commented 7 years ago

After applying @yanko-yankulov's driver methods patch, suspend and resume seem to work somewhat! Suspend/resume from the console seems pretty solid with i915kms loaded, but I do get a panic a few seconds after a suspend/resume cycle from X. One of the kasserts I saw was _page_insert_after: page already inserted with the following backtrace:

vm_page_insert_after
remap_io_mapping
i915_gem_fault
linux_cdev_pager_populate
vm_fault_hold
vm_fault
trap_pfault
trap
calltrap

The other was vm_reserv_populate: reserv %p is already promoted with the following (partially similar) backtrace:

vm_reserv_populate
vm_reserv_alloc_page
vm_page_alloc
vm_fault_hold
vm_fault
trap_pfault
trap
calltrap

This is also using the latest fixes in markjdb/freebsd-base-graphics... are these two things likely to be related?

Finally, is the advice about driver selection at https://github.com/FreeBSDDesktop/freebsd-base-graphics/wiki/Testing-And-Debugging-Tips still current, or should I be using the modesetting driver instead? Perhaps that would explain the difference between the console and X behaviours?

trombonehero commented 7 years ago

A further update: after merging 3ad2821 from markjdb/freebsd-base-graphics, I no longer have these crashes. In fact, the system is stable enough that I might enable hw.acpi.lid_switch_state!

yanko-yankulov commented 7 years ago

Agreed.

I played a bit more and I hit the same crashes. Running glxgears after the first resume causes them immediately. I have just not ran anything gl-dependant before. Applying Mark Johnston's fixes solves the issue.

dlocklear01 commented 7 years ago

Could you please explain to a computer novice ( me ) how to attempt fix. I really do not understand any terminal commands.

On Feb 3, 2017 4:23 PM, "yanko-yankulov" notifications@github.com wrote:

Agreed.

I played a bit more and I hit the same crashes. Running glxgears after the first resume causes them immediately. I have just not ran anything gl-dependant before. Applying Mark Johnston's fixes solves the issue.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/FreeBSDDesktop/freebsd-base-graphics/issues/111#issuecomment-277378962, or mute the thread https://github.com/notifications/unsubscribe-auth/AYUZ5RB2YTpo8RsDqTrdKsR9PYARJXc9ks5rY6jngaJpZM4LoX2j .

mattmacy commented 7 years ago

@yanko-yankulov I made a similar change that did not require a change to the driver itself. https://github.com/FreeBSDDesktop/freebsd-base-graphics/commit/e1b3fdbffece21b54ca5834a68f09a7231781323

yanko-yankulov commented 7 years ago

Greate, thanks. I will give it a try.

I am currently chasing an EAGAIN followed by EFAULT in GEM_EXECBUFFER2 witch leads to "(EE) intel(0): Failed to submit rendering commands (Bad address), disabling acceleration." in Xorg log, but will test the patch (hopefully) later today.

mattmacy commented 7 years ago

I had that early on too. I don't recall how I fixed it though

grahamperrin commented 7 years ago

@dlocklear01 hi, I suggest waiting a few days. We expect a fix to be made readily available to testers of the UNSTABLE repository for TrueOS Desktop.

yanko-yankulov commented 7 years ago

@mattmacy The patch works perfectly. Thanks.

Regarding the crashes that @trombonehero reported, the commit that solves them seems to be https://github.com/markjdb/freebsd-base-graphics/commit/b5d2ebc23a45b6ebbcacf7b6857e6be84e9430f3 with the addition of https://github.com/markjdb/freebsd-base-graphics/commit/3ad2821545d358d24d895b17ada19f6f79d7c64b to allow compilation.

About the EAGAINs and EFAULTs that I mentioned earlier I am almost 100% sure are related https://github.com/markjdb/freebsd-base-graphics/commit/681ae75d83eab79b489c5bc7f5c9cf4a847a2439 that I cherry-picked with the rest of the Mark Johnston's fixes. I still have to verify it, but the EAGAINs seems perfectly fine, and the faults originate in get_user_pages_remote which is the main affected function in the commit.

And I was absolutely, absolutely sure, that I retested with clean drm-next before delving into the code... (Note to self 'make buildkernel' is not quite the same as 'make kernel' ). :)

Thanks again

markjdb commented 7 years ago

The aforementioned commits have been pushed to drm-next.

yanko-yankulov commented 7 years ago

Hi,

@markjdb thanks for the push.

With the current drm-next (27fdac1031286aaaf152e29bf3bdd042c95595bf) I am reliably getting the mentioned EFAULTs and disabled acceleration in X.

How to reproduce - start X, launch firefox, restore previous session, click on a few tabs to reload them and voila.

With e47e330c4bc73ad3327dac454b8869d17ba710cf reverted I am not seeing the issue.

I have also added a test debug printf in get_user_pages_remote (before reverting e47e330c4bc73ad3327dac454b8869d17ba710cf ) to compare new vm_map from from the mm_struct, and the old vm_map from the task and they are different.

The path to reaching get_user_pages_remote seems straightforward.

ioctl(I915_GEM_EXECBUFFER2) with at least one of the buffers with substential size (11MB or so) and probably first time the buffer is referenced.
gem_do_execbuffer / gem_execbuffer_reserve / i915_gem_execbuffer_reserve_vma / i915_vma_pin / i915_vma_insert / i915_gem_object_get_pages / i915_gem_userptr_get_pages
the initial __get_user_pages_fast in i915_gem_userptr_get_pages fails to ref all the pages and the task gets scheduled - and from the task the call get_user_pages_remote.

I have not yet traced the root cause further. I am still trying to to wrap my mind around all that code and it would take me some time to figure it all out.

With the revert of the mentioned commit suspend/resume works perfectly so far. Thanks.

hselasky commented 7 years ago

Hi,

Can you re-test with the latest drm-next branch?

--HPS

yanko-yankulov commented 7 years ago

Hi,

Exact same behavior (EFAULT/acceleration off) on current head(aeeba20f7ceb3dca0e28ac62eeed14d5f9ec8eeb). And again issue is not observed with e47e330 reverted.

hselasky commented 7 years ago

Which intel driver are you using?

pkg info | grep intel

--HPS

hselasky commented 7 years ago

If you are using the stock one, can you try to install this one instead (no need to recompile other ports or dependencies): https://github.com/FreeBSDDesktop/freebsd-ports-graphics/tree/xserver-mesa-next/x11-drivers/xf86-video-intel

--HPS

yanko-yankulov commented 7 years ago

xf86-video-intel-2.99.917.20160614 from xserv-mesa-next-udev branch

yanko-yankulov commented 7 years ago

it's the same in xserver-mesa-next

yanko-yankulov commented 7 years ago

dri3 is compiled but not enabled at runtime

hselasky commented 7 years ago

OK, I think @markjdb needs to have a look at this.

yanko-yankulov commented 7 years ago

If no body else is seeing the same issue don't worry, I will dig into it. @markjdb Do you happen to have a test case for the issue the commit fixes?

markjdb commented 7 years ago

@yanko-yankulov I've reverted the commit for now - I'm currently working on fixing issues with the amdgpu driver; I'll swing back to this when I have more time.

I hit the issue with one of the tests in gem_mmap_userptr here: https://github.com/markjdb/intel-gpu-tools . The code with e47e330 reverted is definitely wrong, but it seems that I'm missing something.

yanko-yankulov commented 7 years ago

Hey @markjdb, thanks for that. I looked more carefully at the patch and the Linux code and it deffinitely seems like the right thing to do. I will try to trace the issue further as it got me hooked. Will update if I find somehting.

yanko-yankulov commented 7 years ago

I am blind like a bat :)

@markjdb this fixes the problem:

--- a/sys/compat/linuxkpi/common/src/linux_fs.c
+++ b/sys/compat/linuxkpi/common/src/linux_fs.c
@@ -476,7 +476,7 @@ get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 {
        vm_map_t map;

-       map = &((struct vmspace *)mm->vmspace)->vm_map;
+       map = &(*(struct vmspace **)mm->vmspace)->vm_map;
        return (__get_user_pages_internal(map, start, nr_pages,
            !!(gup_flags & FOLL_WRITE), pages));
 }

To correspond to the "mm->vmspace = &td->td_proc->p_vmspace;" in linux_alloc_current.

I am wondering though, wouldn't it be better to save a pointer to the td_proc in the mm_struct instead of a pointer inside it? If this was the initial idea of course.

markjdb commented 7 years ago

That makes two of us. :)

Actually, I would fix this other way - linux_alloc_current() should just assign mm->vmspace = td->td_proc->p_vmspace rather than taking the address. In any case, thank you for tracking this down!

yanko-yankulov commented 7 years ago

@markjdb. Np.

I was wondering if the vmspace could get changed sometimes after the alloc_current - I doubt it, but saw several assignments in the vm code so just mentioning it.

markjdb commented 7 years ago

You're right. There is a couple of cases:

AIO plays some tricks where it temporarily replaces the vmspace, but we don't really care about this
A new vmspace is allocated upon execve. This is indeed a problem, but needs a more holistic solution: much of the LinuxKPI per-thread state is invalidated upon execve, and we should be registering a handler to fix it up as needed.

dch commented 7 years ago

FWIW just tried zzz on my dell xps13 and drm-next today. works perfectly. kudos and thanks for your efforts!

mattmacy commented 7 years ago

There are open issues with resume not being able to correctly reload firmware, but that's separate from this.

FreeBSDDesktop / DEPRECATED-freebsd-base-graphics

Suspend/Resume working + patch #111