Bug when modprobing nvidia in some conditions

ArchangeGabriel commented 12 years ago

I'm reporting here a problem that some users raised on the French forum. At least 3 people are facing it, and they are using different distros (at least Ubuntu and Debian).

The problem appears after the first disable/enable cycle and disappears only after a reboot. In effect, Bumblebee doesn't work at all: the card starts out disabled, so you have to enable it whenever you want to use Bumblebee.

Bumblebee can't start the server because the nvidia driver can't handle the card:

[ 123.032092] acpi_call: Calling \_SB.PCI0.PEG0.PEGP._PS0
[ 123.560201] acpi_call: Call successful: 0x0
[ 124.131784] nvidia 0000:01:00.0: power state changed by ACPI to D0
[ 124.131790] nvidia 0000:01:00.0: power state changed by ACPI to D0
[ 124.131815] nvidia 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 124.131832] nvidia 0000:01:00.0: setting latency timer to 64
[ 124.131844] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=none
[ 124.132022] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:0df4) installed
[ 124.132025] NVRM: in this system is not supported by the 280.13 NVIDIA Linux
[ 124.132028] NVRM: graphics driver release. Please see 'Appendix A -
[ 124.132030] NVRM: Supported NVIDIA GPU Products' in this release's README,
[ 124.132032] NVRM: available on the Linux graphics driver download page at
[ 124.132034] NVRM: www.nvidia.com.
[ 124.132049] nvidia 0000:01:00.0: PCI INT A disabled
[ 124.132065] nvidia: probe of 0000:01:00.0 failed with error -1
[ 124.132115] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 124.132119] NVRM: None of the NVIDIA graphics adapters were initialized!

starks commented 12 years ago

Other than the Clevo W150HRM, what machines have you seen this on?

AFAIK, the DSM+PS3 method to turn off the card is valid even though the echoed message suggests the card isn't ready before PS3.

 echo _DSM $(acpi_call "\_SB.PCI0.PEG0.PEGP._DSM" \
     "{0xF8,0xD8,0x86,0xA4,0xDA,0x0B,0x1B,0x47," \
      "0xA7,0x2B,0x60,0x42,0xA6,0xB5,0xBE,0xE0}" \
      "0x100 0x1A {0x1,0x0,0x0,0x3}")
 # ok to turn off: Buffer {0x59 0x0 0x0 0x11}
 # is already off: Buffer {0x41 0x0 0x0 0x11}

eric@kingfisher ~ $ sh clevo.sh info
_DSM {0x41, 0x00, 0x00, 0x11}

As for PS0, the method flips the VGA LED and seems to make the card at least electrically active.

Also note that keeping the nvidia module loaded across cycles will cause a soft kernel panic.

Finally, I will not be able to test this bug with Nouveau until a 3.1 Ubuntu kernel pops up.

ArchangeGabriel commented 12 years ago

Asus 1215N (not sure)
Dell Inspiron R15
Asus K93SV (the above log is from this machine)

It's interesting to see that all machines facing the problem use the calls below (the same for all of them):

cardoff:

\_SB.PCI0.PEG0.PEGP._DSM {0xF8,0xD8,0x86,0xA4,0xDA,0x0B,0x1B,0x47,0xA7,0x2B,0x60,0x42,0xA6,0xB5,0xBE,0xE0} 0x100 0x1A {0x1,0x0,0x0,0x3}
\_SB.PCI0.PEG0.PEGP._PS3

cardon:

\_SB.PCI0.PEG0.PEGP._PS0
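
For readers unfamiliar with acpi_call, here is a minimal sketch of how the calls above are sent. It assumes the acpi_call module is loaded and exposes /proc/acpi/call: a call is written to that file and its result is read back from it. The helper names (`acpi_call`, `card_off`, `card_on`) and the `ACPI_CALL_FILE` override are illustrative assumptions, not part of Bumblebee, and the method path is the one from this thread, so it only applies to machines that really use it.

```shell
#!/bin/sh
# Sketch only: wrappers around the acpi_call /proc interface.
# ACPI_CALL_FILE is overridable (assumption) so the helpers can be
# tried against a plain file instead of the real interface.
ACPI_CALL_FILE="${ACPI_CALL_FILE:-/proc/acpi/call}"

acpi_call() {
    # Write the method path plus arguments, then read back the result.
    # printf is used instead of echo so the leading backslash is kept.
    printf '%s\n' "$*" > "$ACPI_CALL_FILE" || return 1
    cat "$ACPI_CALL_FILE"
}

# Path and _DSM arguments copied from the calls quoted in this thread.
GPU_PATH='\_SB.PCI0.PEG0.PEGP'
DSM_UUID='{0xF8,0xD8,0x86,0xA4,0xDA,0x0B,0x1B,0x47,0xA7,0x2B,0x60,0x42,0xA6,0xB5,0xBE,0xE0}'

card_off() {
    # _DSM first, then _PS3, in the same order as the cardoff calls.
    acpi_call "$GPU_PATH._DSM $DSM_UUID 0x100 0x1A {0x1,0x0,0x0,0x3}" &&
    acpi_call "$GPU_PATH._PS3"
}

card_on() {
    acpi_call "$GPU_PATH._PS0"
}
```

Pointing `ACPI_CALL_FILE` at a scratch file before calling `card_off` or `card_on` shows exactly what would be written without touching the hardware.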

Samsagax commented 12 years ago

It's interesting to see that all machines facing the problem use the calls below (the same for all of them)

If the calls are the same, they are most likely wrong. The _DSM arguments surely differ from one machine to another.

ArchangeGabriel commented 12 years ago

The DSM args are the same for most machines, but I always verify them before sending calls to the people who ask for them.

In this case, those calls were all given by me.

In fact, the part of the call I'm talking about is the "path": \_SB.PCI0.PEG0.PEGP. This "path" generally varies a lot from one model to another, yet all the machines affected here use this one.

Samsagax commented 12 years ago

Just wondering: maybe there is a _DSM call to make after the _PS0 call? It seems to me that somehow the turn off/on should be "symmetric" (my brainfart at 2am)

ArchangeGabriel commented 12 years ago

Before, you mean.

Not exactly a DSM call. It is far more complicated to reach the next level, and by the way I'm trying to reach the last one directly, as it is seemingly the same.

starks commented 12 years ago

You found something new?

ArchangeGabriel commented 12 years ago

For the bug, maybe. We found some bugs in acpi_call, and Lekensteyn is fixing them (by the way, he has already fixed one; I need to test it and check with him whether there is anything else).

starks commented 12 years ago

On 10/18/2011 02:30 PM, Bruno Pagani wrote:

For the bug, maybe. We found some bugs in acpi_call, and Lekensteyn is fixing them (by the way, he has already fixed one; I need to test it and check with him whether there is anything else).

AFAIK, the bumblebee branch doesn't help. I still get panics.

Lekensteyn commented 12 years ago

@ArchangeGabriel The bugs I've found in acpi_call are memory corruption issues which would only appear in cases where the call returns a large buffer. I've just pushed it to Bumblebee-Project/acpi_call/master and will upload a new PPA package soon.

starks commented 12 years ago

https://launchpad.net/ubuntu/+source/linux/3.1.0-1.1

Still building, but I'm ready to have some fun in a few hours.

Hopefully nouveau will play nicer, but I haven't seen that so far with 3.0 and mupuf's or darktama's nouveau tree modules.

starks commented 12 years ago

And no dice. Still won't power on.

I have another avenue worth exploring though.

Take a look at byo-switcheroo.c's approach to DSM calls: https://github.com/awilliam/asus-switcheroo/blob/master/byo-switcheroo.c

#define UL30VT_DIS_OFF "_DSM {0xA0,0xA0,0x95,0x9D,0x60,0x00,0x48,0x4D,0xB3,0x4D,0x7E,0x5F,0xEA,0x12,0x9F,0xD4} 0x102 0x3 {0x2,0x0,0x0,0x0}"
#define UL30VT_DIS_ON "_DSM {0xA0,0xA0,0x95,0x9D,0x60,0x00,0x48,0x4D,0xB3,0x4D,0x7E,0x5F,0xEA,0x12,0x9F,0xD4} 0x102 0x3 {0x1,0x0,0x0,0x0}; !mdelay 100"
#define UL30VT_SWITCHTO_DIS "MXMX 0x1; MXDS 0x1; _DSM {0xA0,0xA0,0x95,0x9D,0x60,0x00,0x48,0x4D,0xB3,0x4D,0x7E,0x5F,0xEA,0x12,0x9F,0xD4} 0x102 0x2 {0x12,0x0,0x0,0x0}; !nouveau_fbcon_output_poll_changed"
#define UL30VT_SWITCHTO_IGD "MXMX 0x1; MXDS 0x1; _DSM {0xA0,0xA0,0x95,0x9D,0x60,0x00,0x48,0x4D,0xB3,0x4D,0x7E,0x5F,0xEA,0x12,0x9F,0xD4} 0x102 0x2 {0x11,0x0,0x0,0x0}"

Lekensteyn commented 12 years ago

This is the part of the _DSM method that handles the call from the second post (function 0x1A):

If (LEqual (Arg2, 0x1A))
{
    CreateField (Arg3, 0x18, 0x02, OMPR)
    CreateField (Arg3, Zero, One, FLCH)
    If (ToInteger(FLCH))
    {
        Store (OMPR, \_SB.PCI0.PEG0.PEGP.OPCE)
    }
    Store (Buffer(0x04)
        {
            0x00, 0x00, 0x00, 0x00
        }, Local0)
    CreateField (Local0, Zero, One, OPEN)
    CreateField (Local0, 0x03, 0x02, CGCS)
    CreateField (Local0, 0x06, One, SHPC)
    CreateField (Local0, 0x18, 0x03, DGPC)
    CreateField (Local0, 0x1B, 0x02, HDAC)
    Store (One, OPEN)
    Store (One, SHPC)
    Store (0x02, HDAC)
    Store (One, DGPC)
    If (LNotEqual (\_SB.PCI0.PEG0.PEGP.SGST (), Zero))
    {
        Store (0x03, CGCS)
    }
    Return (Local0)
}

I've posted an analysis of another _DSM function on the hybrid-graphics-linux mailing list.
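
To relate the listing to the buffers quoted earlier in the thread, here is a small sketch that splits a 4-byte _DSM reply into the fields declared by the CreateField lines (bit offsets and widths are taken from the listing, with bit 0 as the least significant bit of the first byte). The function name `decode_dsm` is made up for illustration.

```shell
#!/bin/sh
# decode_dsm B0 B1 B2 B3: print the named fields of a 4-byte _DSM reply.
# Field layout comes from the CreateField lines in the ASL listing:
#   OPEN bit 0 (1 bit), CGCS bit 3 (2 bits), SHPC bit 6 (1 bit),
#   DGPC bit 24 (3 bits), HDAC bit 27 (2 bits).
decode_dsm() {
    # Assemble the little-endian 32-bit word from the four bytes.
    word=$(( $1 | ($2 << 8) | ($3 << 16) | ($4 << 24) ))
    printf 'OPEN=%d CGCS=%d SHPC=%d DGPC=%d HDAC=%d\n' \
        $(( word & 1 )) \
        $(( (word >> 3) & 3 )) \
        $(( (word >> 6) & 1 )) \
        $(( (word >> 24) & 7 )) \
        $(( (word >> 27) & 3 ))
}
```

`decode_dsm 0x59 0x0 0x0 0x11` prints `OPEN=1 CGCS=3 SHPC=1 DGPC=1 HDAC=2`, and `decode_dsm 0x41 0x0 0x0 0x11` prints `OPEN=1 CGCS=0 SHPC=1 DGPC=1 HDAC=2`. The only difference is CGCS, which the listing sets to 3 when SGST() returns non-zero. That is consistent with the "ok to turn off" and "is already off" comments in the second post.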

ArchangeGabriel commented 12 years ago

I'm facing this error too now. Unable to use the card after disabling/enabling it.

Oct 27 14:12:06 Archange-U43Jc kernel: [ 1878.366598] nvidia 0000:01:00.0: setting latency timer to 64
Oct 27 14:12:06 Archange-U43Jc kernel: [ 1878.366604] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=none
Oct 27 14:12:06 Archange-U43Jc kernel: [ 1878.366633] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:0a70) installed
Oct 27 14:12:06 Archange-U43Jc kernel: [ 1878.366634] NVRM: in this system is not supported by the 290.03 NVIDIA Linux
Oct 27 14:12:06 Archange-U43Jc kernel: [ 1878.366634] NVRM: graphics driver release. Please see 'Appendix A -
Oct 27 14:12:06 Archange-U43Jc kernel: [ 1878.366635] NVRM: Supported NVIDIA GPU Products' in this release's README,
Oct 27 14:12:06 Archange-U43Jc kernel: [ 1878.366636] NVRM: available on the Linux graphics driver download page at
Oct 27 14:12:06 Archange-U43Jc kernel: [ 1878.366637] NVRM: www.nvidia.com.
Oct 27 14:12:06 Archange-U43Jc kernel: [ 1878.366645] nvidia: probe of 0000:01:00.0 failed with error -1
Oct 27 14:12:06 Archange-U43Jc kernel: [ 1878.366661] NVRM: The NVIDIA probe routine failed for 1 device(s).
Oct 27 14:12:06 Archange-U43Jc kernel: [ 1878.366663] NVRM: None of the NVIDIA graphics adapters were initialized!

I'm using Kernel 3.1 and acpi_call 1.0.2 under Oneiric.

Dunno where the problem is, but it was working before (Natty with Kernel 3.0 and acpi_call 1.0.1-1).

andaag commented 12 years ago

Has anyone found any followups to this issue? Or fixes?

In either ironhide or the bumblebee forks?

Lekensteyn commented 12 years ago

Yesterday I found a possible cause: a corrupted PCI configuration space. If you're interested, you can read more at https://github.com/Bumblebee-Project/Bumblebee/wiki/ACPI-for-Developers

I'm currently using the xorg-edgers/ppa with kernel 3.2 (from the same PPA) and nouveau on Kubuntu Oneiric 11.10 AMD64. It works perfectly, with no suspend issues at all. As for the PM feature, bbswitch works for me, but I'm still improving it by adding some safeguards (don't kill the card if a module is loaded) and fixing the overwrite issue on resume.
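
One way to check whether the configuration space changed across a disable/enable cycle is to snapshot it beforehand and compare afterwards. This is only a sketch under assumptions: it reads the standard sysfs `config` file of the device at 0000:01:00.0 (the address from the logs above), compares just the first 64 bytes (the standard PCI header), and the helper names `dump_header` and `config_changed` are made up here. It is not the method described on the wiki.

```shell
#!/bin/sh
# Sketch: detect changes in a PCI device's configuration header.
# PCI_CONFIG is overridable (assumption); the default uses the device
# address that appears in the logs of this thread.
PCI_CONFIG="${PCI_CONFIG:-/sys/bus/pci/devices/0000:01:00.0/config}"

dump_header() {
    # Print the first 64 bytes of file $1 as one hex string.
    od -An -v -tx1 -N64 "$1" | tr -d ' \n'
}

config_changed() {
    # $1 = snapshot saved before disabling the card
    # $2 = config file to compare against the snapshot
    # Succeeds (exit 0) when the header differs from the snapshot.
    [ "$(cat "$1")" != "$(dump_header "$2")" ]
}
```

A typical run would save `dump_header "$PCI_CONFIG"` to a file before turning the card off, then call `config_changed` with that file and `"$PCI_CONFIG"` after turning it back on.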

andaag commented 12 years ago

Thanks! I was not familiar with bbswitch, I'll give that a try.

I'm currently fairly happy with the nvidia card completely off, and using the intel for gpu tasks. However, after reboots I'm often in a scenario where I can't turn the card off, BUT it's still generating heat. So I gotta do a hard shutdown, turn it back on, and then disable the card to save battery and reduce heat.

It'd be great to be able to turn the card off AND on again though. Right now I'd rather save the battery and keep it always off than always on. Does bbswitch allow you to disable and enable the card?

Lekensteyn commented 12 years ago

bbswitch is in an early development stage and not suitable for all Optimus models. Please join #bumblebee on Freenode if you want to try it. It does allow you to enable/disable the card (it works for me at least).

andaag commented 12 years ago

Brilliant! It just works, been testing it a bit now. And it reduces the heat on my system as much as acpi_call did :)

Lekensteyn commented 12 years ago

It works, but it's not perfect yet because of the aforementioned issues: after suspend the PCI config space is messed up if the card was off (this is also true for acpi_call) and there is no check whether it's safe to disable or not (is a driver loaded?). Follow me on twitter or watch the repo for changes. Again, the tool is in an early development stage ;)

andaag commented 12 years ago

I put a check in /etc/pm/sleep.d/nvidia to enable the card on suspend and disable it again on resume if the nvidia driver isn't loaded (I don't use nouveau) :)

And I've followed both the git page and your twitter account, I'll be keeping an eye on this for sure. It's already a HUGE improvement over what I had before!
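
A hook along these lines might look like the sketch below. It is not andaag's actual script. It assumes pm-utils passes the phase (suspend, hibernate, resume, thaw) as the first argument, and that bbswitch is controlled by writing ON or OFF to /proc/acpi/bbswitch. The function names and the overridable `BBSWITCH` and `MODULES` variables are illustrative.

```shell
#!/bin/sh
# Hypothetical /etc/pm/sleep.d/nvidia hook (sketch, not the real script).
# BBSWITCH and MODULES are overridable (assumption) for dry runs.
BBSWITCH="${BBSWITCH:-/proc/acpi/bbswitch}"
MODULES="${MODULES:-/proc/modules}"

nvidia_loaded() {
    # True when the nvidia kernel module is listed as loaded.
    grep -q '^nvidia ' "$MODULES"
}

handle_phase() {
    case "$1" in
        suspend|hibernate)
            # Power the card on before the machine goes to sleep.
            echo ON > "$BBSWITCH"
            ;;
        resume|thaw)
            # Power it off again only if no nvidia driver is using it.
            nvidia_loaded || echo OFF > "$BBSWITCH"
            ;;
    esac
}

# pm-utils supplies the phase as $1; do nothing when run without one.
if [ -n "${1:-}" ]; then
    handle_phase "$1"
fi
```

Checking the module list before writing OFF is the same kind of safeguard Lekensteyn mentions wanting in bbswitch itself (is a driver loaded?).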

Lekensteyn commented 12 years ago

Conditions are known and related to the PCI config space (see wiki). Resolved with new PM method in BB3.

Bumblebee-Project / Bumblebee-old

Bug when modprobing nvidia in some conditions #133