Memory/data corruption / crash on Lenovo T440p (GT 730M).

leoluk commented 10 years ago

I just installed bbswitch on the newly-released Thinkpad T440p.

Loading the bbswitch module and disabling the card works perfectly fine:

[  142.881587] bbswitch: version 0.7 
[  142.881593] bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.VID_
[  142.881596] bbswitch: Found discrete VGA device 0000:02:00.0: \_SB_.PCI0.PEG_.VID_
[Package] (20130517/nsarguments-95)  
[  142.882097] bbswitch: detected an Optimus _DSM function
[  142.882106] pci 0000:02:00.0: enabling device (0004 -> 0007)
[  142.882127] bbswitch: Succesfully loaded. Discrete card 0000:02:00.0 is on
[  156.250136] bbswitch: disabling discrete graphics
[Package] (20130517/nsarguments-95)  
[  156.265409] thinkpad_acpi: EC reports that Thermal Table has changed
[  156.376985] pci 0000:02:00.0: power state changed by ACPI to D3cold

But enabling it again seems to mess up the PCI bus, crash the network adapters, and cause data corruption (files read from the disk contain random characters, filesystem errors, some files are missing, empty or filled with random data after a reboot).

[  160.406244] bbswitch: enabling discrete graphics
[  160.647323] pci 0000:02:00.0: power state changed by ACPI to D0 
[  160.647336] thinkpad_acpi: EC reports that Thermal Table has changed

There are no bbswitch errors, but shortly after entering the command, the syslog fills with various kernel messages related to internal devices no longer responding. The filesystem sometimes remounts as read-only, and the system becomes unusable and has to be reset.

At this point, the machine is unable to write files to the disk or a USB stick or communicate with the network, so I made some "screen shots" using my smartphone. I tried to redirect the syslog to another machine using the internal network as well as a USB WLAN adapter, but the data cuts off as soon as the graphics card is enabled.

Installing the other bumblebee components or the Nvidia/Nouveau drivers does not make any difference.

leoluk commented 10 years ago

Kernel version:

Linux 3.12.0-031200-generic #201311031935 SMP Mon Nov 4 00:36:54 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

OS: Linux Mint 16, Mainline kernel, happens with default (Ubuntu patched) 3.11 kernel and other distributions as well.

The system crashes at this line: https://github.com/Bumblebee-Project/bbswitch/blob/master/bbswitch.c#L289

If the bbswitch module is loaded during suspend, it crashes right away (obviously, as bbswitch enables the card). If the module is unloaded, it crashes on resume (probably because the kernel sets the power state).

cat /proc/acpi/dump_info:

0000:00:00.0 060000 
0000:00:01.0 060400 \_SB_.PCI0.PEG0
0000:00:01.1 060400 \_SB_.PCI0.PEG_
0000:00:02.0 030000 \_SB_.PCI0.VID_
0000:00:03.0 040300 \_SB_.PCI0.B0D3
0000:00:14.0 0c0330 \_SB_.PCI0.XHCI
0000:00:16.0 078000 
0000:00:19.0 020000 \_SB_.PCI0.IGBE
0000:00:1a.0 0c0320 \_SB_.PCI0.EHC2
0000:00:1b.0 040300 \_SB_.PCI0.HDEF
0000:00:1c.0 060400 \_SB_.PCI0.EXP1
0000:00:1c.1 060400 \_SB_.PCI0.EXP2
0000:00:1d.0 0c0320 \_SB_.PCI0.EHC1
0000:00:1f.0 060100 \_SB_.PCI0.LPC_
0000:00:1f.2 010601 \_SB_.PCI0.SAT1
0000:00:1f.3 0c0500 \_SB_.PCI0.SMBU
0000:02:00.0 030000 \_SB_.PCI0.PEG_.VID_
0000:03:00.0 ff0000 
0000:04:00.0 028000

Launchpad gives me a timeout, here's the ACPI debug tarball:

http://media.leoluk.de/LENOVO-20AWS02A00.tar.gz

# echo "\_SB.PCI0.PEG.VID.ISOP" > /proc/acpi/call 
# cat /proc/acpi/call 
0xffffffff
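
The write-then-read pattern above can be wrapped in a small helper. This is only a minimal sketch, assuming the acpi_call module is loaded and exposes /proc/acpi/call; the function name `acpi_call` and the `ACPI_CALL_FILE` override (there only so the helper can be tried against a scratch file) are illustrative and not part of the module.

```shell
#!/bin/sh
# Path of the acpi_call interface; override ACPI_CALL_FILE to try this on a scratch file.
ACPI_CALL_FILE="${ACPI_CALL_FILE:-/proc/acpi/call}"

# acpi_call METHOD [ARGS...]: write the call to the interface, then print the result.
acpi_call() {
    if [ ! -w "$ACPI_CALL_FILE" ]; then
        echo "acpi_call: $ACPI_CALL_FILE is not writable (module loaded? root?)" >&2
        return 1
    fi
    # printf '%s' passes the backslash in ACPI paths through unmodified.
    printf '%s\n' "$*" > "$ACPI_CALL_FILE" || return 1
    cat "$ACPI_CALL_FILE"
}

# Example (requires root and the acpi_call module):
#   acpi_call '\_SB.PCI0.PEG.VID.ISOP'
```

Checking for a writable interface first avoids a confusing shell redirection error when the module is missing.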

leoluk commented 10 years ago

Interesting ACPI methods for \_SB.PCI0.PEG.VID. Calling PSOF 0 does not seem to switch it off, unfortunately.

For some reason, \WIN8 is 0x1 even if acpi_osi=Linux.

                    Method (_PS0, 0, NotSerialized)
                    {
                        If (LNot (VMSH))
                        {
                            GPON (0x00)
                        }
                    }

                    Method (_PS1, 0, NotSerialized)
                    {
                        Noop
                    }

                    Method (_PS2, 0, NotSerialized)
                    {
                        Noop
                    }

                    Method (_PS3, 0, NotSerialized)
                    {
                        If (LNot (VMSH))
                        {
                            GPOF (0x00)
                        }
                    }

                    Method (GPON, 1, NotSerialized)
                    {
                        If (ISOP ())
                        {
                            If (DGOS)
                            {
                                \VHYB (0x02, 0x00)
                                Sleep (0x64)
                                If (LEqual (ToInteger (Arg0), 0x00)) {}
                                \VHYB (0x00, 0x01)
                                Sleep (0x64)
                                \VHYB (0x02, 0x01)
                                Sleep (0x01)
                                \VHYB (0x08, 0x01)
                                Store (0x0A, Local0)
                                Store (0x32, Local1)
                                While (Local1)
                                {
                                    Sleep (Local0)
                                    If (\LCHK (0x01))
                                    {
                                        Break
                                    }

                                    Decrement (Local1)
                                }

                                \VHYB (0x08, 0x03)
                                \VHYB (0x04, 0x00)
                                \SWTT (0x01)
                                Store (Zero, DGOS)
                            }
                            Else
                            {
                                If (LAnd (LNotEqual (VSID, 0x220F17AA), LNotEqual (VSID, 0x221D17AA)))
                                {
                                    \VHYB (0x04, 0x00)
                                }
                            }

                            \VHYB (0x09, \_SB.PCI0.PEG.VID.HDAS)
                        }
                        Else
                        {
                            Store (0x220E17AA, VIDS)
                        }
                    }

                    Method (GPOF, 1, NotSerialized)
                    {
                        If (ISOP ())
                        {
                            If (LOr (VMSH, LEqual (\_SB.PCI0.PEG.VID.OMPR, 0x03)))
                            {
                                \SWTT (0x00)
                                \VHYB (0x08, 0x00)
                                Store (0x0A, Local0)
                                Store (0x32, Local1)
                                While (Local1)
                                {
                                    Sleep (Local0)
                                    If (\LCHK (0x00))
                                    {
                                        Break
                                    }

                                    Decrement (Local1)
                                }

                                \VHYB (0x08, 0x02)
                                \VHYB (0x02, 0x00)
                                Sleep (0x64)
                                \VHYB (0x00, 0x00)
                                If (LEqual (ToInteger (Arg0), 0x00)) {}
                                Store (One, DGOS)
                                Store (0x02, \_SB.PCI0.PEG.VID.OMPR)
                            }
                        }
                    }

                    Method (_STA, 0, NotSerialized)
                    {
                        Return (0x0F)
                    }

                    Method (_DSM, 4, NotSerialized)
                    {
                        If (\CMPB (Arg0, Buffer (0x10)
                                {
                                    /* 0000 */    0xF8, 0xD8, 0x86, 0xA4, 0xDA, 0x0B, 0x1B, 0x47,
                                    /* 0008 */    0xA7, 0x2B, 0x60, 0x42, 0xA6, 0xB5, 0xBE, 0xE0
                                }))
                        {
                            Return (NVOP (Arg0, Arg1, Arg2, Arg3))
                        }

                        If (\CMPB (Arg0, Buffer (0x10)
                                {
                                    /* 0000 */    0x01, 0x2D, 0x13, 0xA3, 0xDA, 0x8C, 0xBA, 0x49,
                                    /* 0008 */    0xA5, 0x2E, 0xBC, 0x9D, 0x46, 0xDF, 0x6B, 0x81
                                }))
                        {
                            Return (NVPS (Arg0, Arg1, Arg2, Arg3))
                        }

                        If (\WIN8)
                        {
                            If (\CMPB (Arg0, Buffer (0x10)
                                    {
                                        /* 0000 */    0x75, 0x0B, 0xA5, 0xD4, 0xC7, 0x65, 0xF7, 0x46,
                                        /* 0008 */    0xBF, 0xB7, 0x41, 0x51, 0x4C, 0xEA, 0x02, 0x44
                                    }))
                            {
                                Return (NBCI (Arg0, Arg1, Arg2, Arg3))
                            }
                        }

                        Return (Buffer (0x04)
                        {
                            0x01, 0x00, 0x00, 0x80
                        })
                    }
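
Both GPON and GPOF wait on \LCHK with the same bounded polling loop: Local0 (0x0A) is the Sleep interval in milliseconds and Local1 (0x32) is the retry budget. As a quick sanity check of the worst case, assuming \LCHK never reports the expected state and ignoring the time spent inside the method calls themselves:

```shell
#!/bin/sh
# Values taken from the GPON/GPOF listing above.
interval_ms=$((0x0A))   # Sleep (Local0): 10 ms per iteration
retries=$((0x32))       # Local1: 50 iterations before giving up

# Upper bound on the time spent polling \LCHK.
max_wait_ms=$((interval_ms * retries))
echo "poll interval: ${interval_ms} ms, retries: ${retries}, worst case: ${max_wait_ms} ms"
```

So each direction waits at most about half a second, and the listing has no error path if that budget runs out: execution simply continues with the next \VHYB call.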

Lekensteyn commented 10 years ago

Linux always tries to report compatibility with Windows (such as \WIN8) because BIOS vendors write code that assumes anything other than Windows is broken or outdated. The symptoms you described sound like a power shortage. Could you observe similar problems when using the nouveau driver with dynamic power management enabled?

leoluk commented 10 years ago

No, I haven't installed either driver. How do I enable nouveau's dynamic power management?

Lekensteyn commented 10 years ago

Can you post your dmesg somewhere? Unless you blacklisted it, nouveau will get loaded (bumblebee does unload it before using bbswitch, so be sure to disable bumblebeed too). To enable dynamic PM, you can write to sysfs or use powertop to enable Runtime PM at tunables.
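
For the sysfs route, the per-device runtime PM switch is the `power/control` file, which accepts `auto` or `on`. A minimal sketch follows; the helper name `set_runtime_pm` and the `SYSFS_ROOT` override are made up for illustration, and the device address is the discrete card's address from the logs in this thread.

```shell
#!/bin/sh
# Root of sysfs; override SYSFS_ROOT to try the helper against a scratch directory.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# set_runtime_pm PCI_ADDRESS MODE
#   MODE "auto" lets the kernel suspend the device when idle; "on" keeps it powered.
set_runtime_pm() {
    dev=$1
    mode=$2
    case $mode in
        auto|on) ;;
        *) echo "set_runtime_pm: mode must be 'auto' or 'on'" >&2; return 2 ;;
    esac
    ctl="$SYSFS_ROOT/bus/pci/devices/$dev/power/control"
    if [ ! -e "$ctl" ]; then
        echo "set_runtime_pm: $ctl not found" >&2
        return 1
    fi
    printf '%s\n' "$mode" > "$ctl"
}

# Example (as root):
#   set_runtime_pm 0000:02:00.0 auto
```

Given the crashes described in this issue, it is safer to try this from a live image than from an installed system.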

leoluk commented 10 years ago

This is what happened after loading the nouveau module:

[ 3144.325079] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Integer], ACPI requires [Package] (20130725/nsarguments-95)
[ 3144.325160] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
[ 3144.325346] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
[ 3144.325522] pci 0000:02:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported
[ 3144.325524] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG_.VID_ handle
[ 3144.325546] nouveau 0000:02:00.0: enabling device (0004 -> 0007)
[ 3144.325660] [drm] hdmi device  not found 2 0 1
[ 3144.325767] nouveau E[  DEVICE][0000:02:00.0] unknown chipset, 0x108100a1
[ 3144.325769] nouveau E[     DRM] failed to create 0x80000080, -22
[ 3144.325856] nouveau: probe of 0000:02:00.0 failed with error -22

leoluk commented 10 years ago

I tried disabling/enabling the card after loading the nouveau module and enabling runtime PM, but it still crashed the system.

leoluk commented 10 years ago

Switching it off using acpi_call works fine (enabling still crashes the system):

# echo "\_SB.PCI0.PEG.VID._DSM {0xF8,0xD8,0x86,0xA4,0xDA,0x0B,0x1B,0x47,0xA7,0x2B,0x60,0x42,0xA6,0xB5,0xBE,0xE0} 0x100 0x1A {0x1,0x0,0x0,0x3}" > /proc/acpi/call ; cat /proc/acpi/call 
{0x59, 0x00, 0x00, 0x11}

# echo "\_SB.PCI0.PEG.VID._PS3" > /proc/acpi/call ; cat /proc/acpi/call 
0x2called

leoluk commented 10 years ago

Is there any way to prevent the card from being enabled / keep the system from crashing after a resume?

Lekensteyn commented 10 years ago

acpi_call does not work well with resume. You could try commenting out some problematic parts in bbswitch such that s/r still works.

leoluk commented 10 years ago

I tried this, but it did not work. As soon as the _PS0 function is called, the system crashes. Commenting out the ACPI calls in bbswitch_on (or even the entire function), the PM handlers or even unloading the module before suspending the system did not help (it still crashes on resume). If I don't disable the PM handler, it crashes before it suspends.

I installed Windows and tried enabling/disabling the card, which worked fine, so apparently it's doing something different.

xqms commented 10 years ago

I'm also observing this on my T440p (with the newest BIOS 1.17) with Ubuntu 13.10. As soon as I do

echo ON > /proc/acpi/bbswitch

I get weird memory corruption issues. It's apparently not a driver issue (nouveau/nvidia), as the driver is not even loaded when I do the switch.

I already use

acpi_osi="!Windows 2012"

for other reasons (backlight control), but it does not help.

xqms commented 10 years ago

By the way, I get some ACPI warnings during the first module load:

[    5.100039] bbswitch: version 0.7
[    5.100042] bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.VID_
[    5.100046] bbswitch: Found discrete VGA device 0000:02:00.0: \_SB_.PCI0.PEG_.VID_
[    5.100052] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
[    5.100321] bbswitch: detected an Optimus _DSM function
[    5.100332] pci 0000:02:00.0: enabling device (0004 -> 0007)
[    5.100359] bbswitch: Succesfully loaded. Discrete card 0000:02:00.0 is on
[    5.101973] bbswitch: disabling discrete graphics
[    5.101980] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)

Lekensteyn commented 10 years ago

Those warnings are harmless, see https://github.com/Bumblebee-Project/bbswitch/commit/ee0591b.

The memory corruption, etc. sound (as I said before) like issues related to power. @leoluk You stated that you have tried nouveau. In what way did you disable the nvidia card? Did you just enable runtime PM and then wait for it to kick in? Or did you write to the vgaswitcheroo file in debugfs?

leoluk commented 10 years ago

The /sys/kernel/debug/vgaswitcheroo directory was missing, so I just tried all the methods I already knew, just with nouveau loaded and runtime PM enabled. But I just realized that's probably not what you were thinking about.

leoluk commented 10 years ago

How to reliably reproduce the problem:

Boot from any sufficiently recent Linux live image (I used Linux Mint 16 x64, kernel 3.11.0-12-generic). Don't use your existing installation (if you have any), because you'd risk messing it up.

Open a terminal and run this:

wget https://github.com/Bumblebee-Project/bbswitch/archive/master.zip; unzip master.zip; cd bbswitch-master; make; sudo make load; sudo tee /proc/acpi/bbswitch <<<OFF

Your kernel log (type dmesg) should show a message like this one: [ 4550.007526] bbswitch: Succesfully loaded. Discrete card 0000:02:00.0 is off
Now run sudo tee /proc/acpi/bbswitch <<<ON. If you just lost all network connectivity and the system gradually stops working, your device is affected as well.
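
To check the state without reading the whole log by eye, the bbswitch messages can be filtered. This is a rough sketch that relies only on the message texts quoted in this thread; the function name is made up, and the "disabling"/"enabling" lines are treated as the requested state, not as confirmation that the switch succeeded.

```shell
#!/bin/sh
# bbswitch_state_from_log: read kernel log lines on stdin and print the last
# state ("on" or "off") that bbswitch reported or requested.
# Exits non-zero if no bbswitch state message was seen.
bbswitch_state_from_log() {
    awk '
        /bbswitch: .*Discrete card .* is on *$/   { state = "on" }
        /bbswitch: .*Discrete card .* is off *$/  { state = "off" }
        /bbswitch: disabling discrete graphics/   { state = "off" }
        /bbswitch: enabling discrete graphics/    { state = "on" }
        END { if (state == "") exit 1; print state }
    '
}

# Example:
#   dmesg | bbswitch_state_from_log
```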

Maybe this is useful for gathering more information (all T440p affected? only a small subset? just the latest BIOS version?).

leoluk commented 10 years ago

Partial success! I recompiled the mainline kernel with a modified DSDT table and commented out the entire \_SB.PCI0.PEG.VID.GPON method (which is called from _PS0 and the NVP3 power resource), preventing bbswitch, the kernel or anything else from powering on the card again. Powering down the card and subsequent suspend/resume is now working, Bumblebee obviously isn't.

The kernel understands that the power state change fails:

[ 408.603544] pci 0000:02:00.0: Refused to change power state, currently in D3
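
Anyone repeating the DSDT experiment can look for the same message to confirm that the power-on attempt was blocked. A small sketch, assuming the message keeps the `pci <address>:` prefix shown above (the function name is illustrative):

```shell
#!/bin/sh
# refused_power_changes: read kernel log lines on stdin and print
# "<pci address> <current state>" for each refused power state change.
refused_power_changes() {
    awk '/Refused to change power state/ {
        for (i = 1; i < NF; i++) {
            if ($i == "pci") {
                addr = $(i + 1)
                sub(/:$/, "", addr)   # drop the trailing colon after the address
                print addr, $NF       # last field is the state the device stayed in
            }
        }
    }'
}

# Example:
#   dmesg | refused_power_changes
```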

Now I'll just have to figure out why powering on the card is crashing the system. Any ideas? The Windows driver might use another mechanism to power down the card.

rkaw92 commented 10 years ago

Hi, this also occurs on my T440p. NVIDIA 730m, newest BIOS (did not test on the previous BIOS - it made the trackpoint completely unusable so I let it go before I even installed Linux). Until finding this bug report, I was wondering where the memory corruption was coming from...

Let me know if I can provide any additional information. Keep in mind that I'm not a hardware expert (and DSDT hacking is a thing I've never touched).

Lekensteyn commented 10 years ago

I wonder if it has something to do with bbswitch reading from the PCI configuration space to determine whether a card is available or not.

@leoluk If vgaswitcheroo is not available, I still would like to know if nouveau runtime PM exposes the issues experienced here. Watch for the DSM warnings (hey, useful debugging tool now :-) ) to see when the card gets disabled.

leoluk commented 10 years ago

@Lekensteyn If that helps: the problem can be triggered using only acpi_call, which does not seem to do anything related to the PCI configuration space. I'm not familiar with nouveau's runtime PM/Optimus support, but what I gathered from reading through the source is that it should automatically disable the card if it's idle, right? So I just load the module, enable automatic runtime PM, and wait? How do I enable it afterwards? By using PRIME, or by setting PM to "on"?

@rkaw92 If you're adventurous and submit your machine information before (see below), you could try the modified kernel and check if it prevents the memory corruption.

https://github.com/Bumblebee-Project/bbswitch#reporting-bugs

xqms commented 10 years ago

I just tried out nouveau on vanilla kernel 3.13.0-rc3 as the GT730M is not supported by nouveau in the stable kernel. Here is dmesg after it has loaded: https://gist.github.com/x-quadraht/7902666

Shortly after that the system crashed again with the known symptoms. I guess nouveau decided that the card was not in use and tried to disable it. After the crash I briefly saw two of the ACPI warnings, but dmesg was quickly flooded by the memory errors. I will try to capture that more precisely.

xqms commented 10 years ago

Here is a more complete log of the crash using nouveau: https://gist.github.com/x-quadraht/7902930

You can see virtuoso dying, as a first casualty of the memory corruption...

rkaw92 commented 10 years ago

Sorry, I haven't had the time to properly survey my system nor try the DSDT override. As a temporary workaround, I have decided to install acpi_call, which seems to work, provided that you never really need to power on the NVIDIA GPU. Blacklisted nouveau and uninstalled NVIDIA proprietary drivers, too.

echo "\_SB.PCI0.PEG.VID._DSM {0xF8,0xD8,0x86,0xA4,0xDA,0x0B,0x1B,0x47,0xA7,0x2B,0x60,0x42,0xA6,0xB5,0xBE,0xE0} 0x100 0x1A {0x1,0x0,0x0,0x3}" >/proc/acpi/call
echo "\_SB.PCI0.PEG.VID.GPOF" >/proc/acpi/call

leoluk commented 10 years ago

@rkaw92 That's what I tried first (actually, I disabled it with bbswitch and then unloaded the module), but the kernel enabled the card after resuming from standby and it still crashed the system, which is why I patched my DSDT table. Could you check whether standby works for you using the manual ACPI calls?

rkaw92 commented 10 years ago

Two observations: A) Indeed, unloading the bbswitch module does not fix the problem - the kernel still (supposedly) attempts to re-enable the card at resume. I am not sure if the module gets re-loaded somehow (could not verify - dmesg, lsmod and friends got overwritten as soon as I resumed). B) Using just acpi_call with the methods outlined above, I am able to suspend/resume without issues. Thus, it seems to be different from using bbswitch. This is my temporary solution, which I've been using since yesterday with full success (for a workaround) and no discernible side effects apart from complete NVIDIA disablement.

leoluk commented 10 years ago

Interesting observation, this means that the DSDT override is not even necessary (but useful if you want to make sure that the card stays disabled). Possible explanation: bbswitch calls pci_save_state before it disables the device. On suspend, the card is already disabled so the state is not saved again. On resume, the kernel restores the previously saved state and enables the device.

leoluk commented 10 years ago

I ended up returning my T440p for unrelated reasons (fan noise, broken wifi), so unfortunately I can no longer contribute to this bug report by debugging it.

A few suggestions:

apparently, there's someone on the Arch forums who has a T440p with an older BIOS revision where Bumblebee/bbswitch works - maybe compare the ACPI tables?
the AMLI debug extension for the Windows kernel debugger could be used to trace the methods called by the Nvidia driver (I can provide the checked acpi.sys for Windows 7 x64, if anyone is interested)

jhnphm commented 10 years ago

I'd be interested in taking a look at using the windows kernel debugger. I don't have any experience w/ Windows debugging though (although probably could figure it out from above documentation), and more problematically how to install the checked acpi.sys :|

seanvk commented 10 years ago

I can confirm the same filesystem corruption with bumblebee. I have the same T440p. I first noticed it right after installing Manjaro Linux, which installs with bbswitch enabled by default. I then switched to ArchLinux, installing from scratch and leaving out bbswitch, and saw no issues. I then installed bumblebee/bbswitch and within a few minutes got filesystem corruption. I then installed Fedora, which does not include it in the default install either, and it is likewise stable.

abbradar commented 10 years ago

Hello, I have this problem, too, and I've managed to install the Windows debuggers, symbols, checked acpi.sys and whatever else is needed for ACPI debugging in Windows (what a pain...). I don't have any experience with this, though, and my blocker now is that the ACPI event dump is too big (many pages even for a second or two), and when capturing it for, say, 30 seconds, WinDbg can't digest that much output at all. Maybe someone more familiar with ACPI debugging can give some advice? Some filter on the output, maybe?

JohnDoe42 commented 10 years ago

Same problem here. No issues with original BIOS but filesystem crashes with 1.17. Now trying BIOS 1.14 and will report.

JohnDoe42 commented 10 years ago

Got a crashed filesystem with 1.17. Flashed back to 1.14 and the same filesystem is working now. The ext4 errors do not seem to damage the physical volume. Until this issue is fixed, downgrading the BIOS to 1.14 helps: http://download.lenovo.com/ibmdl/pub/pc/pccbbs/mobiles/gluj04us.iso

leoluk commented 10 years ago

It would be interesting to compare the ACPI dump from BIOS 1.14 and 1.17.

In some cases, there were damaged files after multiple crashes on my machine.

JohnDoe42 commented 10 years ago

Complete ACPI Dump: http://pastebin.com/raw.php?i=EGzeEy69

ACPI summary from BIOS 1.14:

ACPI: RSDP 0xbcefe014 00024 (v02 LENOVO)
ACPI: RSDT 0xbcefe0d4 0008C (v01 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: XSDT 0xbcefe170 000F4 (v01 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: DSDT 0xbcee1000 11663 (v01 LENOVO TP-GL    00001140 INTL 20120711)
ACPI: FACS 0xbce4a000 00040
ACPI: FACP 0xbcef8000 0010C (v05 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: SLIC 0xbcefd000 00176 (v01 LENOVO TP-GL    00001140 PTEC 00000001)
ACPI: DBGP 0xbcefb000 00034 (v01 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: ECDT 0xbcefa000 00052 (v01 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: HPET 0xbcef7000 00038 (v01 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: APIC 0xbcef6000 00098 (v01 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: MCFG 0xbcef5000 0003C (v01 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: SSDT 0xbcef4000 00033 (v01 LENOVO TP-SSDT1 00000100 INTL 20120711)
ACPI: SSDT 0xbcef3000 0044F (v01 LENOVO TP-SSDT2 00000200 INTL 20120711)
ACPI: SSDT 0xbcee0000 00B75 (v01 LENOVO SataAhci 00001000 INTL 20120711)
ACPI: SSDT 0xbcedf000 0076F (v01 LENOVO  Cpu0Ist 00003000 INTL 20120711)
ACPI: SSDT 0xbcede000 00AD8 (v01 LENOVO    CpuPm 00003000 INTL 20120711)
ACPI: SSDT 0xbcedc000 01215 (v01 LENOVO  SaSsdt  00003000 INTL 20120711)
ACPI: SSDT 0xbcedb000 00379 (v01 LENOVO CppcTabl 00001000 INTL 20120711)
ACPI: PCCT 0xbceda000 0006E (v05 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: SSDT 0xbced9000 00AC4 (v01 LENOVO Cpc_Tabl 00001000 INTL 20120711)
ACPI: TCPA 0xbced8000 00032 (v02    PTL   LENOVO 06040000 LNVO 00000001)
ACPI: UEFI 0xbced7000 00042 (v01 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: POAT 0xbcdb2000 00055 (v03 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: ASF! 0xbcefc000 000A5 (v32 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: BATB 0xbced6000 00046 (v01 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: FPDT 0xbced5000 00064 (v01 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: UEFI 0xbced4000 002E2 (v01 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: SSDT 0xbced3000 0047F (v01 LENOVO IsctTabl 00001000 INTL 20120711)
ACPI: BGRT 0xbced2000 00038 (v01 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: DMAR 0xbced1000 000B8 (v01 LENOVO TP-GL    00001140 PTEC 00000002)
ACPI: SSDT (nil) 00436 (v01  PmRef  Cpu0Cst 00003001 INTL 20120711)
ACPI: SSDT (nil) 005AA (v01  PmRef    ApIst 00003000 INTL 20120711)
ACPI: SSDT (nil) 00119 (v01  PmRef    ApCst 00003000 INTL 20120711)

leoluk commented 10 years ago

Could you upload a debug tarball as explained here?

That way, I could directly compare it with the one I created using BIOS 1.17.

@abbradar: You could try to trace the Optimus _DSM method and look at the parameters given by the Nvidia driver while enabling and disabling the card. Unfortunately, I never did this before and without a working test environment I can't really provide you with any instructions on how to do this.

JohnDoe42 commented 10 years ago

Tar file uploaded here: http://www.file-upload.net/download-8486594/LENOVO-20AWS02A00.tar.gz.html

The acpi dump_info:

0000:00:00.0 060000 
0000:00:01.0 060400 \_SB_.PCI0.PEG0
0000:00:01.1 060400 \_SB_.PCI0.PEG_
0000:00:02.0 030000 \_SB_.PCI0.VID_
0000:00:03.0 040300 \_SB_.PCI0.B0D3
0000:00:14.0 0c0330 \_SB_.PCI0.XHCI
0000:00:16.0 078000 
0000:00:19.0 020000 \_SB_.PCI0.IGBE
0000:00:1a.0 0c0320 \_SB_.PCI0.EHC2
0000:00:1b.0 040300 \_SB_.PCI0.HDEF
0000:00:1c.0 060400 \_SB_.PCI0.EXP1
0000:00:1c.1 060400 \_SB_.PCI0.EXP2
0000:00:1d.0 0c0320 \_SB_.PCI0.EHC1
0000:00:1f.0 060100 \_SB_.PCI0.LPC_
0000:00:1f.2 010601 \_SB_.PCI0.SAT1
0000:00:1f.3 0c0500 \_SB_.PCI0.SMBU
0000:02:00.0 030000 \_SB_.PCI0.PEG_.VID_
0000:03:00.0 ff0000 
0000:04:00.0 028000

abbradar commented 10 years ago

I've got my feet wet in Windows ACPI debugging (what a pain, really...) and traced the nvidia ACPI calls, but they really do not differ much from the bbswitch/acpi_call sequence:

Calls:
Arg0 = dsm_optimus
Arg1 = 0x100
Arg2 = 0x1b
Arg3 = {0, 0, 0, 0}

Arg0 = dsm_optimus
Arg1 = 0x100
Arg2 = 0x1a
Arg3 = {1, 0, 0, 3}

Disable:
AMLI: FFFFE0000B9FC040: EvalNameSpaceObject(\_SB.PCI0.PEG.VID._DSM) // 0x1b, {0, 0, 0, 0}
AMLI: FFFFE000002FB340: EvalNameSpaceObject(\_SB.PCI0.PEG.VID._DSM) // 0x1a, {1, 0, 0, 3}
AMLI: FFFFE000002FB340: AsyncEvalObject(\_SB.PCI0.PEG.VID._STA)
AMLI: FFFFE00000364680: AsyncEvalObject(\_SB.PCI0.PEG.VID._PS3)
AMLI: FFFFE00005156880: EvalNameSpaceObject(\_SB.PCI0.LPC.EC.HKEY.MHKP)
AMLI: FFFFE0000868B080: EvalNameSpaceObject(\_SB.PCI0.LPC.EC.HKEY.MHKV)
AMLI: FFFFE00000364680: AsyncEvalObject(\_SB.PCI0.PEG.VID._STA)

Enable:
AMLI: FFFFE000002FB340: AsyncEvalObject(\_SB.PCI0.PEG.VID._STA)
AMLI: FFFFE00000364680: AsyncEvalObject(\_SB.PCI0.PEG.VID._PS0)
AMLI: FFFFE00005156880: EvalNameSpaceObject(\_SB.PCI0.LPC.EC.HKEY.MHKP)
AMLI: FFFFE00000364680: AsyncEvalObject(\_SB.PCI0.PEG.VID._STA)
AMLI: FFFFE0000868B080: EvalNameSpaceObject(\_SB.PCI0.LPC.EC.HKEY.MHKV)
AMLI: FFFFE0000B9FC040: EvalNameSpaceObject(\_SB.PCI0.PEG.VID._DSM) // 0x1b, {0, 0, 0, 0}

, where 0x1b is some kind of status call AFAIU. I've tried to reproduce exactly the same sequence with acpi_call and it corrupts memory anyway; I've also tried unloading nvidia, nouveau, i915 and setting acpi_osi to "Windows 2013" and "!Windows 2013" (what's the difference?) to no avail. Any other ideas?

JohnDoe42 commented 10 years ago

Has anyone tried a filesystem other than ext4? Maybe it's a kernel-related issue?

jhnphm commented 10 years ago

It happens on btrfs too, and affects more than the filesystem.

abbradar commented 10 years ago

I think that maybe nvidia initializes itself in some way that differs from Linux. I've made an ACPI trace of the whole Windows boot; it might be interesting. http://pastebin.com/JvWTD3Fe These particular calls caught my eye:

...
AMLI: FFFFE00006B4D880: EvalNameSpaceObject(\_SB.PCI0.PEG.VID._DSM)
AMLI: FFFFE00006B4D880: EvalNameSpaceObject(\_SB.PCI0.PEG.VID._DSM)
String(:Str="------- NBCI DSM --------")
AMLI: FFFFE00006B4D880: EvalNameSpaceObject(\_SB.PCI0.PEG.VID._DSM)
String(:Str="------- NV OPTIMUS DSM --------")
AMLI: FFFFE00006B4D880: EvalNameSpaceObject(\_SB.PCI0.PEG.VID._DSM)
AMLI: FFFFE00006B4D880: EvalNameSpaceObject(\_SB.PCI0.PEG.VID._DSM)
String(:Str="------- NV GPS DSM --------")
AMLI: FFFFE00006B4D880: EvalNameSpaceObject(\_SB.PCI0.PEG.VID._DSM)
AMLI: FFFFE00006B4D880: EvalNameSpaceObject(\_SB.PCI0.PEG.VID._DSM)
AMLI: FFFFE00006B4D880: EvalNameSpaceObject(\_SB.PCI0.PEG.VID._ROM)
...

abbradar commented 10 years ago

I've set breakpoints at _DSM and printed arguments for all calls. There is quite a bit, actually, and with various arguments; just grep this by _DSM and look for calls after which OPTIMUS something is printed. Going to sleep now, maybe someone will try calling some of them with acpi_call and then trying to disable/enable card? http://pastebin.com/M149gYM0 P.S.: I don't think so for Optimus, but it might be dangerous to call random ACPI functions; you may risk breaking your hardware!

leoluk commented 10 years ago

@JohnDoe42 The filesystem corruption you're seeing is just one of many issues. The network cards become unresponsive as well, for example. I compared your ACPI dump and mine, and there don't seem to be any significant changes to the methods related to Optimus.

@abbradar Nice! This is very interesting. As far as I know, this is the first time we're looking at what the Windows driver is actually doing.

\_SB.PCI0.PEG.VID._DSM is passing its arguments to either NVOP (Optimus), NVPS (GPS?) or NBCI (unknown, only used by Win8) depending on the MUID in the first argument. The interesting stuff happens in these functions.

 Method (_DSM, 4, NotSerialized)  // _DSM: Device-Specific Method
 {
     If (\CMPB (Arg0, Buffer (0x10)
             {
                 /* 0000 */   0xF8, 0xD8, 0x86, 0xA4, 0xDA, 0x0B, 0x1B, 0x47,
                 /* 0008 */   0xA7, 0x2B, 0x60, 0x42, 0xA6, 0xB5, 0xBE, 0xE0
             }))
     {
         Return (NVOP (Arg0, Arg1, Arg2, Arg3)) 
     }

     If (\CMPB (Arg0, Buffer (0x10)
             {
                 /* 0000 */   0x01, 0x2D, 0x13, 0xA3, 0xDA, 0x8C, 0xBA, 0x49,
                 /* 0008 */   0xA5, 0x2E, 0xBC, 0x9D, 0x46, 0xDF, 0x6B, 0x81
             }))
     {
         Return (NVPS (Arg0, Arg1, Arg2, Arg3)) 
     }

     If (\WIN8)
     {
         If (\CMPB (Arg0, Buffer (0x10)
                 {
                     /* 0000 */   0x75, 0x0B, 0xA5, 0xD4, 0xC7, 0x65, 0xF7, 0x46,
                     /* 0008 */   0xBF, 0xB7, 0x41, 0x51, 0x4C, 0xEA, 0x02, 0x44
                 }))
         {
             Return (NBCI (Arg0, Arg1, Arg2, Arg3)) 
         }
     }

     Return (Buffer (0x04)
     {
          0x01, 0x00, 0x00, 0x80
     })
 }
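
The dispatch in the listing above amounts to a lookup on the 16 MUID bytes. The sketch below mirrors it outside of ASL, purely to make the mapping easier to check by hand; the function name `dsm_handler` and the "unsupported" label for the fallback buffer are my own, and the bytes are copied from the listing.

```shell
#!/bin/sh
# dsm_handler MUID [WIN8]
#   MUID: the 16 bytes of Arg0 as space-separated upper-case hex pairs.
#   WIN8: 1 if the firmware's \WIN8 flag is set (default 1, as observed in this thread).
# Prints the ASL method that _DSM forwards to, or "unsupported" for the fallback buffer.
dsm_handler() {
    muid=$1
    win8=${2:-1}
    case $muid in
        "F8 D8 86 A4 DA 0B 1B 47 A7 2B 60 42 A6 B5 BE E0") echo NVOP ;;
        "01 2D 13 A3 DA 8C BA 49 A5 2E BC 9D 46 DF 6B 81") echo NVPS ;;
        "75 0B A5 D4 C7 65 F7 46 BF B7 41 51 4C EA 02 44")
            # NBCI is only reachable when \WIN8 is non-zero.
            if [ "$win8" = 1 ]; then echo NBCI; else echo unsupported; fi ;;
        *) echo unsupported ;;
    esac
}

# The Optimus MUID resolves to NVOP.
dsm_handler "F8 D8 86 A4 DA 0B 1B 47 A7 2B 60 42 A6 B5 BE E0"
```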

The MUID for NVOP is used by bbswitch: https://github.com/Bumblebee-Project/bbswitch/blob/master/bbswitch.c#L64

bbswitch calls it with the parameters {MUID, 0x100 (revid), 0x1A (function), {1, 0, 0, 3} (args)} to disable it (https://github.com/Bumblebee-Project/bbswitch/blob/master/bbswitch.c#L106) and sets the power state to PS3 (which results in a GPOF call). For enabling, it just sets the PCI power state to PS0 (GPON call) without calling any DSM. In your dump, there are three different DSM calls with the Optimus MUID.

{MUID, 0x100, 0x10, {0, 0, 0x4B, 0x56}}
{MUID, 0x100, 0x1A, {1, 0, 0, 3}} (we know this one!)
{MUID, 0x100, 0x1B, {0, 0, 0, 0}}

We don't know what the other two are doing. This is how you'd manually call the first one:

echo "\_SB.PCI0.PEG.VID._DSM {0xF8,0xD8,0x86,0xA4,0xDA,0x0B,0x1B,0x47,0xA7,0x2B,0x60,0x42,0xA6,0xB5,0xBE,0xE0} 0x100 0x10 {0x0,0x0,0x4B,0x56}" >/proc/acpi/call
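
Since all three calls differ only in the function number and the argument buffer, the acpi_call strings can be generated instead of typed by hand. A minimal sketch (the helper name `optimus_dsm_call` is made up; the MUID and the 0x100 revision are the values quoted above). It only prints the strings and does not touch /proc/acpi/call:

```shell
#!/bin/sh
# Optimus MUID, copied from the _DSM listing and the acpi_call examples in this thread.
OPTIMUS_MUID='{0xF8,0xD8,0x86,0xA4,0xDA,0x0B,0x1B,0x47,0xA7,0x2B,0x60,0x42,0xA6,0xB5,0xBE,0xE0}'
DSM_PATH='\_SB.PCI0.PEG.VID._DSM'

# optimus_dsm_call FUNCTION ARGS: print the acpi_call string for one Optimus _DSM call.
#   FUNCTION: e.g. 0x1A; ARGS: four comma-separated bytes, e.g. 0x1,0x0,0x0,0x3
optimus_dsm_call() {
    printf '%s %s 0x100 %s {%s}\n' "$DSM_PATH" "$OPTIMUS_MUID" "$1" "$2"
}

# The three Optimus calls seen in the Windows trace (printed only, nothing is executed).
optimus_dsm_call 0x10 0x0,0x0,0x4B,0x56
optimus_dsm_call 0x1A 0x1,0x0,0x0,0x3
optimus_dsm_call 0x1B 0x0,0x0,0x0,0x0
```

Redirecting one of the printed lines into /proc/acpi/call would actually execute it, so the caution about calling arbitrary ACPI methods applies.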

I no longer have my device, so I can't try it. It should be safe to call these. It would also be interesting to switch the card on and off on Windows and directly correlate this to the resulting ACPI calls. I found a tool here which shows the GPU state.

Windows is able to enable/disable the card without unloading drivers or restarting applications, which is more advanced than what we're doing with Bumblebee right now (this is what is called the Optimus technology).

JohnDoe42 commented 10 years ago

New BIOS available: http://download.lenovo.com/ibmdl/pub/pc/pccbbs/mobiles/gluj07us.iso

Anyone dare to try?

leoluk commented 10 years ago

According to the BIOS changelog, there are no bug fixes except for an external card reader not being recognized. You can always go back to an earlier version (unless you enabled the rollback prevention in the BIOS), so it shouldn't hurt to try it. Lenovo probably does not know about the Optimus crashes anyway.

abbradar commented 10 years ago

Maybe someone will report this to Lenovo? I'm not at home now and can't do this, I'll continue trying ACPI magic later.

leoluk commented 10 years ago

@abbradar Reporting it to Lenovo probably means creating a new topic here.

(by the way… Github messed up your email comment, might want to edit it later ;) )

abbradar commented 10 years ago

I updated the BIOS; as anticipated, it hasn't changed anything with regard to this problem. I've posted this problem to Lenovo on the Lenovo forums.

abbradar commented 10 years ago

Also, I've tried to repeat init sequence with acpi_call (there are actually more various calls to Optimus than @leoluk listed), no luck so far. My calling sequences: init disable enable Maybe some ideas on improving them? All these have been made to mimic Windows nvidia driver calls (see debugger logs above).

Lekensteyn commented 10 years ago

The 0x1A method changes the power state, but it does not take care of HDMI. The 0x1B is used to query for supported capabilities.

You may also be interested in https://git.kernel.org/linus/5addcf0a5f0fadceba6bd562d0616a1c5d4c1a4d

abbradar commented 10 years ago

Can someone provide the original DSDT table from 1.14? I want to try loading it in the kernel with the 1.17 BIOS and see if it fixes the problem, to isolate the cause.

Bumblebee-Project / bbswitch

Memory/data corruption / crash on Lenovo T440p (GT 730M). #78