Bumblebee-Project / bbswitch

Disable discrete graphics (currently nvidia only)
GNU General Public License v2.0

Memory/data corruption / crash on Lenovo T440p (GT 730M). #78

Open leoluk opened 10 years ago

leoluk commented 10 years ago

I just installed bbswitch on the newly-released Thinkpad T440p.

Loading the bbswitch module and disabling the card works perfectly fine:

[  142.881587] bbswitch: version 0.7 
[  142.881593] bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.VID_
[  142.881596] bbswitch: Found discrete VGA device 0000:02:00.0: \_SB_.PCI0.PEG_.VID_
[Package] (20130517/nsarguments-95)  
[  142.882097] bbswitch: detected an Optimus _DSM function
[  142.882106] pci 0000:02:00.0: enabling device (0004 -> 0007)
[  142.882127] bbswitch: Succesfully loaded. Discrete card 0000:02:00.0 is on
[  156.250136] bbswitch: disabling discrete graphics
[Package] (20130517/nsarguments-95)  
[  156.265409] thinkpad_acpi: EC reports that Thermal Table has changed
[  156.376985] pci 0000:02:00.0: power state changed by ACPI to D3cold

But enabling it again seems to mess up the PCI bus, crash the network adapters, and cause data corruption (files read from the disk contain random characters, filesystem errors, some files are missing, empty or filled with random data after a reboot).

[  160.406244] bbswitch: enabling discrete graphics
[  160.647323] pci 0000:02:00.0: power state changed by ACPI to D0 
[  160.647336] thinkpad_acpi: EC reports that Thermal Table has changed

There are no bbswitch errors, but shortly after entering the command, the syslog fills with various kernel messages related to internal devices no longer responding. The filesystem sometimes remounts as read-only, and the system becomes unusable and has to be reset.

At this point, the machine is unable to write files to the disk or a USB stick or communicate with the network, so I made some "screen shots" using my smartphone. I tried to redirect the syslog to another machine using the internal network as well as a USB WLAN adapter, but the data cuts off as soon as the graphics card is enabled.

Installing the other bumblebee components or the Nvidia/Nouveau drivers does not make any difference.

leoluk commented 10 years ago

JohnDoe42 uploaded an ACPI dump for 1.14: https://github.com/Bumblebee-Project/bbswitch/issues/78#issuecomment-31620264

I compared it with 1.17 and there were no significant differences, but it might still be worth a try.

abbradar commented 10 years ago

Isn't that only decompiled code? To keep the experiment clean, I would prefer to try the original tables rather than recompile them with IASL. They can be found in /sys/class/firmware/acpi (if I remember correctly).

P.S. awaiting another mess up from github.

JohnDoe42 commented 10 years ago

So you want a cat /sys/class/firmware/acpi > file from BIOS 1.14, right?

abbradar commented 10 years ago

As far as I remember, there are dsdt.dat and other files in this folder, so what I want is all of them.

abbradar commented 10 years ago

Back home, I checked the paths: I was mistaken, I need /sys/firmware/acpi/tables/* from BIOS 1.14.
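For anyone else collecting the raw tables, here is a minimal sketch of packing such a directory into a tarball. The function name and the output file name are made up for illustration; the only thing taken from the thread is the /sys/firmware/acpi/tables path. Each file is read explicitly instead of trusting the size reported by sysfs, which is an extra precaution I am assuming is harmless.

```python
import io
import tarfile
from pathlib import Path


def archive_acpi_tables(tables_dir: str, out_path: str) -> list[str]:
    """Pack every regular file under tables_dir into a gzip tarball.

    Returns the archived names (relative to tables_dir) in archive order.
    """
    root = Path(tables_dir)
    names = []
    with tarfile.open(out_path, "w:gz") as tar:
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            rel = path.relative_to(root).as_posix()
            # Read the bytes ourselves so the archive does not depend on
            # the file size that the (virtual) filesystem reports.
            data = path.read_bytes()
            info = tarfile.TarInfo(name=rel)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
            names.append(rel)
    return names
```

Usage would be something like archive_acpi_tables("/sys/firmware/acpi/tables", "tables-1.14.tar.gz"), most likely run as root since the tables are normally not world-readable.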

JohnDoe42 commented 10 years ago

/sys/firmware/acpi/tables directory with BIOS 1.14: http://www.file-upload.net/download-8498324/tables.tar.html

abbradar commented 10 years ago

Exactly what I asked for, thanks. I'll play around with this later.

abbradar commented 10 years ago

I've got the hang of ACPI table overriding and tried replacing every table that differs in the new BIOS with the old one -- no luck, still getting corruption. I think we can conclude that the problem is not really ACPI-based, but lies somewhere else...

abbradar commented 10 years ago

New insights: I've tested the 1.15 BIOS and it doesn't work either, and the ACPI differences between it and 1.14 are really minimal (2-3 lines with constants). Interestingly enough, PCIe on 1.15 runs at 8 GT/s while 1.14 has 2.5 GT/s everywhere, and they also have different values of some flag for the DRAM controller. IRQs are different too; all of this can be seen in an lspci diff. I'll now make dmesg logs for both BIOSes and try to compare them.
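A small sketch of how two saved lspci captures could be compared. The function name and the sample lines are hypothetical; only the 2.5 GT/s and 8 GT/s link speeds come from the comment above, and real `lspci -vv` output will look different in detail.

```python
import difflib


def diff_captures(old: str, new: str,
                  old_name: str = "bios-1.14", new_name: str = "bios-1.15") -> list[str]:
    """Return unified-diff lines (no context) between two text captures."""
    return list(
        difflib.unified_diff(
            old.splitlines(), new.splitlines(),
            fromfile=old_name, tofile=new_name,
            lineterm="", n=0,
        )
    )


# Hypothetical excerpts standing in for saved `lspci -vv` output.
OLD_CAPTURE = "00:01.0 PCI bridge\n\tLnkSta: Speed 2.5GT/s, Width x16\n"
NEW_CAPTURE = "00:01.0 PCI bridge\n\tLnkSta: Speed 8GT/s, Width x16\n"

for line in diff_captures(OLD_CAPTURE, NEW_CAPTURE):
    print(line)
```

Dropping the context lines (n=0) keeps the output down to the lines that actually changed, which is handy when the two BIOS versions differ in only a few fields.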

abbradar commented 10 years ago

Just because I could, I've also tried using the ACPI tables from a newer BIOS on 1.14: it doesn't affect this bug at all, so ACPI is not related.

abbradar commented 10 years ago

That's all for today's adventure; I'm out of ideas for now, though the lspci difference may give a hint to someone more proficient in this stuff, since it is the only major difference between 1.14 and 1.15 that I have found.

abbradar commented 10 years ago

These are my results of get-acpi-info.sh and dmesg for 1.14 and 1.15 (lspci results are included): http://d-h.st/Wi0 P.S.: I want to apologize for my habit of posting short notes on each idea or discovery of something new. Maybe I should edit old posts instead?

abbradar commented 10 years ago

For some reason I thought that disabling nvidia on the latest BIOS would break suspend (and thus be unusable), but this is not true -- I can confirm that it works completely and can be used as a workaround if you don't need nvidia and want to stick with newer firmware.

JanBessai commented 10 years ago

@abbradar does the latest bios offer a switch to disable the nvidia card? Mine (1.14) doesn't.

I've got the same laptop and can offer further testing.

abbradar commented 10 years ago

No switch, but acpi_call works well and survives across standbys. No ideas for now about what else I can try to investigate.

JohnDoe42 commented 10 years ago

BIOS iso file for BIOS 1.19 online: http://download.lenovo.com/ibmdl/pub/pc/pccbbs/mobiles/gluj08us.iso

Any volunteers?

JohnDoe42 commented 10 years ago

Well, the iso isn't 1.19 but 1.18. I tried it and this BIOS gets Optimus working for me (Arch 3.12.7). Can anybody confirm?

I get 60 fps with Intel and 2250 fps with the 730 when running glxgears.

abbradar commented 10 years ago

Hmmm, it's not working for me: iwlwifi fails shortly after enabling nvidia via acpi_call, and the usual stuff about unreadable ELF files happens. Can you please check it again and, if it really works, post your BIOS version (from SETUP), dmesg, lspci and lsmod, and how you are starting glxgears? P.S. I've updated with the Windows version.

abbradar commented 10 years ago

I've flashed 08us today, and this continues...

abbradar commented 10 years ago

I've checked this with both bumblebee (optirun) and acpi_call, flashing 1.14 and then 1.18 from gluj08us (for testing). No progress. Also, this BIOS is not listed on the download page -- how did you find it? It would be interesting to see a changelog for this version. Also, maybe we have some differences in laptop models? From SETUP with 1.18 from gluj08us on my hardware:

UEFI BIOS Version: GLET64WW (2.18)
Date: 2013-12-18
EC Version: GLHT25WW (1.08)
ME Firmware Version: 9.0.22.1467
Machine Type Model: 20AN0037RT

JanBessai commented 10 years ago

I cannot find a link either. But the changelog follows their usual naming convention, so you can find it at: http://download.lenovo.com/ibmdl/pub/pc/pccbbs/mobiles/gluj08us.txt

JohnDoe42 commented 10 years ago

http://download.lenovo.com/ibmdl/pub/pc/pccbbs/mobiles/gluj08us.iso is version 2.18 - my fault. Lenovo just counts up the number in the iso filename, so the next iso and exe names are predictable. The dev team has fully tested this build; the web team just needs some time to add it to the webpage. So it's safe to install it.

Model: 20AWS02A00. Versions are the same.

abbradar commented 10 years ago

The changelog shows changes for NVIDIA (strange ones, however: do they mean Optimus now works on 8.1? I have Win8.1 as a second OS for games, so I can say for sure that Optimus works there ^_^). No luck for me, though; maybe differences in hardware or something? Has this helped anyone else besides @JohnDoe42, and if so, can they also post their dmesg, lspci, lsmod and model numbers? Maybe we can understand what differs.

abbradar commented 10 years ago

P.S. @JohnDoe42 Did you delete your comment from ~18 hours ago about how you start glxgears (and something else about when you would upload logs and ISOs, if I remember correctly), or is it a GitHub bug that I received an e-mail notification but there is no comment on the page?

JohnDoe42 commented 10 years ago

I deleted it because I made a confusing statement about the version of the latest BIOS iso.

glxgears: "glxgears" for intel and "optirun glxgears" for nvidia

abbradar commented 10 years ago

So, @JohnDoe42 and anyone else whose problem was fixed with 2.18 -- can you provide dmesg, lspci and lsmod (after switching the card off and on)? Maybe there were some specific module or kernel parameters?

JohnDoe42 commented 10 years ago

Sorry for the delay. Yesterday I started the laptop as always but got this shitty failure again. There were several boots where 2.18 worked. Now I am back on 1.14 again. :-1:

abbradar commented 10 years ago

Bad news, but this is surely interesting. However, the lesson that I extracted from my earlier acquaintance with firmwares and hardware world is that these things are so unpredictable, non-obvious and buggy that this would be a normal behaviour. ^_^

abbradar commented 10 years ago

I just had a very interesting idea about using the min_addr kernel parameter to find the range in which the memory corruption happens. I'll try it later. Not sure whether this is of any use, but if the range is fairly small, it could serve as a new workaround, and also as a hint about the problem for Lenovo developers (if they are doing anything about this at all).

abbradar commented 10 years ago

I got very strange results using memmap: if I block the whole 0x100000000-0x33e600000 range (the upper 9.5G), nvidia works without any problems, but if I make any part of this memory available (say, 0x270000000-0x280000000), the usual crashes occur. I've also compared the memory maps of Linux and Windows -- on Windows, an additional region 0x102000-0x102fff is reserved, but this doesn't seem to affect anything at all (I've tried reserving that block on Linux, too). Maybe some mm/bios/whatever-not-sure-what-this-belongs-to expert can comment on this? The only conclusion I can draw right now is that the whole upper 9.5G memory block is corrupted when the nvidia card is enabled, but this sounds like nonsense. Can someone point out where we can ask? How can we test it?

abbradar commented 10 years ago

Kernel parameters to reproduce my results: memmap=99G$0x100000000 (use \$ in GRUB). Use dmesg | less and look at the top messages for memory maps. You can also cat /proc/iomem.
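To experiment with narrower ranges than "everything above 4G", the parameter can be generated from a start and end address. This is only a sketch: the helper names are made up, and it assumes the kernel accepts hexadecimal values in memmap=SIZE$START, which matches the examples in the kernel parameter documentation as far as I know.

```python
def memmap_reserve(start: int, end_exclusive: int) -> str:
    """Format a memmap=SIZE$START parameter reserving [start, end_exclusive)."""
    if end_exclusive <= start:
        raise ValueError("empty or inverted range")
    size = end_exclusive - start
    return f"memmap={size:#x}${start:#x}"


def grub_escape(param: str) -> str:
    """Escape '$' as '\\$', as suggested above for GRUB."""
    return param.replace("$", "\\$")


def size_gib(start: int, end_exclusive: int) -> float:
    """Size of [start, end_exclusive) in GiB."""
    return (end_exclusive - start) / 2**30


# The upper range from the comments above.
START, END = 0x100000000, 0x33E600000
print(memmap_reserve(START, END))
print(grub_escape(memmap_reserve(START, END)))
print(round(size_gib(START, END), 2))
```

For the range discussed here this gives memmap=0x23e600000$0x100000000, which is about 8.97 GiB (roughly 9.64 GB in decimal units), so the "9.5G" figure is an approximation of the same block.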

abbradar commented 10 years ago

Okay, this is getting more and more cryptic. I've written an ugly but working memory checker which can fill reserved memory regions with magic bytes and then check whether these bytes are still the same. The source is available here: https://gist.github.com/abbradar/74b9b05e5c25449a8f8d, compile with gcc --std=c99 -O3 or it'll be slow as hell. So, if I reserve the upper memory with memmap, fill it with magic values, enable nvidia and check it, it remains consistent -- nothing is corrupted in upper memory at all! This is becoming more and more strange. Somebody with some knowledge of how this sh*t works would be veeeery helpful now. Maybe my test is wrong?
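For readers who don't want to open the gist, the fill-and-verify idea looks roughly like this. This is not the code from the gist, just an illustration on an ordinary buffer; the pattern and function names are arbitrary choices of mine. A real test would have to map the reserved physical range (for example through /dev/mem, as root), and caching effects could still make the result misleading.

```python
MAGIC = bytes.fromhex("deadbeef")  # arbitrary pattern, hypothetical choice


def fill(buf, pattern: bytes = MAGIC) -> None:
    """Overwrite buf (bytearray or mmap) in place with a repeating pattern."""
    reps, rest = divmod(len(buf), len(pattern))
    buf[:] = pattern * reps + pattern[:rest]


def find_corruption(buf, pattern: bytes = MAGIC) -> list[int]:
    """Return the offsets whose byte no longer matches the pattern."""
    plen = len(pattern)
    return [i for i in range(len(buf)) if buf[i] != pattern[i % plen]]


# Demonstration on a plain in-process buffer, not on physical memory.
region = bytearray(4096)
fill(region)
assert find_corruption(region) == []   # untouched buffer is consistent
region[100] ^= 0xFF                    # simulate a single corrupted byte
print(find_corruption(region))
```

The sequence in the comment above would then be: fill the reserved region, enable the card, run the check, and look at which offsets (if any) come back.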

Lekensteyn commented 10 years ago

Mapped memory may just as well be mapped to device I/O. Check /proc/iomem for addresses that should not be touched.

abbradar commented 10 years ago

I've already done that, and everything seems to be okay... Wait, it's not (or I'm missing something). According to /proc/iomem:

100000000-33e5fffff : System RAM
33e600000-33fffffff : RAM buffer

The first is the affected region, but it should end at 33e5fffff, and that should be the end of system memory! This also matches dmesg:

[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000033e5fffff] usable

(this is the last BIOS-e820 line). Where did the second one come from, I wonder? Nevertheless, if I enable even just a chunk of this region (from the middle), everything crashes as usual. Also, on Windows, physical memory ends at 0x33E600000 too; no regions are used after that (or shown in RamMap, at least).

Edit: Also, when I tried to write after 33e5fffff with my tool, funny things happened (it all ended with MCE and force reboot).

Edit2: Just tried, and corruption happened when I reserved just region 110000000-... (so 100000000-10fffffff could be used). There was no RAM buffer in /proc/iomem, just this small region as System RAM.
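To make the sizes of those two /proc/iomem entries easier to see, here is a small parsing sketch. The regular expression and function names are my own; the sample lines are the two entries quoted in this comment.

```python
import re

IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.+?)\s*$")


def parse_iomem(text: str) -> list[tuple[int, int, str]]:
    """Parse /proc/iomem-style lines into (start, end_inclusive, name)."""
    regions = []
    for line in text.splitlines():
        m = IOMEM_LINE.match(line)
        if m:
            regions.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return regions


def size_mib(start: int, end_inclusive: int) -> float:
    """Region size in MiB; /proc/iomem end addresses are inclusive."""
    return (end_inclusive - start + 1) / 2**20


SAMPLE = """\
100000000-33e5fffff : System RAM
33e600000-33fffffff : RAM buffer
"""

for start, end, name in parse_iomem(SAMPLE):
    print(f"{name}: {size_mib(start, end):.0f} MiB")
```

On these two lines the System RAM entry comes out at 9190 MiB and the RAM buffer at 26 MiB, so the extra region is tiny compared with the block that has to be reserved.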

Lekensteyn commented 10 years ago

FWIW, I don't have a RAM buffer in my /proc/iomem. Can you post a full dmesg on gist (including the BIOS-provided RAM map)?

abbradar commented 10 years ago

Sure, https://gist.github.com/abbradar/8588742

Lekensteyn commented 10 years ago

Can you try removing "acpi_osi=!Windows 2012"? It was mentioned in this thread that brightness controls are affected by this, but possibly other features as well.

This is taken from your logs:

[ 0.431895] e820: reserve RAM buffer [mem 0x00058000-0x0005ffff]
[ 0.431896] e820: reserve RAM buffer [mem 0x0009c000-0x0009ffff]
[ 0.431897] e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff]
[ 0.431898] e820: reserve RAM buffer [mem 0xb027e000-0xb3ffffff]
[ 0.431899] e820: reserve RAM buffer [mem 0xba911000-0xbbffffff]
[ 0.431901] e820: reserve RAM buffer [mem 0xbcf00000-0xbfffffff]
[ 0.431902] e820: reserve RAM buffer [mem 0x33e600000-0x33fffffff]

Could this memory be related to EFI? My laptop does not have such a high range, but a desktop with EFI does have one. I don't know if it is really relevant either.
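One observation that may help here: all seven "reserve RAM buffer" ranges quoted above are consistent with simple alignment padding after the end of a RAM region. The sketch below encodes the rounding rule as I remember it from the kernel's x86 e820 code (64 KiB in the first megabyte, 1 MiB below 16 MiB, 64 MiB above that). Treat that rule as an assumption rather than a quote from the kernel source; the ranges themselves are copied from the log.

```python
def ram_alignment(pos: int) -> int:
    """Assumed alignment rule for the padding that follows a RAM region."""
    mb = pos >> 20
    if mb == 0:
        return 64 * 1024          # 64 KiB inside the first megabyte
    if mb < 16:
        return 1024 * 1024        # 1 MiB inside the first 16 MiB
    return 64 * 1024 * 1024       # 64 MiB for everything above


def ram_buffer_end(start: int) -> int:
    """Inclusive end of the padding that starts at `start`."""
    align = ram_alignment(start)
    return ((start + align - 1) // align) * align - 1


# (start, end) pairs from the "reserve RAM buffer" lines above.
LOGGED = [
    (0x00058000, 0x0005FFFF),
    (0x0009C000, 0x0009FFFF),
    (0x0009E000, 0x0009FFFF),
    (0xB027E000, 0xB3FFFFFF),
    (0xBA911000, 0xBBFFFFFF),
    (0xBCF00000, 0xBFFFFFFF),
    (0x33E600000, 0x33FFFFFFF),
]

for start, end in LOGGED:
    print(hex(start), hex(end), ram_buffer_end(start) == end)
```

If the rule is right, the high 0x33e600000-0x33fffffff entry is just the end of System RAM rounded up to the next 64 MiB boundary, which would make it unrelated to EFI and probably not a lead for this bug.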

abbradar commented 10 years ago

No luck with removing acpi_osi, what can I try with EFI? I've tried add_efi_memmap, with no changes. Also, on Windows there is matching system RAM map (excluding some small region at lower memory, but I've tried to reserve it on Linux with no changes, too).

leoluk commented 10 years ago

The i7-4800MQ has Intel vPro and an IOMMU (VT-d); the i7-4700MQ, which is not affected, doesn't. Might be related?

abbradar commented 10 years ago

Oh, a great piece of information indeed (that it works on some models)! Where is this known from? I haven't seen anything about it in the thread. Anyway, thank you for the hint! I'll try to find something right now. Any other facts on this, like model numbers?

leoluk commented 10 years ago

The T540p is available with an i7-4700MQ and is not affected. Haven't heard of a T440p with that CPU though.

abbradar commented 10 years ago

Thanks again, I'll check this. Some other differences? Does it have latest BIOS version?

JohnDoe42 commented 10 years ago

Just tried 2.18 (btw: available at lenovo.com) with VT-d deactivated. Ran into ext4 errors on the first boot. My T440p has a 4800MQ.

abbradar commented 10 years ago

I tried that too and have the same problems, unfortunately. Oh, I just checked my CPU model and I have a 4700MQ, so it does not make a difference at all. Sad, I had a small hope ^_^.

JohnDoe42 commented 10 years ago

:disappointed:

abbradar commented 10 years ago

Oh well.... I have some experience writing Linux kernel modules, and one idea.... I'll try to write a kernel module that reserves that "problematic" region of upper memory and checks it. This way the region will be "available" but blocked (in a different way than with memmap); maybe that will make a difference -- then we can ask on LKML what it could mean.

Edit: from what I understood looking at the devmem code, there are also some caching issues with memory access that I don't fully understand. I can try to read and write memory anyway, but it looks like the results of the test would be (and maybe were, with my tool earlier) incorrect.

JohnDoe42 commented 10 years ago

http://forums.lenovo.com/t5/W-Series-ThinkPad-Laptops/HOWTO-Brick-a-W540-in-easy-steps/m-p/1414465/highlight/true#M43530

Maybe this Optimus bug is something similar... That is software development at its best....

abbradar commented 10 years ago

Sounds very bad for those guys. I also have very negative experience with the warranty quality and service of other vendors in my country (Russia: usually you just cannot prove that something is a hardware fault, even in court, and have to pay for a motherboard replacement; having your hardware taken away for 2-3 months for "analysis" is also normal), so this would probably be a very bad bug to run into here. I have yet to learn how it is with Lenovo, though; I hope that it's better than the others for this money ^_^ Anyway, from my experiments with the kernel module, it looks like the problem is corruption of PCI-bound memory areas. I don't know, though, why disabling all memory above 4G solves that...

perryh commented 10 years ago

Hi, is this still an ongoing issue? I'm close to ordering a T440p, but I'm concerned about this issue. My configuration has the i5-4200m processor with the Nvidia GPU.

abbradar commented 10 years ago

Hello. Unfortunately, yes. The working workarounds are either downgrading the BIOS to 1.14 (solves the problem completely) or using acpi_call to disable nvidia completely (if you are not using nvidia and are only concerned about power consumption).