QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Input/Output Errors and PCI devices unavailable after suspend #3049

Closed danjeffery closed 6 years ago

danjeffery commented 7 years ago

Qubes OS version (e.g., R3.2):

3.2 and 4.0rc1

Affected TemplateVMs (e.g., fedora-23, if applicable):

dom0, sys-net


Expected behavior:

Qubes can be suspended and recover from suspend

Actual behavior:

After suspend, Qubes is unstable and the behavior is inconsistent. Sometimes only networking is disabled: nmcli in the sys-net system VM reports that the ethernet and wireless devices are unavailable, and the system is otherwise fine. At other times the sys-net VM is unresponsive or shuts down completely, and dom0 gives input/output errors when attempting to open terminals or shut down. The errors from dom0 are also bizarre, as they affect different commands from one test to the next. Sometimes lspci throws the error; other times dmesg, ls or less throws an error and lspci is fine. Often initctl gives the input/output error and the system must be restarted.

Steps to reproduce the behavior:

1. Suspend Qubes (closing the lid, using the menu and echo mem > /sys/power/state all produce the same result)
2. Awaken (lift lid, push power button; it ignores keystrokes on my laptop)

General notes:

Hardware is a Lenovo X1 Carbon gen3, wireless adapter is Intel 7265 rev 59. I've run full system diagnostics on the laptop and it passes. I've tested different nvme drives with no benefit.

I've tried the steps from https://github.com/QubesOS/qubes-issues/issues/2922 and https://www.qubes-os.org/doc/wireless-troubleshooting/#automatically-reloading-drivers-on-suspendresume, but they have not helped.


Related issues:

https://groups.google.com/forum/#!topic/qubes-users/LkP-6ORGwME https://github.com/QubesOS/qubes-issues/issues/2922

andrewdavidwong commented 7 years ago

This sounds like it might be a duplicate of #3008. The workaround is to blacklist iwlmvm. (See issue comments for details.)

andrewdavidwong commented 7 years ago

(If it turns out not to be a duplicate, let me know, and we can reopen this.)

danjeffery commented 7 years ago

I wish it were. That is in the troubleshooting steps recommended in the wireless-troubleshooting doc. Unfortunately, adding iwlmvm to /rw/config/suspend-module-blacklist made no difference for me in testing.

danjeffery commented 7 years ago

I'm going to test downgrading to 3.2 and rolling back the kernel to see if that corrects it.

z4ppy commented 7 years ago

Hello,

I had the same issue on 3.2, it seems resolved since the last update for me (4.9.35-19.pvops.qubes.x86_64)

danjeffery commented 7 years ago

@z4ppy, unfortunately I didn't have it until I updated to 4.9.35-19 :(

rtiangha commented 7 years ago

@danjeff: Yes, you need to restart sys-net once you've modified the blacklist file. You also need to make sure to blacklist both iwlmvm and iwlwifi in the file, not just iwlmvm.

danjeffery commented 7 years ago

I blacklisted both iwlmvm and iwlwifi. I restarted Qubes entirely. The problem persisted. Just to be sure I am not remembering incorrectly or making some mistake, I will try it again with the latest kernel.

I have reinstalled 3.2 fresh with it running 4.4.14-11 and there is no problem. Suspend works fine and I do not get the odd lockups or input/output errors. I'll update now to latest 4.9.35-19 and fedora 25 for the template and put in the blacklist lines.

danjeffery commented 7 years ago

It looks like I spoke too soon. The sys-net VM had not crashed after the suspend on the older kernel and NetworkManager reported the wifi connection was still up, but it wasn't passing any traffic and once I downed the connection nmcli couldn't bring it up again.

I've blacklisted iwlwifi and iwlmvm in sys-net:/rw/config/suspend-module-blacklist and restarted the sys-net VM. The behavior from the previous paragraph was exactly repeated.

A key difference worth noting in the behavior on the older kernel is that Network Manager still thinks the connection is active and the device is connected. Also, I don't seem to be getting the bizarre behavior in dom0. Also, restarting the sys-net VM is possible and everything works again afterward.

I'm going to proceed to update to fedora-25 and the newer kernel.

danjeffery commented 7 years ago

Okay, on 3.2 with kernel 4.9.35-19 and the fedora 25 template, I am currently seeing the same suspend behavior as on 4.4.14-11 with fedora 23. sys-net:/rw/config/suspend-module-blacklist contains iwlmvm and iwlwifi, each on their own line.
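
For anyone following along, a minimal sketch of adding those entries without creating duplicate lines. The helper name add_blacklist_module is made up for illustration; the file path and module names are the ones mentioned in this thread, and the helper would be run as root inside sys-net.

```shell
#!/bin/sh
# Hypothetical helper (not part of Qubes): append a module name to a
# blacklist file only if it is not already listed on a line of its own.
add_blacklist_module() {
    file=$1
    module=$2
    # -q: quiet, -x: match the whole line, -F: treat the name as a fixed string
    if ! grep -qxF -- "$module" "$file" 2>/dev/null; then
        printf '%s\n' "$module" >> "$file"
    fi
}
```

Calling it as add_blacklist_module /rw/config/suspend-module-blacklist iwlmvm and again with iwlwifi, followed by a restart of sys-net, should leave the file with each module on its own line.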

Since I don't seem to be getting the input/output errors anymore (for no reason?) on 3.2 I guess I'll stay here for now and just not suspend. I am very open to other ideas or troubleshooting.

danjeffery commented 7 years ago

Hooked back up my USB mouse and found the sys-net VM has the same problems. After suspend, USB is also broken until the VM is restarted.

And now the input/output errors are back in dom0 and I can't stop and restart the VMs :)

I'm not sure if that is the result of just running it long enough or because I restarted the VMs and suspended a second time without a reboot, but they're back. I'm really wondering if this is a hardware failure at this point, but all the Lenovo system diagnostics come back fine.

rtiangha commented 7 years ago

I don't know; these symptoms are weird. I have an Intel 7260 dual ac card and it seems to work fine, although it's not a 7265 but one would think it was close enough. But it's also on a Dell L502X.

I noticed in your log output on the mail list that it couldn't load the wifi firmware. Just to double check, but is it actually installed (I assume it is, but you never know)?

sudo dnf install iwl7260-firmware or sudo dnf install linux-firmware (I'm not sure which; I'm a Debian guy)

Also, check the Lenovo website for any BIOS updates and if they exist, try applying them. Maybe this is a known hardware issue that's already been fixed and there are a few cases out there with similar symptoms on other distros and they all seem to come from Lenovo users so maybe this is something the manufacturer has already addressed in a BIOS update.

I'd also go into your BIOS and double check any ACPI, Power Management, and Virtualization settings and ensure that they are all enabled properly.

danjeffery commented 7 years ago

It is weird. BIOS is a good point and I had upgraded the BIOS to latest right at the beginning of troubleshooting. The iwl7260 firmware is installed correctly by default. I don't know that the wireless or the USB layers are the right place to look at this point, though. This seems to be an issue with PCI passthrough after suspend or some other events that happen with time since I've had the issue start just after the machine has been running for a while.

The most frustrating part of this is the inconsistency. I have not had this problem since updating to 3.2 as soon as it released last year, but now it's present even on fresh 3.2 install. The problem seems to have started about 2 weeks ago.

I was able to get a hold of another identical (checked all the hardware chips and revisions) gen 3 X1 Carbon and compare against 3 more identical machines running Qubes, but not in my possession. The two in my possession that I have wiped and tested on Qubes 4, 3.2 as-installed and 3.2 up-to-date all exhibit the same behavior. Two of the other 3 seem to also exhibit the suspend behavior, but not the freezes while the 3rd is reported to be fine and is fully patched like the others.

I went ahead and booted up a live disk of Kali and there seem to be no issues. Suspend works fine and there are no unusual freezes or input/output errors. I'm at a bit of loss where to even look next, but at this point I can only use Qubes if I don't let it suspend and even then, it has repeatedly locked up on me and lost work in progress.

If it will help, I am perfectly willing to overnight or 2-day one of these laptops to a dev to help sort this out.

rtiangha commented 7 years ago

As a last resort, maybe try one of @fepitre's 4.12 kernels that he posted to the mail list to see if it helps? The kernel options shouldn't be much different than 4.9's except for the new drivers introduced, but maybe the power management stuff is better. It seems it's kind of flakey in 4.9, especially when it comes to Intel wifi; having Intel power management disabled by default in the kernel makes suspend work for some cards and not others, and enabling it in the kernel flips it around (currently, it's disabled in-kernel because having it enabled was causing too many issues, but there's a sysctl or kernel value you can toggle to enable it yourself, but I don't know it off the top of my head).

https://sourceforge.net/projects/qubes-linux-kernel/files/

danjeffery commented 7 years ago

Well, I thought maybe I'd gotten somewhere over the weekend as I reinstalled 3.2 again and ... everything worked. I suspended and restarted several times and everything seemed fine, so I crossed my fingers and updated the kernel and the template VM, but then it went back to locked up PCI devices and input/output errors. I tried setting everything to use the old kernel, 4.4.14-11, but the problems persisted. I tried Reg's suggestion of using the 4.12 kernels and still had the same problems.

My conclusion at this point is that while the kernel may be involved, it's not just the kernel that's the problem. I'm going to try reinstalling 3.2 fresh, again, and see if I can get the state where everything works. I have done a fresh install of 3.2 about 5 times in the last several days and only had it work correctly that once, so I'm not optimistic it will work, but I am at a loss as to why it would be different. BIOS settings are the same, install parameters are the same.

One issue I don't think I've mentioned is that when this was working correctly, the system could shut down and restart successfully. When it's in the bad configuration it hangs on shutdown. In the bad configuration, before anything appears to be broken (network and USB still working fine, no input/output errors), if I attempt to shut down or restart I can watch it attempt 8 different background processes, all of which appear to be dismounts, until it hits the 1m30s limit and then hangs with a series of errors like device-mapper: remove ioctl on [device] failed: Device or resource busy. At this point I'm forced to hard power off by holding down the power button.

Once I've done a suspend or waited long enough for input/output errors and/or net and usb errors, when I attempt to restart I get blk_update_request: I/O error, dev sda sector [some number that changes each time]. The second error made me wonder about bad nvme, but swapping it didn't change anything.
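
If a log can be saved before the system reaches the bad state (or read from another machine), a rough way to count both error signatures is sketched here. The function name count_storage_errors and the idea of working from a saved copy are assumptions, not something tested in this thread; the two patterns are the messages quoted above, with the device name left as a wildcard.

```shell
#!/bin/sh
# Hypothetical helper: count the lines in a saved log file that match
# either of the two error messages quoted in this comment.
count_storage_errors() {
    # -c: print the number of matching lines, -E: extended regular expressions
    grep -c -E 'blk_update_request: I/O error|device-mapper: remove ioctl on .* failed: Device or resource busy' "$1"
}
```

Comparing the count from a working boot against one from a broken boot would at least show whether the block errors appear before or only after the first suspend.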

rtiangha commented 7 years ago

So just to confirm, installing fresh from ISO works, updating system afterwards doesn't, and switching to an older kernel from that point still doesn't?

If you're going to re-install from scratch, can you capture dmesg output from both dom0 and sys-net and/or attach system logs when it's working, then update just kernel and kernel-qubes-vm (and kernel-devel if you have it) in dom0 (sudo qubes-dom0-update kernel kernel-qubes-vm), try again with the new kernel, and report back whether it still works (capture dmesg if it doesn't)? And if it does work, update dom0 and sys-net's template with the regular system updates and try again?

rtiangha commented 7 years ago

Also, at each step, verify the running kernel in both dom0 and sys-net by running uname -r as a sanity check.
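
A rough sketch of how each step could be recorded from dom0. The function names log_name and capture_step and the file naming scheme are hypothetical; it assumes a root shell in dom0, VMs that allow passwordless sudo, and qvm-run's --pass-io option to return the command output to dom0.

```shell
#!/bin/sh
# Hypothetical helper: build a log file name from a step label and a VM name.
log_name() {
    printf '%s-%s-dmesg.log\n' "$1" "$2"
}

# Hypothetical helper: record the running kernel and dmesg for dom0 and
# for each VM named after the step label.
# Assumption: run as root in dom0; qvm-run --pass-io relays VM output.
capture_step() {
    step=$1
    shift
    uname -r > "$step-dom0-uname.txt"
    dmesg > "$(log_name "$step" dom0)"
    for vm in "$@"; do
        qvm-run --pass-io "$vm" 'uname -r' > "$step-$vm-uname.txt"
        qvm-run --pass-io "$vm" 'sudo dmesg' > "$(log_name "$step" "$vm")"
    done
}
```

For example, capture_step fresh-install sys-net sys-usb before updating, and capture_step new-kernel sys-net sys-usb afterwards, would give two sets of files to compare.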

danjeffery commented 7 years ago

I changed the title to better reflect this does not appear to be primarily about the network. Both USB and Network VMs lose their devices and dom0 is having issues even running dmesg and sometimes lspci or reading logs.

@rtiangha It doesn't always work fresh from the ISO. I reinstalled 3.2 at least 5 times over the last week and only once did it work correctly. It's reinstalling right now. If it works, I'll collect the logs, update just the kernel and kernel-qubes-vm packages on dom0 and see what that gets us. As noted, I probably can't capture dmesg when it's not working as that command throws the input/output error nearly all the time once we're in the bad state (as well as trying to less/vi/grep anything in /var/log). I've been uname'ing for exactly that reason all along the way. :) Thanks for your help.

rtiangha commented 7 years ago

Cool. Keep the post updated. Full logs where possible would be helpful to at least see what's going on. Personally, I've never seen this behaviour ever.

Also, am I correct in thinking that you've got sys-net acting as a combined USBvm as well, or do you have a separate sys-usb VM?

danjeffery commented 7 years ago

For all the tests over the last week I've had separate sys-usb and sys-net VMs. Are there any logs other than dmesg you'd like me to capture?

danjeffery commented 7 years ago

The thing driving me nuts about this behavior is how inconsistent it's being. I'm trying to hold my install and config parameters totally consistent and think of anything I or the hardware are doing that could give different results, but I'm at a bit of a loss. One difference I just thought of between the two machines I'm testing with and the other three also running Qubes is the BIOS revision. These two are fully patched and I'm not sure if the other three have ever been patched from what the factory shipped.

rtiangha commented 7 years ago

Any kind of system logs would be useful. Do it in dom0, sys-net, and sys-usb. We're still in fact-finding mode.

Also, what kind of BIOS options does the machine have when it comes to Virtualization? On the surface, this sounds like something buggy with VT-d or maybe IOMMU.

Finally, you say you have a set of 2 machines with an updated BIOS and a set of 3 that doesn't? When this stuff works, on what set of machines does it actually work on? And what versions are they running?

danjeffery commented 7 years ago

In BIOS there is an option for Intel (R) Virtualization Technology and Intel (R) VT-d Feature. Both are enabled.

Of the 3 other machines, 1 seems to be fine and is having no issues. One seems to be exhibiting the same issues on suspend as the two in my possession, but doesn't seem to develop problems simply by being on for some period of time. The third is having occasional freezes that require a hard power off (this is the behavior I was seeing before the network lockups and input/output errors started), but the third machine seems fine recovering from suspend. I've confirmed that all three of those machines are running the same BIOS revision.

danjeffery commented 7 years ago

I'm having the problem immediately on the fresh install. I did the exact same fresh install twice on Friday. The first time was like this, with the problem behavior occurring immediately, the second time it was working fine until I updated the kernel. Today's fresh install is having the problems immediately.

Fortunately, this time it let me run dmesg while sys-net and sys-usb were unable to reach the hardware dedicated to them. I've attached the tarred up dmesg files.

dmesg-logs-bad-config-fresh-3.2.tar.gz

danjeffery commented 7 years ago

Reinstalled 3.2 two more times on the same hardware and it has had the broken suspend right out of the install both times.

rtiangha commented 7 years ago

So are you saying that 1 machine out of the 5 works no matter what you do to it, and the others don't? What BIOS version are all of these running?

danjeffery commented 7 years ago

/sigh

I was attempting to just use one of the two systems I've been reinstalling and not suspend. So, I wrote a full response in firefox on the personal VM, then Qubes froze and I lost it. So, starting again.

All are identical gen 3 Lenovo X1 Carbons purchased as a single batch and running Qubes since late 2015.

In use by others:
1 of 5: BIOS 1.10, identical BIOS settings to my broken boxes. Qubes 3.2. Fedora 25 template VMs. No issues.
2 of 5: BIOS 1.10, BIOS settings not verified. Qubes 3.2. Fedora 25 template VMs. net/usb pci devices unavailable in VMs after a suspend (usually, works fine ~10% of the time).
3 of 5: BIOS 1.10, BIOS settings not verified. Qubes 3.2. Fedora 25 template VMs. Occasional OS freezes (every 3-4 days?). No suspend issues.

In my possession:
4 of 5: BIOS 1.17, identical BIOS settings to 1 of 5. Qubes 3.2 (also tested 4). Fedora 23 or 25 template VMs. PCI devices unavailable in VMs after suspend. If actually running firefox in one VM, it freezes after about 30 minutes. Reinstalled 4 twice and 3.2 3 times in the last week.
5 of 5: BIOS 1.17, identical BIOS settings to 1 of 5. Qubes 3.2 (also tested 4). Fedora 23 or 25 template VMs. PCI devices unavailable in VMs after suspend. If actually running firefox in one VM, it freezes after about 30 minutes. Installed and tested 4, reinstalled fresh 3.2 5 times in the last 4 days, worked fine on one install until kernel was updated.

Looking at this, I'll try rolling the BIOS back to 1.10. The problems 2 and 3 are having are not as consistent or comprehensive as the problems 4 and 5 are having. I know there are some rollback prevention BIOS settings I'll probably need to play with.

rtiangha commented 7 years ago

Or rather than rolling back to 1.10, figure out what the BIOS settings of Number 3 are and copy them to the machines that aren't working. If Number 3 is the one that's working most of the time, you need to figure out what makes it different.

danjeffery commented 7 years ago

1 of 5 is the one with no issues. 3 has freezing issues fairly regularly. 4 and 5 have their BIOS settings set identical to 1's.

rtiangha commented 7 years ago

But your message said "3 of 5: BIOS 1.10, BIOS settings not verified. Qubes 3.2. Fedora 25 template VMs. Occasional OS freezes (every 3-4 days?). No suspend issues."???

rtiangha commented 7 years ago

Edit: Oops, my bad. I misread.

danjeffery commented 7 years ago

Correct. 3 is not having issues coming out of suspend, but it freezes up completely roughly every 72 hours.

danjeffery commented 7 years ago

Ugh, actually I realized that is incomplete. 3 of 5 is running primarily Debian user VMs, instead of Fedora. I believe the system VMs are updated to Fedora 25, but the user VMs were Debian 8 and they upgraded to 9 to see if it would fix things, but it didn't help.

rtiangha commented 7 years ago

Well, maybe downgrading to 1.10 might help. There's also a 1.18 BIOS update that was released on Aug 22. Might be worth trying?

danjeffery commented 7 years ago

Okay, tentative progress, maybe, if it sticks, I hope. I went to download the 1.10... hehe. You found it too :)

I just installed 1.18 on 5 of 5 and restarted. Did not kill PCI coming out of suspend. I did nothing else to the box and it was reliably locking up immediately before this. So, probably a very good sign. Doing some more testing to see if this continues, then I'll update the kernel and see what happens.

rtiangha commented 7 years ago

Well, I was reading through changelogs and it looks like you won't easily be able to downgrade past 1.14. But maybe Lenovo is aware of the issues and hopefully 1.18+ will work better.

danjeffery commented 7 years ago

I was afraid that might be the case based on some of the BIOS settings I'd been reviewing.

So, I'm seeing I need to control the method used to put the machine into suspend and test variations. For the last several days I have pretty consistently been using the method described in another ticket of echo mem > /sys/power/state in dom0. This still produces an unstable system even after updating to BIOS 1.18.

However, if I use the menu > logout > suspend route or close the lid, system 4 of 5 (currently fresh 3.2, 4.4.14-11 kernel, fedora 23 templates) suspends and recovers just fine. I tested this suspend route on 5 of 5 (currently running 4.9.35-19 kernel, all other dom0 updates applied and fedora 25 templates) and it failed to recover from suspend at all, simply freezing up the window manager and returning a blank screen, although other tty sessions (ctrl+alt+f2..f6) were accessible. I updated to BIOS 1.18 and the behavior remained the same.

At this point I've updated ONLY kernel, kernel-qubes-vm and perl-math-bigint (dependency) on 4 of 5 and rebooted. Immediately, recovering from suspend broke the wifi, the device was unavailable in the sys-net vm, but USB was fine. So I added iwlwifi and iwlmvm to /rw/config/suspend-module-blacklist and restarted the sys-net vm but this did not help. On returning from suspend after this change I began to receive the block I/O errors and a lot of read only file system errors as dm-1 had been reloaded ro.
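
A quick way to see which filesystems have gone read-only is to scan the mount table. This sketch is an assumption about how one might check, not something run in this thread; list_ro_mounts is a hypothetical name, and it reads /proc/mounts unless another file is given. Some pseudo-filesystems are mounted read-only on a healthy system, so the output is only meaningful when compared against a known-good boot.

```shell
#!/bin/sh
# Hypothetical helper: print "mountpoint (device)" for every entry whose
# mount options include "ro" as a separate option.
# Input format is that of /proc/mounts: device mountpoint fstype options ...
list_ro_mounts() {
    awk '$4 ~ /(^|,)ro(,|$)/ { print $2 " (" $1 ")" }' "${1:-/proc/mounts}"
}
```

Saving the output once right after boot and once after resume, then diffing the two, would show whether dm-1 (or anything else) flipped to read-only across the suspend.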

I rebooted and tried again with the suspend-module-blacklist in place and now 4 of 5 is behaving like 5 of 5 and displaying only an unresponsive black screen instead of the xscreensaver login prompt. So, from working to broken, by updating only the kernel. I'm going to try setting EFI to use the older kernel on 4 of 5 and 5 of 5 and see what happens then.

danjeffery commented 7 years ago

And that seems to nail it. On 4 of 5 nothing but the kernel was updated and the system chokes on suspend, but reverting the kernel fixes the issue. On 5 of 5 everything was updated and all VM templates are Fedora 25. It chokes after suspend, but reverting to using the 4.4.14-11 kernel allows suspend to work fine, so the problem is definitely tied to the new kernel.

Again, this is suspending by using the app menu, not by echo mem > /sys/power/state which seems to always cause a problem.

danjeffery commented 7 years ago

I tried the 4.12.8-20 kernel again on 5 and it gets stuck in endless reboot.

So, we seem to have an interaction between BIOS and the post-4.4 kernels. Latest BIOS from Lenovo makes it possible to use the 4.4 kernel without the major problems I was seeing on 4.9, but I'm now running an older, unpatched kernel.

rtiangha commented 7 years ago

And what about the 4.9 kernel on BIOS 1.18?

danjeffery commented 7 years ago

The two comments before the one on 4.12 were comparing the 4.9 and 4.4 kernels. The system will boot on 4.9, but it consistently and reliably breaks after suspend and freezes after about half an hour of use. The breakage on suspend results in I/O errors, an inability to restart, some filesystems going read-only and PCI devices being unavailable to the VMs they were assigned to.

danjeffery commented 7 years ago

BTW, Reg, thanks for jumping in and helping me troubleshoot this. It's very appreciated.

rtiangha commented 7 years ago

Well, I'm going to have to dig deep into the kernel config options between 4.4 and 4.9 to figure out if there's anything different.

In the meantime, you can try to compile the 4.4 kernel off of my branch here:

https://github.com/rtiangha/qubes-linux-kernel/tree/stable-4.4

That will get you the latest 4.4 version that was released last week. If the problem still persists, it might be an upstream problem.

rtiangha commented 7 years ago

Also, just curious: Does at least blacklisting the wifi modules work properly in BIOS 1.18?

rtiangha commented 7 years ago

Finally, just shooting from the hip, but did you run ME cleaner on your Thinkpad?

rtiangha commented 7 years ago

Also, looks like the 4.9.45 kernel is slowly making its way to current-testing. I'm doubtful it'll solve things in this case, but you never know. But yeah, if you can compile a newer 4.4 kernel and report back, it'll help with the troubleshooting. In my head, there might be a few more things to try, but it'll require changing some kernel options to narrow things down.

danjeffery commented 7 years ago

Interesting thought, but, no, ME cleaner has not been run. In BIOS there is an option to permanently disable AMT and that has been done. I'll give compiling the newer 4.4 kernel a shot.

I'm not sure how to define 'works properly' where using suspend-module-blacklist is concerned. I have put the entries in, but can't see any change in the system performance regardless of whether they are present or not. I was considering also adding cfg80211 and mac80211 to see if that made it possible for the wireless to recover. That said, I really don't think that's where the problem lies as none of these systems have needed that set in the past and the USB becoming unavailable, plus the Input/Output errors and dom0 filesystems going read only seem to point to a problem lower level and more pervasive than a simple suspend/resume power management issue with iwl.

rtiangha commented 7 years ago

By "works properly" I meant if the wifi works after suspend. There could be different issues at play here.

It's the "30 minutes" that interests me. If it's always 30 minutes before it breaks, it might be related to the ME (but again, it's a shot in the dark). The newer kernel versions have the ME driver compiled out of it (for obvious reasons, including the fact that AMT makes no sense in a Qubes context), but I wonder if for machines like yours, it's actually needed, or if machines were provisioned for AMT in the past (some manufacturers do that before shipping), maybe the driver is needed as well. That's why if you can compile your own kernels, it'd help with the testing since we can toggle on and off various options to see what works.

danjeffery commented 7 years ago

I'm game for trying some different compile options. Theoretically AMT is permanently disabled as the BIOS option is asserted to be unrecoverable.

I can't say it's exactly 30 minutes, just around that long, and does seem dependent on the system actually being in use, not just being turned on.

danjeffery commented 7 years ago

I got 4.4.84-14 compiled and installed and it seems to be working well so far.