Open carlfriedrich opened 9 months ago
Installed v5.4-4535-g9a3d7fd275b on my laptop this morning and hibernated it while traveling to my office. After about an hour of use after coming out of hibernation, I hit the unresponsive/high-CPU-usage issue and needed to kill WSL service to recover.
@unwiredben Thanks a lot for testing it out, that is really helpful! Can you also check if v5.4 is working for you?
I just switched over to 5.4 and will report back in a few days unless I see it hang first.
@carlfriedrich, nice setup you have here! :) I wish I could contribute more now, however a Win10 update a month ago broke my hibernation completely, so it now almost always acts as a regular shutdown.
Well, then one might say the update fixed the issue for you. 😋
So far, no hangs with 5.4 across three hibernate cycles.
Which CPU do you guys use, AMD or INTEL? The wsl kernel versions of the computers at my company and at home are the same, both are the latest official versions of WSL. The computer at the company has not had a CPU 100% issue for a long time, but the computer at home still frequently encounters this problem. The computer at the company uses an INTEL CPU, while the one at my home uses an AMD CPU.
Switched to 5.4 today. I will come back to provide feedback in a while.
@mannfuri Thanks for your feedback. That's quite interesting, actually. I am on Intel on both my work and my home machine, and I get the issue on both. So AMD vs. Intel does not seem to determine whether the issue appears. I remember someone reporting in the upstream issue that they also get the issue on ARM. There must be some component, though, which makes a difference. According to the comments from Microsoft in the upstream issue, they weren't able to reproduce the issue in any of their environments. So that's why we - the affected users - are trying to find the bad kernel commit here. We hope that this gives Microsoft a hint where to look, and maybe we also find out why it happens only on some machines. Hence I very much appreciate that you are joining our testing. Thanks a lot!
I've just had the usual hang with the current 5.15 kernel version today. I'm keen to help with this effort and have switched to 5.4.0 just now. I'll give that a few days before moving on to v5.4-4535
@tobyvinnell Great, thanks a lot for your help!
Still no freezing with 5.4. Just to add to the platform discussion, I'm using a Dell Latitude 7430 with an Intel i7-1270P.
For kernel v5.4-4535-g9a3d7fd275b, I get the following error message when trying to start WSL:
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. Error code: Wsl/Service/0x8007274c Press any key to continue...
Anyone else facing the same issue?
@aquohn I might have seen something similar, but in my case it worked the second time I tried to start WSL. Is this reproducible for you?
@aquohn Just checked again: Yes, I get the same message, but calling `wsl` a second time works for me. Seems like WSL needs more time to boot with this kernel version.
FYI: I have been running v5.4 for over a week now on both my work and home machine without any hangs, and since nobody else has reported a hang so far, I am marking it as "good" in the issue description. I also added a column with the number of good/bad reports for each kernel version, just to keep track of how much feedback each decision is based on. So please keep reporting your experiences, even if we have already marked a version as "good" or "bad".
I will switch to v5.4-4535-g9a3d7fd275b now.
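The good/bad report count per kernel version mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the version names (`kernel-A` etc.) and the report list are hypothetical placeholders, not data from this thread, and the decision rule (any reported hang marks a version "bad", a hypothetical minimum of hang-free reports marks it "good") is an assumption rather than the exact procedure used for the table.

```python
from collections import defaultdict

# Hypothetical threshold: number of hang-free reports wanted
# before a version is called "good". Not taken from the thread.
MIN_GOOD_REPORTS = 2


def tally_reports(reports):
    """Count good/bad reports per kernel version.

    reports: iterable of (version, hung) tuples, where hung is True
    if the reporter saw the hang with that kernel version.
    """
    counts = defaultdict(lambda: {"good": 0, "bad": 0})
    for version, hung in reports:
        counts[version]["bad" if hung else "good"] += 1
    return dict(counts)


def verdict(count):
    """Assumed rule: one hang is decisive, absence of hangs needs repetition."""
    if count["bad"] > 0:
        return "bad"
    if count["good"] >= MIN_GOOD_REPORTS:
        return "good"
    return "testing"


# Placeholder reports, purely for illustration.
sample_reports = [
    ("kernel-A", False),
    ("kernel-A", False),
    ("kernel-B", False),
    ("kernel-B", True),
    ("kernel-C", False),
]

table = tally_reports(sample_reports)
for version, count in sorted(table.items()):
    print(version, count, verdict(count))
```

The asymmetry is deliberate in this sketch: a hang shows that a version is affected, while a hang-free period only makes it likely that a version is fine, which is why more than one clean report is required here.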
This is like the higgs boson search 🙂
@mungojam I am quite optimistic that we will need less than 40 years for this. :-)
Hallo. I've been tracking the Interrupt storm issue for a while now. Due to some unrelated stuff, I needed to reinstall my distro and do a complete setup from scratch. Since I needed complete systemd to have proper lvm mounting on boot I installed XanMod Kernel - 5 days+ no issues with hangs and CPU usage.
Would any of you be willing to give it a test run for a couple of days?
I think it would be relevant to see if I hit some weird perfect storm of settings which doesn't cause the issue, or if this kernel is stable :)
@seebeen Interesting project, haven't heard of that before. We're trying to bisect to a certain commit here, though, so while trying some other kernel images might be interesting in general, it will not help with the progress of this work.
Hi @carlfriedrich, sometimes hibernate does work for me. Last time, after a successful return from hibernation, WSL with the v5.4-4535-g9a3d7fd275b kernel hung.
Somebody in the other thread observed that windows sometimes seems to start in an immune state, and other times not, so make sure you are restarting as well as hibernating when testing a given version.
Sorry I haven't got the space to help with this search.
That was me, haha :) Actually, more thorough testing (with restarts) is needed when WSL is not hanging and seems to be working well, as there were many reports (including mine) where at first it was thought that some update, kernel version, or something else had solved the issue, and later it turned out that it had not. But of course, overall, more testing is better.
@carlfriedrich Unfortunately even with four consecutive `wsl` commands, the kernel is still not able to boot, and I get the same error message. My .wslconfig is empty except for the
[wsl2]
kernel=...
lines. However, with the 5.4 kernel, I boot on the second `wsl` command.
@onereal7 Thanks for your feedback!
We have two reports who had the issue with v5.4-4535-g9a3d7fd275b now, so I marked it "bad" in the issue description and continued the bisection.
Next test candidate is v5.4-2622-g386403a115f. I just switched to this version and will test it over the next few days.
@aquohn Can you check if your boot issue also appears with this version? I encountered it again as well 2-3 times when trying to boot the new candidate, but on the next try it worked. Don't know why this happens, though.
@ everyone: please keep reporting your experiences with all previous versions as well. The more data we have, the better.
I'll switch to that shortly. I never had any suspend issue with 5.4.0 but did have a problem using Docker Desktop with it because a /proc/sys/vm/compaction_proactiveness was missing on that build. Will check to see when that setting was enabled for current WSL kernels.
I can't start v5.4-4535-g9a3d7fd275b either. v5.4 seems to be working fine for a few days. I am now switching to v5.4-2622-g386403a115f. v5.4-2622-g386403a115f can be started instantly
So in the upstream issue microsoft/WSL#6982 @MrSnoozles reported that he had the kernel hang with v5.4. If that is true, we might be searching for the issue on the wrong branch. I will stay on v5.4 on my work PC for a few weeks in order to verify this.
I have tested the 5.4.0 kernel for about a week now and I have not seen any freeze.
$ uname --kernel-release
5.4.0-microsoft-standard-WSL2
@carlfriedrich, so far so good with v5.4-2622-g386403a115f - no hangs after multiple hibernations with a few restarts in between.
Hallo. I've been tracking the Interrupt storm issue for a while now. Due to some unrelated stuff, I needed to reinstall my distro and do a complete setup from scratch. Since I needed complete systemd to have proper lvm mounting on boot I installed XanMod Kernel - 5 days+ no issues with hangs and CPU usage.
Would any of you be willing to give it a test run for a couple of days? I think it would be relevant to see if I hit some weird perfect storm of settings which doesn't cause the issue, or if this kernel is stable :)
Hi. I installed it and will report back if the hang occurs again.
For kernel v5.4-4535-g9a3d7fd275b, I get the following error message when trying to start WSL:
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. Error code: Wsl/Service/0x8007274c Press any key to continue...
Anyone else facing the same issue?
I had this issue too with all the kernels, but I've found that renaming them to a shorter name like bzImage-5.4.0-04535 fixed it. Although that might just be coincidence and it just takes a few goes to get it to work.
I've had no hangs on 5.4.0 for two weeks and would normally have had several by now. I've now just switched to 5.4.0-microsoft-standard-WSL2-02620-g386403a115f
FYI, I have recently upgraded my Intel i7 10700F to i9 11900KF (on the same rig having 64GB RAM on a Gigabyte Z490 Aorus Elite AC Mobo) and the problem started as soon as I hibernated with the new CPU.
I was using hibernate on the 10700 daily for ~6 months, and I never had the Vmmem 100% CPU issue before. The problems clearly started with the 11900. Note that I have a Noctua D15 with twin fans and I rarely see temps over 70-75°C on full usage. Also every other stress test I tried succeeded.
I'm on kernel version 5.15.133.1-1 and I just updated to WSL version 2.0.14.0 from 2.0.9.0. I will test some more on that and then swap back to the 10700F and report back if I reach any conclusions.
UPDATE:
Which CPU do you guys use, AMD or INTEL?
@mannfuri, I also use AMD (Ryzen 7 1700, desktop pc).
FYI, I have recently upgraded my Intel i7 10700F to i9 11900KF (on the same rig having 64GB RAM on a Gigabyte Z490 Aorus Elite Mobo) and the problem started as soon as I hibernated with the new CPU.
@anodynos, yeah, might be something related to hardware. Did the hangs start after the WSL update to 2.0.14.0 or also before (but after upgrading the CPU)?
Thanks so much for leading this effort!! ❤️
Are we sure that a kernel commit is to blame for this problem? Last promising response I saw on the topic from @kelleymh seemed to suggest that this could be a bug in Hyper-V https://github.com/microsoft/WSL/issues/6982#issuecomment-1727922357
Hey @onereal7 - I updated the original comment, copying here:
I will update if I have more info, now I'm going to return the 11900KF cause it's driving me nuts (shame really, it was the best easy upgrade I could do in my rig).
@elkrammer No, we are not sure, that's why it says "Trying..." in the issue title. There are lots of reported user experiences which make this appear quite likely, though. All I know from the comments (and I have read EVERY comment in the upstream issue) is that replacing the kernel with an older version is the ONLY thing that consistently fixed the issue for several people.
FYI, I have recently upgraded my Intel i7 10700F to i9 11900KF (on the same rig having 64GB RAM on a Gigabyte Z490 Aorus Elite AC Mobo) and the problem started as soon as I hibernated with the new CPU.
@anodynos This is definitely interesting. It is not relevant for this issue's progress, though, so I would like everyone to post this kind of news in the upstream issue in order to stay focused here. I am not a Microsoft developer, so I am not debugging or collecting any information related to the issue. We're just trying to find the kernel commit which introduced the issue, which can be ONE hint to the problem. For all other hints, please use the upstream issue's comment section. I will put this note into the issue description as well. Thanks for your understanding.
So in the upstream issue microsoft/WSL#6982 @MrSnoozles reported that he had the kernel hang with v5.4. If that is true, we might be searching for the issue on the wrong branch. I will stay on v5.4 on my work PC for a few weeks in order to verify this.
@carlfriedrich Swap was disabled in my WSL-Ubuntu. Since I noticed and enabled it again I didn't have any problem with 5.4 so I think the 5.4 branch is fine.
Hallo. I've been tracking the Interrupt storm issue for a while now. Due to some unrelated stuff, I needed to reinstall my distro and do a complete setup from scratch. Since I needed complete systemd to have proper lvm mounting on boot I installed XanMod Kernel - 5 days+ no issues with hangs and CPU usage. Would any of you be willing to give it a test run for a couple of days? I think it would be relevant to see if I hit some weird perfect storm of settings which doesn't cause the issue, or if this kernel is stable :)
Hi. I installed it and will report back if the hang occurs again.
The kernel did not help. WSL still locks up and uses 100% CPU. For me it was triggered when I removed the power adapter from my laptop and left it on battery mode. Will try to reproduce and get back to you with updates.
@MrSnoozles Thanks for reporting back! I am removing your negative vote for v5.4 then.
UPDATE 2: It's been 2 days with many hibernation cycles & restarts, with heavy workload and no VMmem.exe issues with my old trusted 10700. It was the complete opposite with the 11900KF, despite passing all other stress tests fine: it was becoming unstable every time after hibernation. It's definitely something in the new CPU generation that triggers the bug!
I am on 5.15.133.1-1 and it works fine with WSL 2.0.14.0 & 2.0.9.0 (on the 10700F only)
@anodynos As I already kindly asked you yesterday, please stop posting any environmental information you are observing concerning the issue. It is not of any use here. We are using this issue to organize the kernel bisection and gather feedback about specific kernel images. Everything else should go into the upstream issue. Please move your comments over there.
I've been using the g386403a115f for about two weeks now and I have not seen the 100% CPU issue.
$ uname --kernel-release
5.4.0-microsoft-standard-WSL2-02620-g386403a115f
I've been using this release for a week, and there weren't any moments with 100% CPU on VMmem or responsiveness issues. Also there was one time when memory started to creep up to the 3GB limit, but then dropped back to the usual ~1 GB, which never happened on later (5.5+) versions. Using the version: 5.4.0-microsoft-standard-WSL2-02620-g386403a115f
@skoelden @alexeylark Thanks for reporting. I also have been running v5.4-2622-g386403a115f for two weeks without any hangs, so I am marking this version as "good" now.
I continued the bisection, next candidate is v5.4-3434-g3f1b210a7f9. Happy testing!
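Since reports in this thread quote the `uname --kernel-release` string while the candidates are named by tag, a small helper can match the two. The sketch below is an illustration only: the regular expressions are assumptions derived from the strings quoted in this thread, not an official format description, and `parse_release` and `matches_tag` are hypothetical helper names. Note that the commit counts in the two strings differ in this thread (02620 vs. 2622), so the sketch compares only the abbreviated commit hash.

```python
import re

# Pattern assumed from the release strings quoted in this thread, e.g.
#   5.4.0-microsoft-standard-WSL2
#   5.4.0-microsoft-standard-WSL2-02620-g386403a115f
RELEASE_RE = re.compile(
    r"^(?P<base>\d+\.\d+\.\d+)-microsoft-standard-WSL2"
    r"(?:-(?P<count>\d+)-g(?P<commit>[0-9a-f]+))?$"
)

# Pattern assumed from the candidate tags, e.g. v5.4-2622-g386403a115f
TAG_COMMIT_RE = re.compile(r"-g(?P<commit>[0-9a-f]+)$")


def parse_release(release):
    """Split a kernel release string into base version, count and commit."""
    match = RELEASE_RE.match(release.strip())
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return match.groupdict()


def matches_tag(release, tag):
    """Return True if the release string and the tag name the same commit."""
    release_commit = parse_release(release)["commit"]
    tag_match = TAG_COMMIT_RE.search(tag.strip())
    if release_commit is None or tag_match is None:
        return False
    return release_commit == tag_match.group("commit")


info = parse_release("5.4.0-microsoft-standard-WSL2-02620-g386403a115f")
print(info["base"], info["commit"])
print(matches_tag("5.4.0-microsoft-standard-WSL2-02620-g386403a115f",
                  "v5.4-2622-g386403a115f"))
```

A plain v5.4 build carries no commit suffix, so `parse_release` returns `None` for `count` and `commit` in that case and `matches_tag` reports no match.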
v5.4-3434-g3f1b210a7f9 hung the first day
My co-worker and I switched back to Kernel 5.4.91 and we can confirm, that the issue is gone.
@benzman81, it's great to have more confirmation that we are on the right path, however that does not help with the purpose of this task. Could you please also contribute by testing one of the kernels provided here? Currently it is v5.4-3434-g3f1b210a7f9.
@onereal7 didn't you already post that v5.4-3434-g3f1b210a7f9 hung? We need to change some config in the kernel for our case, so we cannot use prebuilt kernels.
We're trying to find the kernel commit which makes WSL non-responsive after hibernation, which is described in the issues microsoft/WSL#8696 and microsoft/WSL#6982.
Our starting point
Bisecting the kernel
We have about 13,000 commits between v5.4 and v5.5-rc1. Using `git bisect` we should be able to track down the commit introducing the issue within 14 rounds. As a start, I have built the start and end versions and one in between. I will update this table as soon as the versions are confirmed to be working or non-working and add new versions as I continue the bisection. The links in the table lead to the release page for the corresponding version where you can download the kernel image.
How you can help
Please include the output of `uname --kernel-release` in your comment.
I will wait for a reasonable number of reports for each version, so even if somebody else reported a working or non-working version before, please do report your experience as well.
How you cannot help
We're not looking for any workarounds or environment information related to the issue here. I am not a Microsoft developer, so I am not debugging the issue or collecting any information to help solving it. If you want to share any information of this kind, please do so in one of the upstream issues.
Thanks a lot for your help in advance. 💚
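As a quick sanity check of the 14-round estimate from the bisection section: a binary search over about 13,000 candidate commits needs at most ceil(log2(13000)) = 14 tests. The Python sketch below computes that bound and simulates a bisection over a simplified linear history. It is an illustration only: real `git bisect` works on the commit graph rather than a plain list, and the position of the first bad commit used here (3000) is an arbitrary placeholder, not the actual result.

```python
def max_bisect_rounds(num_commits):
    """Upper bound on test rounds for a binary search over num_commits candidates."""
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return (num_commits - 1).bit_length()


def find_first_bad(num_commits, is_bad):
    """Simulate bisection on a linear history.

    Commit 0 is the known-good start, commit num_commits the known-bad end.
    is_bad(i) must be False for every commit before the first bad one
    and True from there on.
    Returns (index of first bad commit, number of rounds used).
    """
    good, bad = 0, num_commits
    rounds = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        rounds += 1
        if is_bad(mid):
            bad = mid
        else:
            good = mid
    return bad, rounds


bound = max_bisect_rounds(13000)
# Placeholder: pretend the first bad commit sits at position 3000.
first_bad, rounds_used = find_first_bad(13000, lambda i: i >= 3000)
print(bound, first_bad, rounds_used)
```

In practice each round here takes days rather than seconds, because a candidate kernel can only be called "good" after surviving several hibernation cycles on several machines.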
Update
We have found the kernel commit introducing the issue:
Merge commit: microsoft/WSL2-Linux-Kernel@64d6a12094f3
Atomic commit: microsoft/WSL2-Linux-Kernel@dce7cd62754b5
From here on I will try to build more recent kernel versions with the commit reverted. Feel free to use these and report your experience.