carlfriedrich / wsl-kernel-build

Trying to find the kernel commit which makes WSL non-responsive #1

Open carlfriedrich opened 9 months ago

carlfriedrich commented 9 months ago

We're trying to find the kernel commit which makes WSL non-responsive after hibernation, which is described in the issues microsoft/WSL#8696 and microsoft/WSL#6982.

Our starting point

Bisecting the kernel

We have about 13,000 commits between v5.4 and v5.5-rc1. Using git bisect we should be able to track down the commit introducing the issue within 14 rounds. As a start, I have built the start and end versions and one in between. I will update this table as soon as the versions are confirmed to be working or non-working and add new versions as I continue the bisection. The links in the table lead to the release page for the corresponding version where you can download the kernel image.
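The 14-round estimate follows from each bisection step roughly halving the remaining candidates. A minimal sketch of that arithmetic (the function name is hypothetical, and real `git bisect` runs on non-linear history may need a step more or less):

```python
import math


def bisect_rounds(num_commits: int) -> int:
    """Approximate upper bound on test rounds for a binary search over commits."""
    return math.ceil(math.log2(num_commits))


# About 13,000 commits lie between v5.4 and v5.5-rc1.
print(bisect_rounds(13_000))  # 14
```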

| Kernel version | Good | Reports good / bad |
|---|---|---|
| v5.4 | yes | 6 / 0 |
| v5.4-2622-g386403a115f | yes | 5 / 0 |
| v5.4-2759-ga86f69d3349 | yes | 8 / 0 |
| v5.4-2809-ga25bbc2644f | yes | 4 / 0 |
| v5.4-2816-gcd4771f7709 | yes | 5 / 0 |
| v5.4-2819-g64d6a12094f | no | 0 / 3 |
| v5.4-2824-g24ee25a6da8 | no | 1 / 4 |
| v5.4-2841-gda42761df5c | no | 1 / 4 |
| v5.4-2929-g1d87200446f | no | 0 / 4 |
| v5.4-3127-g77a05940eee | no | 0 / 2 |
| v5.4-3434-g3f1b210a7f9 | no | 0 / 2 |
| v5.4-4535-g9a3d7fd275b | no | 0 / 2 |
| v5.5-rc1 | no | 0 / 2 |
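As a rough illustration of how the report counts narrow down the culprit, the sketch below walks the table in commit order and returns the last good and first bad version. The majority rule (good reports must outnumber bad ones) and the helper names are assumptions made for this example, not necessarily the rule used to mark versions in the table.

```python
# Report counts copied from the table above: (kernel version, good, bad).
REPORTS = [
    ("v5.4", 6, 0),
    ("v5.4-2622-g386403a115f", 5, 0),
    ("v5.4-2759-ga86f69d3349", 8, 0),
    ("v5.4-2809-ga25bbc2644f", 4, 0),
    ("v5.4-2816-gcd4771f7709", 5, 0),
    ("v5.4-2819-g64d6a12094f", 0, 3),
    ("v5.4-2824-g24ee25a6da8", 1, 4),
    ("v5.4-2841-gda42761df5c", 1, 4),
    ("v5.4-2929-g1d87200446f", 0, 4),
    ("v5.4-3127-g77a05940eee", 0, 2),
    ("v5.4-3434-g3f1b210a7f9", 0, 2),
    ("v5.4-4535-g9a3d7fd275b", 0, 2),
    ("v5.5-rc1", 0, 2),
]


def find_boundary(reports):
    """Return (last good version, first bad version) in commit order.

    Assumption: a version counts as good when good reports outnumber bad ones.
    """
    last_good = None
    for version, good, bad in reports:
        if good > bad:
            last_good = version
        else:
            return last_good, version
    return last_good, None


def abbreviated_commit(version):
    """Extract the abbreviated hash from a `git describe`-style name, if present."""
    _, sep, commit = version.rpartition("-g")
    return commit if sep else None


last_good, first_bad = find_boundary(REPORTS)
print(last_good, first_bad, abbreviated_commit(first_bad))
```

The version names follow the `git describe` pattern (tag, number of commits since the tag, `g` plus abbreviated hash), so the hash of the first bad version can be read directly from its name.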

How you can help

I will wait for a reasonable number of reports for each version, so even if somebody else reported a working or non-working version before, please do report your experience as well.

How you cannot help

We're not looking for any workarounds or environment information related to the issue here. I am not a Microsoft developer, so I am not debugging the issue or collecting any information to help solve it. If you want to share any information of this kind, please do so in one of the upstream issues.

Thanks a lot for your help in advance. 💚


Update

We have found the kernel commit introducing the issue:

Merge commit: microsoft/WSL2-Linux-Kernel@64d6a12094f3

Atomic commit: microsoft/WSL2-Linux-Kernel@dce7cd62754b5

From here on I will try to build more recent kernel versions with the commit reverted. Feel free to use these and report your experience.

| Kernel version | Good | Reports good / bad | Notes |
|---|---|---|---|
| v5.5-rc1-1-g0622e5f6a3 | yes | 3 / 0 | v5.5-rc1 with 64d6a12094f3 reverted |
| v5.5-rc1-2-g0265cf1764 | no | 0 / 3 | v5.5-rc1-1-g0622e5f6a3 with dce7cd62754b5 cherry-picked |
| linux-msft-wsl-5.10.102.2 | yes | 3 / 0 | linux-msft-wsl-5.10.102.1 with dce7cd62754b5 reverted |
| linux-msft-wsl-5.15.153.2 | yes | 6 / 0 | linux-msft-wsl-5.15.153.1 with dce7cd62754b5 reverted |

unwiredben commented 9 months ago

Installed v5.4-4535-g9a3d7fd275b on my laptop this morning and hibernated it while traveling to my office. After about an hour of use after coming out of hibernation, I hit the unresponsive/high-CPU-usage issue and needed to kill the WSL service to recover.

carlfriedrich commented 9 months ago

@unwiredben Thanks a lot for testing it out, that is really helpful! Can you also check if v5.4 is working for you?

unwiredben commented 9 months ago

> @unwiredben Thanks a lot for testing it out, that is really helpful! Can you also check if v5.4 is working for you?

I just switched over to 5.4 and will report back in a few days unless I see it hang first.

onereal7 commented 9 months ago

@carlfriedrich, nice setup you have here! :) I wish I could contribute more now, but a Win10 update a month ago broke my hibernation entirely, so it now almost always acts as a regular shutdown.

carlfriedrich commented 9 months ago

> @carlfriedrich, nice setup you have here! :) I wish I could contribute more now, but a Win10 update a month ago broke my hibernation entirely, so it now almost always acts as a regular shutdown.

Well, then one might say the update fixed the issue for you. 😋

unwiredben commented 9 months ago

So far, no hangs with 5.4 across three hibernate cycles.

mannfuri commented 9 months ago

Which CPU do you guys use, AMD or Intel? The WSL kernel versions on my computers at work and at home are the same; both run the latest official version of WSL. The computer at work has not had the 100% CPU issue for a long time, but the computer at home still frequently encounters this problem. The computer at work uses an Intel CPU, while the one at home uses an AMD CPU.

mannfuri commented 9 months ago

Switched to 5.4 today. I will come back to provide feedback in a while.

carlfriedrich commented 9 months ago

> Which CPU do you guys use, AMD or Intel? The WSL kernel versions on my computers at work and at home are the same; both run the latest official version of WSL. The computer at work has not had the 100% CPU issue for a long time, but the computer at home still frequently encounters this problem. The computer at work uses an Intel CPU, while the one at home uses an AMD CPU.

@mannfuri Thanks for your feedback. That's quite interesting, actually. I am on Intel on both my work and my home machine, and I get the issue on both, so AMD vs. Intel does not seem to determine whether the issue appears. I remember someone reporting in the upstream issue that they also get the issue on ARM. There must be some component, though, which makes a difference. According to the comments from Microsoft in the upstream issue, they weren't able to reproduce the issue in any of their environments. That's why we, the affected users, are trying to find the bad kernel commit here. We hope that this gives Microsoft a hint about where to look, and maybe we will also find out why it happens only on some machines. Hence I very much appreciate that you are joining our testing. Thanks a lot!

tobyvinnell commented 9 months ago

I've just had the usual hang with the current 5.15 kernel version today. I'm keen to help with this effort and have switched to 5.4.0 just now. I'll give that a few days before moving on to v5.4-4535.

carlfriedrich commented 9 months ago

@tobyvinnell Great, thanks a lot for your help!

unwiredben commented 9 months ago

Still no freezing with 5.4. Just to add to the platform discussion, I'm using a Dell Latitude 7430 with an Intel i7-1270P.

aquohn commented 9 months ago

For kernel v5.4-4535-g9a3d7fd275b, I get the following error message when trying to start WSL:

A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
Error code: Wsl/Service/0x8007274c
Press any key to continue...

Anyone else facing the same issue?

carlfriedrich commented 9 months ago

@aquohn I might have seen something similar, but in my case it worked the second time I tried to start WSL. Is this reproducible for you?

carlfriedrich commented 9 months ago

@aquohn Just checked again: Yes, I get the same message, but calling wsl a second time works for me. Seems like WSL needs more time to boot with this kernel version.

carlfriedrich commented 9 months ago

FYI: I have been running v5.4 for over a week now on both my work and home machine without any hangs, and since nobody else has reported a hang so far, I am marking it as "good" in the issue description. I also added a column with the number of good/bad reports for each kernel version, to keep track of how much feedback each decision is based on. So please keep reporting your experiences, even if we have already marked a version as "good" or "bad".

I will switch to v5.4-4535-g9a3d7fd275b now.

mungojam commented 9 months ago

This is like the Higgs boson search 🙂

carlfriedrich commented 9 months ago

@mungojam I am quite optimistic that we will need less than 40 years for this. :-)

seebeen commented 9 months ago

Hello. I've been tracking the interrupt storm issue for a while now. Due to some unrelated stuff, I needed to reinstall my distro and do a complete setup from scratch. Since I needed full systemd support for proper LVM mounting on boot, I installed the XanMod kernel: 5+ days with no hangs or CPU usage issues.

Would any of you be willing to give it a test run for a couple of days?
I think it would be relevant to see whether I hit some weird perfect storm of settings which doesn't cause the issue, or whether this kernel is stable :)

carlfriedrich commented 9 months ago

@seebeen Interesting project, haven't heard of that before. We're trying to bisect to a certain commit here, though, so while trying some other kernel images might be interesting in general, it will not help with the progress of this work.

onereal7 commented 9 months ago

Hi @carlfriedrich, sometimes hibernate does work for me. The last time, after a successful return from hibernation, WSL with the v5.4-4535-g9a3d7fd275b kernel hung.

mungojam commented 9 months ago

Somebody in the other thread observed that Windows sometimes seems to start in an immune state and other times not, so make sure you are restarting as well as hibernating when testing a given version.

Sorry I haven't got the space to help with this search.

onereal7 commented 9 months ago

That was me, haha :) Actually, more thorough testing (with restarts) is needed precisely when WSL is not hanging and seems to be working well, as there were many reports (including my own) where it first seemed that some update, kernel version, or something else had solved the issue, and it later turned out that it had not. But of course, overall, more testing is better.

aquohn commented 9 months ago

@carlfriedrich Unfortunately even with four consecutive wsl commands, the kernel is still not able to boot, and I get the same error message. My .wslconfig is empty except for the

[wsl2]
kernel=...

lines. However, with the 5.4 kernel, I boot on the second wsl command.
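For readers setting up a test kernel, here is a minimal sketch that renders such a `.wslconfig`. The kernel path and the function name are hypothetical; the doubled backslashes reflect the escaping that the WSL documentation describes for path values, which is worth verifying against your WSL version.

```python
from pathlib import PureWindowsPath


def render_wslconfig(kernel_path: str) -> str:
    """Render a minimal .wslconfig pointing WSL 2 at a custom kernel image.

    Backslashes are doubled because path values in .wslconfig are expected
    to use escaped backslashes.
    """
    escaped = str(PureWindowsPath(kernel_path)).replace("\\", "\\\\")
    return f"[wsl2]\nkernel={escaped}\n"


# Hypothetical location of a downloaded kernel image.
print(render_wslconfig(r"C:\wsl-kernels\bzImage-5.4.0"))
```

The file normally lives in the Windows user profile directory, and WSL typically has to be stopped with `wsl --shutdown` before a changed kernel setting takes effect.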

carlfriedrich commented 9 months ago

@onereal7 Thanks for your feedback!

We now have two reports of the issue occurring with v5.4-4535-g9a3d7fd275b, so I marked it "bad" in the issue description and continued the bisection.

The next test candidate is v5.4-2622-g386403a115f. I just switched to this version and will test it over the next few days.

@aquohn Can you check if your boot issue also appears with this version? I encountered it again as well, 2-3 times, when trying to boot the new candidate, but on the next try it worked. I don't know why this happens, though.

@everyone: please keep reporting your experiences with all previous versions as well. The more data we have, the better.

unwiredben commented 9 months ago

I'll switch to that shortly. I never had any suspend issue with 5.4.0, but I did have a problem using Docker Desktop with it because /proc/sys/vm/compaction_proactiveness was missing on that build. I will check when that setting was enabled for current WSL kernels.

mannfuri commented 9 months ago

I can't start v5.4-4535-g9a3d7fd275b either. v5.4 has been working fine for a few days. I am now switching to v5.4-2622-g386403a115f, which starts instantly.

carlfriedrich commented 9 months ago

So in the upstream issue microsoft/WSL#6982 @MrSnoozles reported that he had the kernel hang with v5.4. If that is true, we might be searching for the issue on the wrong branch. I will stay on v5.4 on my work PC for a few weeks in order to verify this.

skoelden commented 9 months ago

I have tested the 5.4.0 kernel for about a week now and I have not seen any freeze.

$ uname --kernel-release
5.4.0-microsoft-standard-WSL2

onereal7 commented 9 months ago

@carlfriedrich, so far so good with v5.4-2622-g386403a115f: no hangs after multiple hibernations with a few restarts in between.

bogdan-radocea commented 9 months ago

> Hello. I've been tracking the interrupt storm issue for a while now. Due to some unrelated stuff, I needed to reinstall my distro and do a complete setup from scratch. Since I needed full systemd support for proper LVM mounting on boot, I installed the XanMod kernel: 5+ days with no hangs or CPU usage issues.
>
> Would any of you be willing to give it a test run for a couple of days? I think it would be relevant to see whether I hit some weird perfect storm of settings which doesn't cause the issue, or whether this kernel is stable :)

Hi. I installed it and will report back if the hang occurs again.

tobyvinnell commented 9 months ago

> For kernel v5.4-4535-g9a3d7fd275b, I get the following error message when trying to start WSL:
>
> A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
> Error code: Wsl/Service/0x8007274c
> Press any key to continue...
>
> Anyone else facing the same issue?

I had this issue too with all the kernels, but I've found that renaming them to a shorter name like bzImage-5.4.0-04535 fixed it. That might just be coincidence, though, and it may simply take a few attempts to get it to work.

I've had no hangs on 5.4.0 for two weeks and would normally have had several by now. I've now just switched to 5.4.0-microsoft-standard-WSL2-02620-g386403a115f

anodynos commented 9 months ago

FYI, I have recently upgraded my Intel i7 10700F to i9 11900KF (on the same rig having 64GB RAM on a Gigabyte Z490 Aorus Elite AC Mobo) and the problem started as soon as I hibernated with the new CPU.

I was using hibernate on the 10700 for ~6 months daily, and I never had the Vmmem 100% CPU issue before. The problems clearly started with the 11900. Note that I have a Noctua D15 with twin fans and I rarely see temps over 70-75 °C under full load. Also, every other stress test I tried succeeded.

I'm on kernel version 5.15.133.1-1 and I just updated WSL from version 2.0.9.0 to 2.0.14.0. I will test some more on that and then swap back to the 10700F and report back if I reach any conclusions.

UPDATE:

onereal7 commented 9 months ago

> Which CPU do you guys use, AMD or Intel?

@mannfuri, I also use AMD (Ryzen 7 1700, desktop PC).

> FYI, I have recently upgraded my Intel i7 10700F to i9 11900KF (on the same rig having 64GB RAM on a Gigabyte Z490 Aorus Elite Mobo) and the problem started as soon as I hibernated with the new CPU.

@anodynos, yeah, might be something related to hardware. Did the hangs start after the WSL update to 2.0.14.0, or also before it (but after upgrading the CPU)?

elkrammer commented 9 months ago

Thanks so much for leading this effort!! ❤️

Are we sure that a kernel commit is to blame for this problem? Last promising response I saw on the topic from @kelleymh seemed to suggest that this could be a bug in Hyper-V https://github.com/microsoft/WSL/issues/6982#issuecomment-1727922357

anodynos commented 8 months ago

> @anodynos, yeah, might be something related to hardware. Did the hangs start after the WSL update to 2.0.14.0, or also before it (but after upgrading the CPU)?

Hey @onereal7 - I updated the original comment, copying here:

I will update if I have more info, now I'm going to return the 11900KF cause it's driving me nuts (shame really, it was the best easy upgrade I could do in my rig).

carlfriedrich commented 8 months ago

> Are we sure that a kernel commit is to blame for this problem? Last promising response I saw on the topic from @kelleymh seemed to suggest that this could be a bug in Hyper-V microsoft/WSL#6982 (comment)

@elkrammer No, we are not sure, that's why it says "Trying..." in the issue title. There are lots of reported user experiences which make this appear quite likely, though. All I know from the comments (and I have read EVERY comment in the upstream issue) is that replacing the kernel with an older version is the ONLY thing that consistently fixed the issue for several people.

carlfriedrich commented 8 months ago

> FYI, I have recently upgraded my Intel i7 10700F to i9 11900KF (on the same rig having 64GB RAM on a Gigabyte Z490 Aorus Elite AC Mobo) and the problem started as soon as I hibernated with the new CPU.

@anodynos This is definitely interesting. It is not relevant for this issue's progress, though, so I would like everyone to post this kind of news in the upstream issue in order to stay focused here. I am not a Microsoft developer, so I am not debugging or collecting any information related to the issue. We're just trying to find the kernel commit which introduced the issue, which can be ONE hint to the problem. For all other hints, please use the upstream issue's comment section. I will put this note into the issue description as well. Thanks for your understanding.

MrSnoozles commented 8 months ago

> So in the upstream issue microsoft/WSL#6982 @MrSnoozles reported that he had the kernel hang with v5.4. If that is true, we might be searching for the issue on the wrong branch. I will stay on v5.4 on my work PC for a few weeks in order to verify this.

@carlfriedrich Swap was disabled in my WSL-Ubuntu. Since I noticed that and enabled it again, I haven't had any problems with 5.4, so I think the 5.4 branch is fine.

bogdan-radocea commented 8 months ago

> > Hello. I've been tracking the interrupt storm issue for a while now. Due to some unrelated stuff, I needed to reinstall my distro and do a complete setup from scratch. Since I needed full systemd support for proper LVM mounting on boot, I installed the XanMod kernel: 5+ days with no hangs or CPU usage issues. Would any of you be willing to give it a test run for a couple of days? I think it would be relevant to see whether I hit some weird perfect storm of settings which doesn't cause the issue, or whether this kernel is stable :)

> Hi. I installed it and will report back if the hang occurs again.

The kernel did not help. WSL still locks up and uses 100% CPU. For me it was triggered when I removed the power adapter from my laptop and left it running on battery. I will try to reproduce it and get back to you with updates.

carlfriedrich commented 8 months ago

> @carlfriedrich Swap was disabled in my WSL-Ubuntu. Since I noticed that and enabled it again, I haven't had any problems with 5.4, so I think the 5.4 branch is fine.

@MrSnoozles Thanks for reporting back! I am removing your negative vote for v5.4 then.

anodynos commented 8 months ago

> > FYI, I have recently upgraded my Intel i7 10700F to i9 11900KF (on the same rig having 64GB RAM on a Gigabyte Z490 Aorus Elite AC Mobo) and the problem started as soon as I hibernated with the new CPU.

> @anodynos This is definitely interesting. It is not relevant for this issue's progress, though, so I would like everyone to post this kind of news in the upstream issue in order to stay focused here. I am not a Microsoft developer, so I am not debugging or collecting any information related to the issue. We're just trying to find the kernel commit which introduced the issue, which can be ONE hint to the problem. For all other hints, please use the upstream issue's comment section. I will put this note into the issue description as well. Thanks for your understanding.

UPDATE 2: It's been 2 days with many hibernation cycles and restarts under heavy workload, with no VMmem.exe issues on my old trusted 10700. It was the complete opposite with the 11900KF, despite it passing all other stress tests fine: it became unstable every time after hibernation. It's definitely something in the new CPU generation that triggers the bug!

I am on 5.15.133.1-1 and it works fine with WSL 2.0.14.0 & 2.0.9.0 (on the 10700F only)

carlfriedrich commented 8 months ago

@anodynos As I already kindly asked you yesterday, please stop posting any environmental information you are observing concerning the issue. It is not of any use here. We are using this issue to organize the kernel bisection and gather feedback about specific kernel images. Everything else should go into the upstream issue. Please move your comments over there.

skoelden commented 8 months ago

I've been using g386403a115f for about two weeks now and I have not seen the 100% CPU issue.

$ uname --kernel-release
5.4.0-microsoft-standard-WSL2-02620-g386403a115f

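When reporting, it can help to confirm which test kernel is actually running. A minimal sketch, assuming the release string ends in `-g<abbreviated hash>` as in the output above (the function name is hypothetical). Note that the commit count in the release string shown here (02620) differs from the one in the release name (2622), so the hash is the safer key for matching.

```python
import re
from typing import Optional


def commit_from_release(release: str) -> Optional[str]:
    """Return the abbreviated commit hash at the end of a kernel release string."""
    match = re.search(r"-g([0-9a-f]{7,40})$", release)
    return match.group(1) if match else None


print(commit_from_release("5.4.0-microsoft-standard-WSL2-02620-g386403a115f"))
print(commit_from_release("5.4.0-microsoft-standard-WSL2"))
```

Inside WSL, the same string that `uname --kernel-release` prints is also available from Python via `platform.release()`.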
alexeylark commented 8 months ago

I've been using this release for a week, and there weren't any moments with 100% CPU on VMmem or responsiveness issues. There was also one time when memory started to creep up to the 3 GB limit but then dropped back to the usual ~1 GB, which never happened on later (5.5+) versions. Using the version: 5.4.0-microsoft-standard-WSL2-02620-g386403a115f

carlfriedrich commented 8 months ago

@skoelden @alexeylark Thanks for reporting. I also have been running v5.4-2622-g386403a115f for two weeks without any hangs, so I am marking this version as "good" now.

I continued the bisection, next candidate is v5.4-3434-g3f1b210a7f9. Happy testing!

onereal7 commented 8 months ago

v5.4-3434-g3f1b210a7f9 hung the first day

benzman81 commented 8 months ago

My co-worker and I switched back to kernel 5.4.91 and we can confirm that the issue is gone.

onereal7 commented 8 months ago

> My co-worker and I switched back to kernel 5.4.91 and we can confirm that the issue is gone.

@benzman81, it's great to have more confirmation that we are on the right path, but that does not help with the purpose of this task. Could you please also contribute by testing one of the kernels provided here? The current candidate is v5.4-3434-g3f1b210a7f9.

benzman81 commented 8 months ago

@onereal7 didn't you already post that v5.4-3434-g3f1b210a7f9 hung? We need to change some config in the kernel for our case, so we cannot use prebuilt kernels.