carlfriedrich / wsl-kernel-build

Trying to find the kernel commit which makes WSL non-responsive #1

Open carlfriedrich opened 9 months ago

carlfriedrich commented 9 months ago

We're trying to find the kernel commit which makes WSL non-responsive after hibernation, which is described in the issues microsoft/WSL#8696 and microsoft/WSL#6982.

Our starting point

Bisecting the kernel

We have about 13,000 commits between v5.4 and v5.5-rc1. Using git bisect we should be able to track down the commit introducing the issue within 14 rounds. As a start, I have built the start and end versions and one in between. I will update this table as soon as the versions are confirmed to be working or non-working and add new versions as I continue the bisection. The links in the table lead to the release page for the corresponding version where you can download the kernel image.
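As a rough check of the 14-round estimate: each round halves the remaining range, so the number of rounds is about the base-2 logarithm of the commit count, rounded up. A minimal sketch, assuming the approximate figure of 13,000 commits from above (the function name `bisect_rounds` is only for illustration, and on a history with merges git bisect may need a round more or fewer):

```python
import math


def bisect_rounds(commits: int) -> int:
    """Rounds needed to narrow `commits` candidates down to one by halving."""
    return math.ceil(math.log2(commits))


print(bisect_rounds(13000))  # 14
```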

| Kernel version | Good | Reports good / bad |
| --- | --- | --- |
| v5.4 | | 6 / 0 |
| v5.4-2622-g386403a115f | | 5 / 0 |
| v5.4-2759-ga86f69d3349 | | 8 / 0 |
| v5.4-2809-ga25bbc2644f | | 4 / 0 |
| v5.4-2816-gcd4771f7709 | | 5 / 0 |
| v5.4-2819-g64d6a12094f | | 0 / 3 |
| v5.4-2824-g24ee25a6da8 | | 1 / 4 |
| v5.4-2841-gda42761df5c | | 1 / 4 |
| v5.4-2929-g1d87200446f | | 0 / 4 |
| v5.4-3127-g77a05940eee | | 0 / 2 |
| v5.4-3434-g3f1b210a7f9 | | 0 / 2 |
| v5.4-4535-g9a3d7fd275b | | 0 / 2 |
| v5.5-rc1 | | 0 / 2 |

How you can help

I will wait for a reasonable number of reports for each version, so even if somebody else reported a working or non-working version before, please do report your experience as well.

How you cannot help

We're not looking for any workarounds or environment information related to the issue here. I am not a Microsoft developer, so I am not debugging the issue or collecting any information to help solving it. If you want to share any information of this kind, please do so in one of the upstream issues.

Thanks a lot for your help in advance. 💚


Update

We have found the kernel commit introducing the issue:

Merge commit: microsoft/WSL2-Linux-Kernel@64d6a12094f3

Atomic commit: microsoft/WSL2-Linux-Kernel@dce7cd62754b5

From here on I will try to build more recent kernel versions with the commit reverted. Feel free to use these and report your experience.

| Kernel version | Good | Reports good / bad | Notes |
| --- | --- | --- | --- |
| v5.5-rc1-1-g0622e5f6a3 | | 3 / 0 | v5.5-rc1 with 64d6a12094f3 reverted |
| v5.5-rc1-2-g0265cf1764 | | 0 / 3 | v5.5-rc1-1-g0622e5f6a3 with dce7cd62754b5 cherry-picked |
| linux-msft-wsl-5.10.102.2 | | 3 / 0 | linux-msft-wsl-5.10.102.1 with dce7cd62754b5 reverted |
| linux-msft-wsl-5.15.153.2 | | 6 / 0 | linux-msft-wsl-5.15.153.1 with dce7cd62754b5 reverted |

onereal7 commented 8 months ago

> We need to change some config in the kernel for our case, so we cannot use prebuilt kernels.

Understood.

> Didn't you already post that v5.4-3434-g3f1b210a7f9 already hung?

Yes, I partly agree with you, but that's just a test of sample size 1, and what the history of this issue has taught us is that more testing is needed to be sure. However, in my opinion, more thorough testing is needed when WSL does not hang.

benzman81 commented 8 months ago

> > Didn't you already post that v5.4-3434-g3f1b210a7f9 already hung?
>
> Yes, I partly agree with you, but that's just a test of sample size 1, and what the history of this issue has taught us is that more testing is needed to be sure. However, in my opinion, more thorough testing is needed when WSL does not hang.

For me, sample size 1 is OK for the case when WSL does hang. This already shows that the issue exists. The case where WSL does NOT hang needs to be tested over some time. That's why I waited to confirm Kernel 5.4.91. It's almost the fourth week without an issue, and my co-worker is now starting his second week.

AntoineMurat commented 8 months ago

bzImage-5.4.0-microsoft-standard-WSL2-03434-g3f1b210a7f9 hung within 3 minutes.

Also, I don't think that a standard bisection is the optimal approach: it takes ~two weeks to flag a version as good, but only ~two days to flag one as bad. Instead of splitting the search space 50/50, we could go 80/20.

With the current 50/50 strategy, 2 weeks of bisecting can reduce the search space to anything between 0.8% (~0.5^7, assuming 7 consecutive bad releases) and 50% (assuming a good release). With the alternative 80/20 strategy, 2 weeks of bisecting always reduces the search space to 20-21% (0.8^7 ~ 0.21 for 7 consecutive bad releases, 0.2 for one good release). This way, we would greatly reduce the variance of our ETA.
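The percentages above can be reproduced with a few lines. This is only a sketch of the arithmetic, assuming the same rough costs as in the text (2 days per "bad" verdict, 14 days per "good" verdict) and a 14-day budget; the names are illustrative:

```python
BAD_DAYS = 2    # assumed days to flag a version as bad
GOOD_DAYS = 14  # assumed days to flag a version as good
BUDGET = 14     # days of bisecting we look at


def remaining_after_budget(p):
    """Fraction of the search space left after BUDGET days for split p.

    Returns (all_bad, one_good): the fraction left if every test in the
    budget comes back bad, and the fraction left if the single test that
    fits into the budget comes back good.
    """
    all_bad = p ** (BUDGET // BAD_DAYS)
    one_good = (1 - p) ** (BUDGET // GOOD_DAYS)
    return all_bad, one_good


for p in (0.5, 0.8):
    all_bad, one_good = remaining_after_budget(p)
    print(f"split {p:.0%}: all bad -> {all_bad:.1%}, one good -> {one_good:.1%}")
```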

Also, this approach would lower our ETA. I don't know how to prove it mathematically, but I ran a simulation:

from random import randint

import matplotlib.pyplot as plt
import numpy as np

GOOD_DAYS = 14  # days needed to confirm a version as good
BAD_DAYS = 2  # days needed to see a hang and flag a version as bad
N_RUNS = 1_000_000  # simulated bisections per split; lower this for a quicker, noisier plot


def run(commits, p):
    """Simulate one bisection and return the number of days it takes."""
    bad_commit = randint(1, commits - 1)  # first bad commit, chosen uniformly
    left = 0  # always good
    right = commits - 1  # always bad
    days = 0
    while left + 1 != right:
        # Test the commit a fraction p of the way into the untested range.
        pick = left + 1 + int(p * (right - left - 1))
        if pick >= bad_commit:
            days += BAD_DAYS
            right = pick
        else:
            days += GOOD_DAYS
            left = pick
    return days


def mc_eta(commits, p):
    """Monte Carlo estimate: mean and standard deviation of the duration."""
    runs = [run(commits, p) for _ in range(N_RUNS)]
    return np.mean(runs), np.std(runs)


x = list(range(1, 100))  # split position in percent
results = [mc_eta(13000, p / 100) for p in x]
eta = [r[0] for r in results]
std = [r[1] for r in results]

plt.plot(x, eta, label='eta')
plt.plot(x, std, label='std')
plt.axis([0, 100, 0, 250])
plt.grid()

# Mark the 50/50 split and the split with the lowest mean duration.
plt.hlines(eta[49], 0, 100)
plt.text(0, eta[49], f'eta(50)={eta[49]:.2f}', ha='right', va='center', fontsize=7)
plt.hlines(np.min(eta), 0, 100)
plt.text(0, np.min(eta), f'eta({np.argmin(eta)+1})={np.min(eta):.2f}', ha='right', va='center', fontsize=7)

# Same markers for the standard deviation.
plt.hlines(std[49], 0, 100, color='orange')
plt.text(0, std[49], f'std(50)={std[49]:.2f}', ha='right', va='center', fontsize=7)
plt.hlines(np.min(std), 0, 100, color='orange')
plt.text(0, np.min(std), f'std({np.argmin(std)+1})={np.min(std):.2f}', ha='right', va='center', fontsize=7)

plt.legend()
plt.xlabel("Search space split")
plt.savefig('plot.png')

[Plot: 'eta' and 'std' (in days) against the search space split]

Adopting an 80/20 split would save 20+ days for 13,000 commits (I don't know how many we have left).
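The expected duration can also be computed exactly rather than sampled. This is not a proof, only the same model evaluated with a recurrence: with d candidates left for the first bad commit and the pick covering k of them, the test comes back bad with probability k/d (leaving k candidates) and good otherwise (leaving d - k). A minimal sketch, assuming the same 2-day and 14-day costs and a uniformly distributed bad commit; `expected_days` is a name introduced here for illustration:

```python
def expected_days(commits, p, bad_days=2, good_days=14):
    """Exact expected bisection time for a fixed split p.

    Mirrors the simulation model: commit 0 is good, commit `commits - 1`
    is bad, and the first bad commit is uniform over 1 .. commits - 1.
    """
    n = commits - 1  # number of candidates for the first bad commit
    e = [0.0] * (n + 1)  # e[d]: expected days with d candidates left
    for d in range(2, n + 1):
        k = 1 + int(p * (d - 1))  # candidates at or before the pick
        e[d] = (k / d) * (bad_days + e[k]) + ((d - k) / d) * (good_days + e[d - k])
    return e[n]


for split in (0.5, 0.8):
    print(f"split {split:.0%}: {expected_days(13000, split):.1f} days expected")
```

Because it is a single pass over the candidate counts, this runs in well under a second and can be used to scan all splits from 1% to 99%.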

chrispaton commented 8 months ago

Big thanks for co-ordinating this effort @carlfriedrich. This issue has annoyed me for at least a year and today was the day I decided that enough was enough.

I tested bzImage-5.4.0-microsoft-standard-WSL2-03434-g3f1b210a7f9 and couldn't get it to start at all even after multiple attempts:

A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. Error code: Wsl/Service/0x8007274c

I successfully got bzImage-5.4.0-microsoft-standard-WSL2-02620-g386403a115f running so I think I'm editing the config correctly etc. Will keep subscribed for future versions to try.

carlfriedrich commented 8 months ago

@AntoineMurat Thanks a lot for your report and the suggestion about reducing the ETA. That's quite interesting, indeed. I did not find an option for git bisect to set the ratio, though; it seems to always work with 50/50. I am not willing to select commits by hand, as coordinating this issue is already more than enough work for my spare time. So I will stick to using git bisect unless you can show me a convenient way that takes no more effort than calling git bisect good or git bisect bad for each test candidate.
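For reference, choosing a candidate at a fixed ratio is a one-liner once the remaining commits are available as a list. The sketch below is hypothetical and not part of git: it assumes the list holds the commits after the last known-good one up to and including the known-bad one, ordered oldest to newest (for example from `git rev-list --reverse <good>..<bad>`). On a history with merges that ordering only approximates what git bisect does, and the chosen commit would still have to be checked out by hand before marking it good or bad.

```python
def pick_candidate(commits, ratio):
    """Return the commit to test next for a given split ratio.

    `commits` is ordered oldest to newest and ends with the known-bad
    commit, so only the entries before the last one are worth testing.
    """
    if len(commits) < 2:
        raise ValueError("nothing left to test")
    index = int(ratio * (len(commits) - 1))  # always below the last index
    return commits[index]


# Ten made-up commit names, "c10" being the known-bad end of the range.
remaining = [f"c{i}" for i in range(1, 11)]
print(pick_candidate(remaining, 0.5))  # c5
print(pick_candidate(remaining, 0.8))  # c8
```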

carlfriedrich commented 8 months ago

We have two "hang" reports for v5.4-3434-g3f1b210a7f9, so I am marking it "bad".

Next up is v5.4-3127-g77a05940eee.

Thanks everybody participating in testing the images! So great to see this becoming a community effort! 🤩

AntoineMurat commented 8 months ago

5.4.0-microsoft-standard-WSL2-03127-g77a05940eee is a bad commit, it hung after 2 hibernation cycles (seems like going 50/50 was a good bet this time ;) ) EDIT: I also wanted to thank you for coordinating this effort. :)

tobyvinnell commented 8 months ago

I just wanted to report back another good result for v5.4-2622-g386403a115f. Two weeks without a hang. I'll move to 5.4.0-microsoft-standard-WSL2-03127-g77a05940eee

carlfriedrich commented 8 months ago

I also encountered a hang right after the first hibernation with v5.4-3127-g77a05940eee, so I just pushed a new version. Will post the link here tomorrow, feel free to already get it from the releases page in ~30 minutes.

carlfriedrich commented 8 months ago

So next candidate is v5.4-2929-g1d87200446f. Happy testing!

AntoineMurat commented 8 months ago

5.4.0-microsoft-standard-WSL2-02929-g1d87200446f just hung on my side :)

bogdan-radocea commented 7 months ago

> So next candidate is v5.4-2929-g1d87200446f. Happy testing!

Installed and testing now

uqiu commented 7 months ago

Sorry guys, I can't start WSL with v5.4-2929-g1d87200446f.

bogdan-radocea commented 7 months ago

> Sorry guys, I can't start WSL with v5.4-2929-g1d87200446f.

Yeah, I noticed that myself. It does start, just a bit slower. Also I needed to reboot after the first kernel usage with the new version, since it locked wsl and I had no idea if it was caused by the change or anything else. Be a bit patient and it should start and work normally.

bogdan-radocea commented 7 months ago

Ok, so bzImage-5.4.0-microsoft-standard-WSL2-02929-g1d87200446f just froze for me. Seems that this is the one that is causing problems?

carlfriedrich commented 7 months ago

Thanks for your feedback. So we have two "hang" reports for v5.4-2929-g1d87200446f, marking it "bad".

Next up for testing is v5.4-2759-ga86f69d3349.

unwiredben commented 7 months ago

I've tried WSL startup with v5.4-2759-ga86f69d3349 three times now and it hangs each time on my Windows 10 laptop, eventually giving a WSL service timeout error. I did try a device reboot across attempts, and that didn't help. If I revert my .wslconfig file back to bzImage-5.4.0-microsoft-standard-WSL2-02620-g386403a115f then reboot, I'm able to start right away.

A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
Error code: Wsl/Service/0x8007274c

michael-markl commented 7 months ago

@unwiredben I had the same problem, but after swapping back to bzImage-5.4.0-microsoft-standard-WSL2-02929-g1d87200446f and then again to bzImage-5.4.0-microsoft-standard-WSL2-02759-ga86f69d3349, I could get it running. I have no idea what was causing the problem.

mungojam commented 7 months ago

Is it worth jumping to a nearby commit that was an actual wsl release or pre release? An individual commit may well have some unrelated issue

unwiredben commented 7 months ago

@michael-markl you were right -- I tried one more time and this time WSL came up with the ga86f69d3349 build. Starting my testing now.

chrispaton commented 7 months ago

Just a further confirmation that bzImage-5.4.0-microsoft-standard-WSL2-02929-g1d87200446f hung on resume.

Tried bzImage-5.4.0-microsoft-standard-WSL2-02759-ga86f69d3349 and as others mentioned it failed with the "did not properly respond after a period of time" error a couple of times till I tried killing wslservice.exe and I eventually got it running. Will report back in due course.

carlfriedrich commented 7 months ago

> I've tried WSL startup with v5.4-2759-ga86f69d3349 three times now and it hangs each time on my Windows 10 laptop, eventually giving a WSL service timeout error. I did try a device reboot across attempts, and that didn't help. If I revert my .wslconfig file back to bzImage-5.4.0-microsoft-standard-WSL2-02620-g386403a115f then reboot, I'm able to start right away.
>
> A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
> Error code: Wsl/Service/0x8007274c

@unwiredben I had this several times with other versions as well, and other users also described it in previous comments here and here. Someone reported here that renaming the kernel file to a shorter name fixed it for him, but that might be coincidence; I haven't verified this. I actually don't know why and when this happens, but I always managed to get it running after a few attempts.

> Is it worth jumping to a nearby commit that was an actual wsl release or pre release? An individual commit may well have some unrelated issue

@mungojam There have been no releases between our starting points v5.4 and v5.5-rc1. We're not even on a WSL branch, those were upstream Linux kernel releases. So the issue above indeed might be caused by the fact that we're on mainline Linux here, without any WSL patches.

tobyvinnell commented 7 months ago

I got a hang on first resume with 5.4.0-microsoft-standard-WSL2-03127-g77a05940eee, so I switched to bzImage-5.4.0-microsoft-standard-WSL2-02929-g1d87200446f.
I've run this for a whole week before getting a hang just now. Unusually, though, I was able to restore it with "wsl --shutdown"; usually I have to do "taskkill /f /im wslservice.exe". So it doesn't seem as bad as the newer versions.

bogdan-radocea commented 7 months ago

> Just a further confirmation that bzImage-5.4.0-microsoft-standard-WSL2-02929-g1d87200446f hung on resume.
>
> Tried bzImage-5.4.0-microsoft-standard-WSL2-02759-ga86f69d3349 and as others mentioned it failed with the "did not properly respond after a period of time" error a couple of times till I tried killing wslservice.exe and I eventually got it running. Will report back in due course.

Hi. You don't need to kill WSL for this. Just open a new terminal tab with WSL (I use Terminal Preview) and it should work. Close the non-responsive tab and all other sessions will work just fine (until reboot). At least this works for me on 5.4.0-microsoft-standard-WSL2-02620-g386403a115f.

bogdan-radocea commented 7 months ago

> > I've tried WSL startup with v5.4-2759-ga86f69d3349 three times now and it hangs each time on my Windows 10 laptop, eventually giving a WSL service timeout error. I did try a device reboot across attempts, and that didn't help. If I revert my .wslconfig file back to bzImage-5.4.0-microsoft-standard-WSL2-02620-g386403a115f then reboot, I'm able to start right away.
> >
> > A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
> > Error code: Wsl/Service/0x8007274c
>
> @unwiredben I had this several times with other versions as well, and other users also described it in previous comments here and here. Someone reported here that renaming the kernel file to a shorter name fixed it for him, but that might be coincidence; I haven't verified this. I actually don't know why and when this happens, but I always managed to get it running after a few attempts.
>
> > Is it worth jumping to a nearby commit that was an actual wsl release or pre release? An individual commit may well have some unrelated issue
>
> @mungojam There have been no releases between our starting points v5.4 and v5.5-rc1. We're not even on a WSL branch, those were upstream Linux kernel releases. So the issue above indeed might be caused by the fact that we're on mainline Linux here, without any WSL patches.

The hang happens on the official wsl kernel as well, so it should not be caused by any wsl specific patches that are missing.

chrispaton commented 7 months ago

After the slow first start, 5.4.0-microsoft-standard-WSL2-02759-ga86f69d3349 has come back successfully from multiple hibernates for me - so certainly seems "good" to me.

But heads-up that I noticed that my WSL clock ended up skewed - it looks like this was a known issue with past kernels as per Clock skew issues megathread · Issue #10006 · microsoft/WSL. The easiest fix was running the following from PowerShell after each resume from hibernate:

wsl -u root date -s $(date)

tobyvinnell commented 7 months ago

Another hang on 5.4.0-microsoft-standard-WSL2-02929-g1d87200446f . I'll move to v5.4-2759-ga86f69d3349

onereal7 commented 7 months ago

So far so good with 5.4.0-microsoft-standard-WSL2-02759-ga86f69d3349 - two restarts with multiple hibernates for each - all good, no hang.

bogdan-radocea commented 7 months ago

> So far so good with 5.4.0-microsoft-standard-WSL2-02759-ga86f69d3349 - two restarts with multiple hibernates for each - all good, no hang.

Hi. Same for me. Multiple hibernates and reboots, windows updates, all good so far.

The only (seemingly unrelated) problem is still the long WSL startup times and terminal timeouts, but it does start just fine. I shortened the kernel file name, but it seems that it's still too long or the fix is not working (current name is: bzImage-ga86f69d3349).

carlfriedrich commented 7 months ago

Thanks for reporting everyone. I did not have any hangs with v5.4-2759-ga86f69d3349 either. Will wait another few days before moving on to see if any more reports come in.

unwiredben commented 7 months ago

I'm not having any hangs with v5.4-2759-ga86f69d3349 either, other than the occasional startup issues after a device reboot.

adn77 commented 7 months ago

No hangs either with v5.4-2759-ga86f69d3349. Only the clock is skewed.

mungojam commented 7 months ago

It's starting to get feasible for somebody who knows anything about kernels to look at the diff between working and non-working commits.

I'll keep this updated as the number of commits reduces, now down to 5 commits

https://github.com/microsoft/WSL2-Linux-Kernel/compare/cd4771f7709...64d6a12094f

carlfriedrich commented 7 months ago

Okay, I marked v5.4-2759-ga86f69d3349 good.

Next up is v5.4-2841-gda42761df5c.

And motivating news: we're over halfway through the bisection (8 versions verified, 7 more to check)! 🚀 Thanks everyone for your testing efforts.

onereal7 commented 7 months ago

v5.4-2841-gda42761df5c hangs for me

tobyvinnell commented 7 months ago

Another tick for WSL2-02759-ga86f69d3349 as no hangs for a couple of weeks. Moving to WSL2-02841-gda42761df5c

alexeylark commented 7 months ago

Plus one that v5.4-2759-ga86f69d3349 works fine after a couple of weeks. Sorry if I'm late to the party.

bogdan-radocea commented 7 months ago

> v5.4-2841-gda42761df5c hangs for me

For me it doesn't freeze. I have been running it for 5 days with multiple hibernates and reboots. Will keep it until we need to move on to the next one.

sachinholla commented 7 months ago

da42761df5c hung once for me.

Note: antivirus on my work laptop did not allow me to download v5.4-2841-gda42761df5c from here.. So I built it myself by following these instructions -- replaced git checkout v5.4 with git checkout da42761df5ce. But the kernel version is displayed as "5.4.0-microsoft-standard-WSL2-03688-gda42761df5ce". Hopefully this is fine.

tobyvinnell commented 7 months ago

I got a hang this morning on WSL2-02841-gda42761df5c that's after the first resume from hibernate.

joethesaint commented 7 months ago

Can I get a working fix even if it is versions behind so as to get back to using a working WSL?

bogdan-radocea commented 7 months ago

> Can I get a working fix even if it is versions behind so as to get back to using a working WSL?

You can check the top of this thread for a working kernel version. You can download and use it as per the instructions in the first post. Add the path to your kernel in c:\Users\<your_username>\.wslconfig. For example:

[wsl2]
kernel = d:\\.WSL\\bzImage-ga86f69d3349
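The same entry can also be written with a small script. This is only a sketch, assuming the file is plain INI as in the example above; `set_wsl_kernel` is a hypothetical helper, not a WSL tool, and rewriting the file with `configparser` drops any comments it contained. The backslashes are doubled on writing to match the example.

```python
import configparser


def set_wsl_kernel(config_path, kernel_path):
    """Set the [wsl2] kernel entry in a .wslconfig-style INI file.

    `kernel_path` is a normal Windows path with single backslashes.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read(config_path)  # a missing file is simply treated as empty
    if not parser.has_section("wsl2"):
        parser.add_section("wsl2")
    parser.set("wsl2", "kernel", kernel_path.replace("\\", "\\\\"))
    with open(config_path, "w") as handle:
        parser.write(handle)


# Example call (the path is the one from the comment above):
# set_wsl_kernel(r"c:\Users\<your_username>\.wslconfig", r"d:\.WSL\bzImage-ga86f69d3349")
```

After changing the file, WSL has to be restarted (for example with wsl --shutdown) before the new kernel is used.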

joethesaint commented 7 months ago

> > Can I get a working fix even if it is versions behind so as to get back to using a working WSL?
>
> You can check the top of this thread for a working kernel version. You can download and use it as per the instructions in the first post. Add the path to your kernel in c:\Users\<your_username>\.wslconfig. For example:
>
> [wsl2]
> kernel = d:\\.WSL\\bzImage-ga86f69d3349

There are a lot of comments in this thread already. Can you please tag me in it, or is it already pinned?

joethesaint commented 7 months ago

Please, how do I set up the version? Do I have to use Docker?

carlfriedrich commented 7 months ago

> da42761df5c hung once for me.
>
> Note: antivirus on my work laptop did not allow me to download v5.4-2841-gda42761df5c from here.. So I built it myself by following these instructions -- replaced git checkout v5.4 with git checkout da42761df5ce. But the kernel version is displayed as "5.4.0-microsoft-standard-WSL2-03688-gda42761df5ce". Hopefully this is fine.

@sachinholla For some reason git describe shows this version in my local tree as well. Not sure why. I am gonna count your vote, anyway. Thanks for your feedback!
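For comparing names like these, it helps to split them into their parts: git describe prints the nearest tag, the number of commits since that tag, and the abbreviated commit hash prefixed with "g". The commit is identified by the hash, while the count depends on which tag is found. A minimal sketch (both function names are made up for illustration; the zero-padded five-digit count is taken from the image names seen in this thread):

```python
import re

# <tag>-<commits since tag>-g<abbreviated hash>
DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(name):
    """Split a git-describe style name into (tag, count, sha)."""
    match = DESCRIBE_RE.match(name)
    if match is None:
        raise ValueError(f"not a git describe name: {name!r}")
    return match["tag"], int(match["count"]), match["sha"]


def image_suffix(name):
    """Return the '<count>-g<sha>' suffix as it appears in the image names."""
    _, count, sha = parse_describe(name)
    return f"{count:05d}-g{sha}"


print(parse_describe("v5.4-2841-gda42761df5c"))  # ('v5.4', 2841, 'da42761df5c')
print(image_suffix("v5.4-2841-gda42761df5c"))    # 02841-gda42761df5c
```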

carlfriedrich commented 7 months ago

> Can I get a working fix even if it is versions behind so as to get back to using a working WSL?

@joethesaint You can use any version with a ✅ listed in the issue description. The table has links to the release pages for each version where you can download the kernel image.

You'll find instructions how to use the image in the README.

carlfriedrich commented 7 months ago

Given three "bad" counts for v5.4-2841-gda42761df5c, I continued the bisection. Thanks for your feedback everyone.

Next up is v5.4-2809-ga25bbc2644f. Happy testing. 🤓

aquohn commented 6 months ago

I've been using v5.4 for over a month now with no problems, over many hibernates and several restarts.

chrispaton commented 6 months ago

Had a couple of successful resumes with 5.4.0-microsoft-standard-WSL2-02809-ga25bbc2644f, so you can consider that "good" from me unless I come back to say otherwise. This kernel also loses track of the date during hibernation, so I again had to use the trick of running wsl -u root date -s $(date) from PowerShell.

bjanders commented 6 months ago

I tried to use v5.4-2809-ga25bbc2644f, but it seems Docker Desktop is not compatible with it, and I need Docker Desktop. Do you know if there is any way to make Docker Desktop work with these builds?