Open carlfriedrich opened 9 months ago
We need to change some config in the kernel for our use case, so we cannot use prebuilt kernels.
Understood
Didn't you already post that v5.4-3434-g3f1b210a7f9 hung?
Yes, I partly agree with you, but that's just a test with a sample size of 1, and the history of this issue has taught us that more testing is needed to be sure. That said, in my opinion, the more thorough testing is needed when WSL does not hang.
For me, a sample size of 1 is OK for the case where WSL does hang: that already shows that the issue exists. The case where WSL does NOT hang needs to be tested over some time. That's why I waited to confirm kernel 5.4.91. It's almost the fourth week without an issue, and my co-worker is now starting his second week.
bzImage-5.4.0-microsoft-standard-WSL2-03434-g3f1b210a7f9 hung within 3 minutes.
Also, I don't think that a standard bisection is the optimal approach: it takes ~two weeks to flag a version as good, but only ~two days to flag one as bad. Instead of splitting the search space 50/50, we could go 80/20.
With the current 50/50 strategy, 2 weeks of bisecting can reduce the search space to anything between 0.8% (≈ 0.5^7, assuming 7 consecutive bad releases) and 50% (assuming a good release). With the alternative 80/20 strategy, 2 weeks of bisecting always reduce the search space to 20-21% (0.8^7 ≈ 0.21 for 7 bad releases, or 20% for a good one). This way, we would greatly reduce the variance of our ETA.
Also, this approach would lower our ETA. I don't know how to prove it mathematically, but I ran a simulation:
from random import randint
import numpy as np
import matplotlib.pyplot as plt


def run(commits, p):
    """Simulate one bisection and return the number of days it takes."""
    bad_commit = randint(1, commits - 1)  # first bad commit, drawn uniformly
    left = 0  # always good
    right = commits - 1  # always bad
    days = 0
    while left + 1 != right:
        # Test the commit a fraction p into the open interval (left, right).
        pick = left + 1 + int(p * (right - left - 1))
        if pick >= bad_commit:
            days += 2  # a bad version is flagged after ~2 days
            right = pick
        else:
            days += 14  # a good version needs ~2 weeks of testing
            left = pick
    return days


def mc_eta(commits, p):
    """Monte Carlo estimate of the mean and standard deviation of the duration."""
    runs = list(run(commits, p) for _ in range(1000000))
    return (np.mean(runs), np.std(runs))


# Evaluate every split from 1% to 99%.
x = list(range(1, 100))
runs = list(mc_eta(13000, p / 100) for p in x)
eta = list(r[0] for r in runs)
std = list(r[1] for r in runs)

plt.plot(x, eta, label='eta')
plt.plot(x, std, label='std')
plt.axis([0, 100, 0, 250])
plt.grid()
# Mark the 50/50 split (index 49) and the minimum of each curve.
plt.hlines(eta[49], 0, 100)
plt.text(0, eta[49], f'eta(50)={eta[49]:.2f}', ha='right', va='center', fontsize=7)
plt.hlines(np.min(eta), 0, 100)
plt.text(0, np.min(eta), f'eta({np.argmin(eta)+1})={np.min(eta):.2f}', ha='right', va='center', fontsize=7)
plt.hlines(std[49], 0, 100, color='orange')
plt.text(0, std[49], f'std(50)={std[49]:.2f}', ha='right', va='center', fontsize=7)
plt.hlines(np.min(std), 0, 100, color='orange')
plt.text(0, np.min(std), f'std({np.argmin(std)+1})={np.min(std):.2f}', ha='right', va='center', fontsize=7)
plt.legend()
plt.xlabel("Search space split")
plt.savefig('plot.png')
Adopting an 80/20 split would save 20+ days for 13,000 commits (I don't know how many we have left).
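As a cross-check on the simulation above, the expected duration can also be computed exactly with a short recurrence instead of Monte Carlo sampling. This is only a sketch under the same assumptions as the simulation (2 days to flag a bad version, 14 days to flag a good one, first bad commit uniformly distributed, fixed split ratio); the function name `expected_days` and its parameters are made up for illustration.

```python
def expected_days(width, p, bad_cost=2, good_cost=14):
    """Exact expected bisection time for a fixed split ratio p.

    `width` is the number of commits that could still be the first bad
    one (right - left in the simulation, i.e. commits - 1 at the start).
    Assumes the same cost model as the simulation: bad_cost days for a
    bad result, good_cost days for a good result.
    """
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    exp = [0.0] * (width + 1)  # exp[n] = expected days left with n candidates
    for n in range(2, width + 1):
        k = 1 + int(p * (n - 1))  # candidates kept if the tested commit is bad
        # With probability k/n the pick is bad (k candidates remain),
        # otherwise it is good (n - k candidates remain).
        exp[n] = (k * (bad_cost + exp[k])
                  + (n - k) * (good_cost + exp[n - k])) / n
    return exp[width]


# 13,000 commits in the simulation correspond to a width of 12,999.
for pct in (50, 80):
    print(f"split {pct}/{100 - pct}: {expected_days(12999, pct / 100):.1f} days")
```

Because the result is exact for this model, it avoids the million runs per ratio, but it inherits every simplification of the model (for example, it ignores several people testing in parallel).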
Big thanks for co-ordinating this effort @carlfriedrich. This issue has annoyed me for at least a year and today was the day I decided that enough was enough.
I tested bzImage-5.4.0-microsoft-standard-WSL2-03434-g3f1b210a7f9 and couldn't get it to start at all even after multiple attempts:
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. Error code: Wsl/Service/0x8007274c
I successfully got bzImage-5.4.0-microsoft-standard-WSL2-02620-g386403a115f running so I think I'm editing the config correctly etc. Will keep subscribed for future versions to try.
@AntoineMurat Thanks a lot for your report and the suggestion about reducing the ETA. That's quite interesting, indeed. I did not find an option for git bisect to set the ratio, though; it seems to always work with 50/50. I am not willing to select commits by hand: coordinating this issue is already more than enough work for my spare time. So I will stick to using git bisect unless you can show me a convenient way that takes no more effort than calling git bisect good or git bisect bad for each test candidate.
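For anyone who wants to try a skewed split by hand, the selection step itself is small. The sketch below is hypothetical and not part of the bisection used in this issue: it assumes you already have the remaining candidate commits as a plain list ordered oldest to newest and ending with the known-bad commit (for example from `git rev-list --reverse <good>..<bad>`), and it treats the history as linear, which the real kernel history with its merges is not. The name `pick_skewed` is made up.

```python
def pick_skewed(candidates, keep_if_bad=0.8):
    """Pick the next commit to test with a skewed split.

    `candidates` holds the commits after the last known-good one, oldest
    first, ending with the known-bad commit. If the picked commit turns
    out bad, roughly `keep_if_bad` of the candidates remain; if it is
    good, roughly the other part remains. Uses the same index formula as
    the simulation. Assumes a linear history (illustration only).
    """
    n = len(candidates)
    if n < 2:
        raise ValueError("need at least two candidates to bisect")
    k = 1 + int(keep_if_bad * (n - 1))  # always between 1 and n - 1
    return candidates[k - 1]


# Ten made-up commit ids; "c10" is the known-bad end of the range.
commits = [f"c{i}" for i in range(1, 11)]
print(pick_skewed(commits, 0.8))
print(pick_skewed(commits, 0.5))
```

The picked commit would then be checked out and marked with git bisect good or git bisect bad as usual; whether that is convenient enough compared to plain git bisect is a judgment call.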
We have two "hang" reports for v5.4-3434-g3f1b210a7f9, so I am marking it "bad".
Next up is v5.4-3127-g77a05940eee.
Thanks everybody participating in testing the images! So great to see this becoming a community effort! 🤩
5.4.0-microsoft-standard-WSL2-03127-g77a05940eee is a bad commit, it hung after 2 hibernation cycles (seems like going 50/50 was a good bet this time ;) ) EDIT: I also wanted to thank you for coordinating this effort. :)
I just wanted to report back another good result for v5.4-2622-g386403a115f. Two weeks without a hang. I'll move to 5.4.0-microsoft-standard-WSL2-03127-g77a05940eee
I also encountered a hang right after the first hibernation with v5.4-3127-g77a05940eee, so I just pushed a new version. Will post the link here tomorrow, feel free to already get it from the releases page in ~30 minutes.
So next candidate is v5.4-2929-g1d87200446f. Happy testing!
5.4.0-microsoft-standard-WSL2-02929-g1d87200446f just hung on my side :)
Installed and testing now
Sorry guys, I can't start WSL with v5.4-2929-g1d87200446f.
Yeah, I noticed that myself. It does start, just a bit slower. Also I needed to reboot after the first kernel usage with the new version, since it locked wsl and I had no idea if it was caused by the change or anything else. Be a bit patient and it should start and work normally.
Ok, so bzImage-5.4.0-microsoft-standard-WSL2-02929-g1d87200446f just froze for me. Seems that this is the one that is causing problems?
Thanks for your feedback. So we have two "hang" reports for v5.4-2929-g1d87200446f, marking it "bad".
Next up for testing is v5.4-2759-ga86f69d3349.
I've tried WSL startup with v5.4-2759-ga86f69d3349 three times now and it hangs each time on my Windows 10 laptop, eventually giving a WSL service timeout error. I did try a device reboot between attempts, and that didn't help. If I revert my .wslconfig file back to bzImage-5.4.0-microsoft-standard-WSL2-02620-g386403a115f and then reboot, I'm able to start right away.
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
Error code: Wsl/Service/0x8007274c
@unwiredben I had the same problem, but after swapping back to bzImage-5.4.0-microsoft-standard-WSL2-02929-g1d87200446f and then again to bzImage-5.4.0-microsoft-standard-WSL2-02759-ga86f69d3349, I could get it running. I have no idea what was causing the problem.
Is it worth jumping to a nearby commit that was an actual wsl release or pre release? An individual commit may well have some unrelated issue
@michael-markl you were right -- I tried one more time and this time WSL came up with the ga86f69d3349 build. Starting my testing now.
Just a further confirmation that bzImage-5.4.0-microsoft-standard-WSL2-02929-g1d87200446f hung on resume.
Tried bzImage-5.4.0-microsoft-standard-WSL2-02759-ga86f69d3349 and as others mentioned it failed with the "did not properly respond after a period of time" error a couple of times till I tried killing wslservice.exe and I eventually got it running. Will report back in due course.
@unwiredben I had this several times with other versions as well, and other users also described it in previous comments here and here. Someone reported here that renaming the kernel file to a shorter name fixed it for him, but that might be a coincidence; I haven't verified this. I actually don't know why and when this happens, but I always managed to get it running after a few attempts.
@mungojam There have been no releases between our starting points v5.4 and v5.5-rc1. We're not even on a WSL branch, those were upstream Linux kernel releases. So the issue above indeed might be caused by the fact that we're on mainline Linux here, without any WSL patches.
I got a hang on first resume with 5.4.0-microsoft-standard-WSL2-03127-g77a05940eee so I switched to bzImage-5.4.0-microsoft-standard-WSL2-02929-g1d87200446f
I've run this for a whole week before getting a hang just now. Unusually, though, I was able to restore it with "wsl --shutdown"; usually I have to do "taskkill /f /im wslservice.exe".
So it doesn't seem as bad as the newer versions.
Hi. You don't need to kill WSL for this. Just open a new terminal tab with WSL (I use Terminal Preview) and it should work. Close the non-responsive tab and all other sessions will work just fine (until reboot). At least this works for me on 5.4.0-microsoft-standard-WSL2-02620-g386403a115f.
The hang happens on the official wsl kernel as well, so it should not be caused by any wsl specific patches that are missing.
After the slow first start, 5.4.0-microsoft-standard-WSL2-02759-ga86f69d3349 has come back successfully from multiple hibernates for me - so certainly seems "good" to me.
But heads-up that I noticed that my WSL clock ended up skewed - it looks like this was a known issue with past kernels as per Clock skew issues megathread · Issue #10006 · microsoft/WSL. The easiest fix was running the following from PowerShell after each resume from hibernate:
wsl -u root date -s $(date)
Another hang on 5.4.0-microsoft-standard-WSL2-02929-g1d87200446f . I'll move to v5.4-2759-ga86f69d3349
So far so good with 5.4.0-microsoft-standard-WSL2-02759-ga86f69d3349 - two restarts with multiple hibernates for each - all good, no hang.
Hi. Same for me. Multiple hibernates and reboots, windows updates, all good so far.
The only (seemingly unrelated) problem is still the long WSL startup times and terminal timeouts, but it does start just fine. I shortened the kernel file name, but it seems that it's still too long or the fix is not working (current name is: bzImage-ga86f69d3349).
Thanks for reporting everyone. I did not have any hangs with v5.4-2759-ga86f69d3349 either. Will wait another few days before moving on to see if any more reports come in.
I'm not having any hangs with v5.4-2759-ga86f69d3349 either, other than the occasional startup issues after a device reboot.
No hangs either with v5.4-2759-ga86f69d3349. Only the clock is skewed.
It's starting to get feasible for somebody who knows anything about kernels to look at the diff between working and non-working commits.
I'll keep this updated as the number of commits reduces, now down to 5 commits
https://github.com/microsoft/WSL2-Linux-Kernel/compare/cd4771f7709...64d6a12094f
Okay, I marked v5.4-2759-ga86f69d3349 good.
Next up is v5.4-2841-gda42761df5c.
And motivating news: we're over halfway through the bisection (8 versions verified, 7 more to check)! 🚀 Thanks everyone for your testing efforts.
v5.4-2841-gda42761df5c hangs for me
Another tick for WSL2-02759-ga86f69d3349 as no hangs for a couple of weeks. Moving to WSL2-02841-gda42761df5c
Plus one that v5.4-2759-ga86f69d3349 works fine after a couple of weeks. Sorry if I'm late to the party.
v5.4-2841-gda42761df5c doesn't freeze for me. I have been running it for 5 days with multiple hibernates and reboots. Will keep it until we need to move on to the next one.
da42761df5c hung once for me.
Note: antivirus on my work laptop did not allow me to download v5.4-2841-gda42761df5c from here, so I built it myself by following these instructions, replacing git checkout v5.4 with git checkout da42761df5ce. But the kernel version is displayed as "5.4.0-microsoft-standard-WSL2-03688-gda42761df5ce". Hopefully this is fine.
I got a hang this morning on WSL2-02841-gda42761df5c; that was after the first resume from hibernate.
Can I get a working fix even if it is versions behind so as to get back to using a working WSL?
You can check the top of this thread for a working kernel version. You can download it and use it as per the instructions in the first post.
Add the path to your kernel in c:\Users\<your_username>\.wslconfig
For example:
[wsl2]
kernel = d:\\.WSL\\bzImage-ga86f69d3349
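If WSL does not pick up the custom kernel, one quick sanity check is to read the `kernel` entry back the way an INI parser sees it. The sketch below is only an illustration: it assumes `.wslconfig` can be read as a plain INI file with doubled backslashes in paths, as in the example above, and the helper name `kernel_path_from_wslconfig` is made up. It parses a string rather than the real file so it runs anywhere.

```python
import configparser
from pathlib import PureWindowsPath

# Same content as the example above (raw string keeps the doubled backslashes).
SAMPLE = r"""
[wsl2]
kernel = d:\\.WSL\\bzImage-ga86f69d3349
"""


def kernel_path_from_wslconfig(text):
    """Return the custom kernel path from .wslconfig text, or None if unset.

    Assumes INI syntax with escaped (doubled) backslashes in the path.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    raw = parser.get("wsl2", "kernel", fallback=None)
    if raw is None:
        return None
    # Undo the backslash escaping so the value is an ordinary Windows path.
    return PureWindowsPath(raw.replace("\\\\", "\\"))


path = kernel_path_from_wslconfig(SAMPLE)
print(path)       # the unescaped Windows path
print(path.name)  # the kernel image file name
```

On the Windows side, the same helper could be fed the contents of the real `.wslconfig` and the result checked for existence before restarting WSL.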
There are a lot of comments on this thread already; can you please tag me on it, or is it already pinned?
Please, how do I set up the version? Do I have to use Docker?
@sachinholla For some reason git describe shows this version in my local tree as well. Not sure why. I am gonna count your vote, anyway. Thanks for your feedback!
@joethesaint You can use any version with a ✅ listed in the issue description. The table has links to the release pages for each version where you can download the kernel image.
You'll find instructions on how to use the image in the README.
Given three "bad" counts for v5.4-2841-gda42761df5c, I continued the bisection. Thanks for your feedback everyone.
Next up is v5.4-2809-ga25bbc2644f. Happy testing. 🤓
I've been using v5.4 for over a month now with no problems, over many hibernates and several restarts.
Had a couple of successful resumes with 5.4.0-microsoft-standard-WSL2-02809-ga25bbc2644f so can consider that "good" from me unless I come back to say otherwise. This kernel also loses track of the date during hibernation, so I had to use the trick of running wsl -u root date -s $(date) from PowerShell again.
I tried to use v5.4-2809-ga25bbc2644f, but it seems Docker Desktop is not compatible with it, and I need Docker Desktop. Do you know if there is any way to make Docker Desktop work with these builds?
We're trying to find the kernel commit which makes WSL non-responsive after hibernation, which is described in the issues microsoft/WSL#8696 and microsoft/WSL#6982.
Our starting point
Bisecting the kernel
We have about 13,000 commits between v5.4 and v5.5-rc1. Using git bisect we should be able to track down the commit introducing the issue within 14 rounds. As a start, I have built the start and end versions and one in between. I will update this table as soon as the versions are confirmed to be working or non-working and add new versions as I continue the bisection. The links in the table lead to the release page for the corresponding version where you can download the kernel image.
How you can help
uname --kernel-release in your comment.
I will wait for a reasonable number of reports for each version, so even if somebody else reported a working or non-working version before, please do report your experience as well.
How you cannot help
We're not looking for any workarounds or environment information related to the issue here. I am not a Microsoft developer, so I am not debugging the issue or collecting any information to help solving it. If you want to share any information of this kind, please do so in one of the upstream issues.
Thanks a lot for your help in advance. 💚
Update
We have found the kernel commit introducing the issue:
Merge commit: microsoft/WSL2-Linux-Kernel@64d6a12094f3
Atomic commit: microsoft/WSL2-Linux-Kernel@dce7cd62754b5
From here on I will try to build more recent kernel versions with the commit reverted. Feel free to use these and report your experience.