Stichting-MINIX-Research-Foundation / minix

Official MINIX sources - Automatically replicated from gerrit.minix3.org

pkgin update stalls out on VMware Player in 3.4.0rc6 #221

Open andrusky opened 7 years ago

andrusky commented 7 years ago

I recently tried installing 3.4.0rc6 in a virtual machine (type: other) on VMware Workstation Player 12.5.

The installation was successful, but when attempting to perform the command "pkgin update" I've found that the download of the file pkg_summary.bz2 stalls out before completing. I've seen it stop after downloading as little as 17% of the file, and after downloading as much as 83%, but I have never gotten it to complete.

I've tried this with the default virtual NIC (vmlance) and with the alternate (e1000).

Some initial investigation was done with help from people in #minix-support, and I was asked to file this ticket. Some relevant text:

by the looks of it, the problem is with outgoing packets; tcpdump is showing the webserver's TCP stack sending zero-window probes, while the local client is ack'ing everything with a fully open window
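For context on the diagnosis above: a zero-window probe is a segment carrying a single byte of data that a TCP sender transmits when its peer has advertised a receive window of zero, so that the sender can learn when the window opens again. The sketch below is only a toy illustration of the conservative receiver behaviour suggested by RFC 793 (data outside a zero window is not acceptable, but it still gets an ACK); it is not lwIP or VMware code, and all names in it are invented for illustration.

```c
/*
 * Toy illustration of a TCP zero-window probe, NOT lwIP or VMware code.
 * When the receiver advertises a window of 0, the sender periodically
 * sends a 1-byte probe so it can learn when the window opens again.
 * A conservative receiver does not accept the probe byte as data; it
 * simply re-ACKs its current position and window.
 */
#include <stdint.h>
#include <stdio.h>

struct seg {
    uint32_t seq;        /* sequence number of the first data byte */
    uint32_t len;        /* number of data bytes (1 for a window probe) */
};

struct rcv_state {
    uint32_t rcv_nxt;    /* next sequence number the receiver expects */
    uint32_t rcv_wnd;    /* window the receiver currently advertises */
};

/* Accept only data that fits in the advertised window; in all cases,
 * reply with an ACK carrying the current rcv_nxt and window. */
static void receive_segment(struct rcv_state *rs, const struct seg *s,
                            uint32_t *ack_out, uint32_t *wnd_out)
{
    if (rs->rcv_wnd > 0 && s->seq == rs->rcv_nxt && s->len <= rs->rcv_wnd) {
        rs->rcv_nxt += s->len;       /* in-window data: accept it */
        rs->rcv_wnd -= s->len;
    }
    /* zero window or out-of-window data: accept nothing */

    *ack_out = rs->rcv_nxt;
    *wnd_out = rs->rcv_wnd;
}

int main(void)
{
    struct rcv_state rs = { .rcv_nxt = 1000, .rcv_wnd = 0 };
    struct seg probe = { .seq = 1000, .len = 1 };
    uint32_t ack, wnd;

    receive_segment(&rs, &probe, &ack, &wnd);
    printf("probe seq=%u -> ack=%u wnd=%u (probe byte not accepted)\n",
           (unsigned)probe.seq, (unsigned)ack, (unsigned)wnd);
    return 0;
}
```

In the capture quoted above, the odd part is that the server keeps probing as if the window were closed, even though the MINIX client is advertising an open window.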
dcvmoole commented 7 years ago

Thanks again :) I am marking this as a release blocker, as the issue is unlikely to be limited to VMware, and thus has the potential to prevent many people from installing packages after installing MINIX.

I am still looking into the issue, but so far, it seems to be a mismatch of behaviors between lwIP and our web server (which I believe is running Linux) when it comes to processing of TCP zero-window probes sent to lwIP. As such, this would likely be a bug in lwIP - if that is indeed the case, hopefully we can resolve the bug in lwIP and then import the new lwIP version into MINIX. At the very least, this will take some time, though..

-David/Saturn

Edit: just to clarify, while I don't see a reason for this problem to be limited to VMware, I have so far managed to reproduce the issue on VMware only. However, on VMware, the issue does indeed occur all the time. My best guess would be that this has something to do with the way it does CPU or network scheduling.

sambuc commented 7 years ago
  1. Just as a question: on which OS are you running VMware? I have been using VMware Fusion (running on Mac OS X) to test the new network stack, and it worked. I even did the last full pbulk of PKGSRC in that VM.

    So either this is a problem introduced by the small fixes we made compared to the first versions of the lwIP stack, or maybe this is an issue with VMware on Windows?

    I will come back with feedback after upgrading that VM to master.

  2. The Minix web server is running lighttpd + Linux 3.13.

dcvmoole commented 7 years ago

Both of us are using Windows (@andrusky mentioned Windows 10, I'm using Windows 7). If I recall correctly, there are quite a few important under-the-hood differences between VMware Workstation/Player and VMware Fusion. On IRC, stux reports that the same problem does not occur with VMware 6.0.5 on Windows 7 (32-bit).

In addition, as far as I can tell, nothing that could affect this issue has changed in the network stack code after the version that you tried. I would indeed love to hear how minix-current is doing on Fusion, though, just to eliminate that angle.

For now, I think it all adds up to the preliminary conclusion that it takes a fairly specific VMware version and possibly host platform to trigger this problem, which might be somewhat good news. However, judging from packet dumps, this is not a problem that can be caused by VMware, and thus may still occur anywhere else as well - although perhaps not as easily.

andrusky commented 7 years ago

Yes, I'm running Windows 10.

I just tried VirtualBox with rc6 and "pkgin update" downloaded the file just fine. The network device in the VirtualBox VM was le0, like when I used the vmlance device in VMware, so the two must be emulating similar devices.

sambuc commented 7 years ago

After rebuilding master and rebooting, still no issues here.

Can one of you provide me with your buggy VM, so that I can run it on my machine and see whether I can reproduce the problem?

I will do the same so that you can try my VM in your environments. That way, at least, we will be sure to know if this is a VMware/Windows issue or not.

andrusky commented 7 years ago

I can upload a tgz of my Minix VM folder to my university homepage tonight. I'm on a metered connection at the moment.

Or will github take a file of that size?

andrusky commented 7 years ago

You can download my Minix VM Folder at:

https://sites.ualberta.ca/~kla2/Minix.tgz

Oh, I guess I should mention: I changed the root password to minix.

sambuc commented 7 years ago

I did a couple of quick tests:

  1. Mac OS X 10.12.4

    • I used your VM on VMware Fusion 7.1.3 (3204469)
    • I had to downgrade the hardware compatibility level to 11, as the one used was newer than what my version knows about. It seems VMware Fusion 8 is out, but I don't have a license for that.
    • Works perfectly there
  2. Windows 10 Pro, 64bit (build 15063) 10.0.15063

    • I used your VM on VMware Player 12.5.5 build-5234757
    • As reported, I could reproduce the network stall
    • By switching the network card from NAT to bridged, I was able to have a working pkgin. Activating or not "Replicate physical network connection state" doesn't change anything.

So while I have a working workaround, there is an issue with VMware NAT networking on recent products for Windows.
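(For anyone applying the bridged-networking workaround by editing the VM configuration directly rather than through the GUI: assuming a typical .vmx layout, this corresponds to the `ethernet0.connectionType` setting, e.g. `"nat"` versus `"bridged"`, though the exact key may differ between VMware products and versions.)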

dcvmoole commented 7 years ago

Thanks for the tests, @sambuc. That nicely matches my own updated view of the problem..

..which is that this is primarily a VMware problem after all. Based on Wireshark packet captures of both the pre-NAT side of the VMware virtual network (VMnet8) and the post-NAT physical interface of my system, I can only conclude that, at least on Windows, VMware's NAT system uses transparent TCP proxying rather than packet-level NAT. As such, VMware's NAT system comes with its own TCP/IP implementation to deal with TCP connections between the virtual machine and VMware's transparent proxy.

The issue in this thread, then, is the result of a mismatch between a peculiarity of lwIP's receive window management and a rather restrictive TCP/IP implementation in VMware's TCP proxy. In other words: while MINIX's LWIP service can talk to the minix3 webserver directly just fine, the service has problems talking to the VMware NAT implementation. This explains why this is not an issue on real systems or VirtualBox, and it also explains why only VMware's NAT mode is affected, and not its bridging mode.

Given that the big-picture issue (problems downloading packages from the minix3 webserver) turns out to be limited to VMware, and we have a workaround (bridging mode), I think it is safe not to consider this a release blocker after all. This issue should still be fixed eventually, since 1) given that VMware works fine with other OSes, the real fault is likely with the lwIP TCP/IP stack and not with VMware's, and 2) other (less common) TCP/IP stacks may be just as restrictive as VMware's, in which case lwIP and thus MINIX would now have trouble talking to those stacks as well.

However, I have taken a look at the details, and so far it seems that the solution is not exactly trivial. My current understanding is that VMware's TCP/IP stack makes the hard assumption that zero-window probes will never be accepted as real data, while in lwIP there are at least two scenarios where such probes are in fact accepted as real data, causing a desynchronization that results in the stalls.
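To make the failure mode described above a bit more concrete, here is a deliberately simplified toy model in C. It is not the actual lwIP or VMware code, the structures and field names are invented, and the real desynchronization may differ in detail; it only shows how a sender that hard-codes "a probe byte is never accepted" can end up discarding the very ACKs, and the window updates they carry, that it needs to resume the transfer.

```c
/*
 * Deliberately simplified toy model of the desynchronization described
 * above.  This is NOT the actual lwIP or VMware code and the real
 * failure mode may differ in detail; all names are invented.  It only
 * shows how a sender that hard-codes "a probe byte is never accepted"
 * can end up discarding the ACKs (and window updates) it needs.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct sender {            /* toy stand-in for the proxy's sending side */
    uint32_t snd_una;      /* oldest unacknowledged sequence number */
    uint32_t snd_max;      /* highest byte it considers "really sent";
                              the probe byte is deliberately excluded */
    uint32_t peer_wnd;     /* last window update it actually processed */
};

struct receiver {          /* toy stand-in for the guest's receiving side */
    uint32_t rcv_nxt;      /* next sequence number it expects */
    bool accepts_probe_byte;
};

/* Receiver: optionally accept the probe byte as real data, then ACK. */
static uint32_t receiver_ack_for_probe(struct receiver *r, uint32_t probe_seq)
{
    if (r->accepts_probe_byte && probe_seq == r->rcv_nxt)
        r->rcv_nxt += 1;             /* probe byte consumed as data */
    return r->rcv_nxt;               /* cumulative ACK */
}

/* Sender: an ACK covering bytes beyond snd_max looks "impossible" and
 * is dropped, so the window update it carries is lost as well. */
static void sender_handle_ack(struct sender *s, uint32_t ack, uint32_t wnd)
{
    if (ack > s->snd_max) {
        printf("  sender: ack %u beyond snd_max %u -> dropped, "
               "window update lost\n", (unsigned)ack, (unsigned)s->snd_max);
        return;
    }
    s->snd_una = ack;
    s->peer_wnd = wnd;
}

int main(void)
{
    struct sender   snd = { .snd_una = 5000, .snd_max = 5000, .peer_wnd = 0 };
    struct receiver rcv = { .rcv_nxt = 5000, .accepts_probe_byte = true };

    for (int round = 1; round <= 3; round++) {
        uint32_t ack = receiver_ack_for_probe(&rcv, 5000); /* probe seq 5000 */
        printf("round %d: probe seq=5000 -> receiver acks %u and has room again\n",
               round, (unsigned)ack);
        sender_handle_ack(&snd, ack, 65535);
    }

    printf("sender still believes the peer window is %u -> transfer stalls\n",
           (unsigned)snd.peer_wnd);
    return 0;
}
```

In this toy model the receiver consumes the probe byte and moves on, while the sender keeps treating the resulting ACKs as invalid and never learns that the window has reopened, so the transfer makes no further progress, which is one way a stall like the one in this thread could arise.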

andrusky commented 7 years ago

Just confirming that using bridged mode works for me as well.

dcvmoole commented 7 years ago

Upon even closer inspection of the packet logs, and after checking against standards and common practice, I can only conclude that this is entirely the fault of VMware's NAT facilities. While lwIP, or our configuration of it, probably does some not-so-standard stuff that triggers the problem here, it is the VMware side that appears to be actually buggy, specifically by violating the TCP specification. This leaves us in a tough spot: we cannot change lwIP to behave more like other implementations (or rather, we cannot reasonably expect such changes to be accepted by the lwIP maintainers, nor do we want to maintain such things ourselves), while it is also very unlikely that we can convince VMware to fix their bug, especially considering Workstation is.. no longer a priority for them.

In summary, we can do little to deal with this issue aside from documenting the "use bridging rather than NAT" part and perhaps "do not use VMware if you don't have to", so I am removing the 'bug' label here as well. I am leaving the issue open until we're happy with the way it's taken care of, and if I find some more time, I'll try to do some more analysis to make sure that my assessment is indeed correct.

JiiPee74 commented 7 years ago

I'm facing the same issue with Hyper-V under Win 10 Pro.

Installation went just fine; I had to add a legacy NIC to the config and got the network up.

shtjonathan commented 7 months ago

A bit late to the party, but I just might have gotten a similar issue... Not during "pkgin update" but during "pkgin_all" with Minix 3.4 rc6 on VirtualBox 7.0.14 on macOS 14.2.1.

Minix installation went fine, but while running pkgin_all, I had to walk over to another campus building (keeping my laptop open and running). When I arrived, I noticed that the pkgin_all process stalled at 66% of downloading tex-lh-doc-3.5g.tgz. So pkgin_all ran fine for many hours, all the way until it was around 75% finished...

Maybe during my walk I lost the Wi-Fi internet connection for a short while, which might have triggered the stall? Or maybe it got stalled when my host OS screen got locked and I had to log in again?