docker / for-win

Bug reports for Docker Desktop for Windows
https://www.docker.com/products/docker#/windows

Exposed ports become unresponsive after heavy load #426

Closed micdah closed 4 years ago

micdah commented 7 years ago

Michael Friis directed me to submit an issue here (see issue 30400 for more).

I am experiencing an intermittent issue with Docker for Windows where suddenly all the exposed ports become unresponsive and no connection can be made to the containers. This happens when a lot of activity is put on the containers from the host machine: I am running 4 containers, and on the host machine 11 services as well as a handful of websites and APIs, all of which interact with the containers.

How to reproduce

As requested by Michael Friis, I have made some sample code which seems to be able to reproduce the issue. You can see and clone the code at github.com/micdah/DockerBomb. I have also made a YouTube video where I demonstrate the issue using the sample code: youtube.com/watch?v=v5k1D60h0zE
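
For orientation, the container under test is just a stock Redis image with its port published to the host; a compose service along these lines (a hypothetical sketch, not copied from the repo) is all that is needed:

version: '2'
services:
  redis:
    image: redis
    ports:
      - "6379:6379"   # the exposed port that later becomes unresponsive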

I have described how to use the program in the readme.md file in the GitHub repo. Note that it might take anywhere from a few minutes to considerably longer before the issue triggers; it is somewhat random, likely because it is tightly timing-related.

The sample program creates the requested number of threads, each creating a single connection to the redis container and issuing as many commands as possible until the connection fails.
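
As a rough illustration of that loop, here is a minimal sketch (not the actual DockerBomb source; it assumes the StackExchange.Redis client and the Redis container published on localhost:6379):

using System;
using System.Threading;
using StackExchange.Redis;

class StressTest
{
    static void Main(string[] args)
    {
        // Number of worker threads; the repro runs discussed below use 500.
        int threadCount = args.Length > 0 ? int.Parse(args[0]) : 500;

        for (int i = 0; i < threadCount; i++)
        {
            int id = i;
            new Thread(() =>
            {
                // One dedicated connection per thread to the exposed port.
                using (var redis = ConnectionMultiplexer.Connect("localhost:6379"))
                {
                    var db = redis.GetDatabase();
                    try
                    {
                        // Issue commands as fast as possible until the connection fails.
                        while (true)
                        {
                            db.StringSet("key-" + id, Guid.NewGuid().ToString());
                            db.StringGet("key-" + id);
                        }
                    }
                    catch (Exception ex)
                    {
                        // No reconnect attempt: the thread reports the failure and stops.
                        Console.WriteLine("Thread " + id + " failed: " + ex.Message);
                    }
                }
            }).Start();
        }
    }
}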

As demonstrated, once the issue has occurred the container becomes unresponsive on the exposed ports, although it is still running. Trying to restart the container results in an input/output error when it tries to bind to the host port. My previous issue report (30400) also includes a netstat dump showing that the restart does not fail because the port is still reserved.

Expected behavior

I would expect the container to remain accessible via the exposed ports as long as it is running. If some resource pool (handles, connection pool, etc.) is exhausted, I would expect the container to become responsive again once those resources are freed (for example, when the heavy load on the container stops).

Information

Diagnostic ID (uploaded just after the issue occurred, reproduced as described above):

30667474-C49F-4185-B957-3A7AE1F38393/2017-01-24_21-44-30

Output of docker version

Client:
 Version:      1.13.0
 API version:  1.25
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Wed Jan 18 16:20:26 2017
 OS/Arch:      windows/amd64

Server:
 Version:      1.13.0
 API version:  1.25 (minimum version 1.12)
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Wed Jan 18 16:20:26 2017
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info

Containers: 5
 Running: 1
 Paused: 0
 Stopped: 4
Images: 6
Server Version: 1.13.0
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.4-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.837 GiB
Name: moby
ID: 5DLJ:7BM4:KTMA:L5UV:ACM5:HJQP:V2W3:ZQXJ:LUS5:XEVE:FJK2:KH5K
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 21
 Goroutines: 28
 System Time: 2017-01-24T20:49:08.8436128Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
micdah commented 7 years ago

I have tested the sample code on my Mac running the newest version of Docker as well, and the issue doesn't seem to appear there. I still see intermittent connection failures across all threads at once - but there it recovers immediately and continues to work, no fuss. On Windows, the port is dead and cannot be used again until Docker has been restarted.

rn commented 7 years ago

@micdah Thanks for the very detailed repro (probably the most detailed I've seen so far!).

Just for my record, I had to install the dotnet CLI tools from here to compile and run the stress test.

The issue seems to be with a component called VPNKit. We run some stress tests against it in CI, but your case seems to put a heavier load on it.

In the logs I see a lot of messages like this:

[21:42:21.456][VpnKit         ][Error  ] com.docker.slirp.exe: tcp:0.0.0.0:6379:tcp:172.19.0.2:6379 proxy failed with flow proxy a: write failed with Eof

and this:

[19:18:11.347][VpnKit         ][Error  ] com.docker.slirp.exe: Socket.Stream: caught A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

and this:

[15:05:33.976][VpnKit         ][Error  ] com.docker.slirp.exe: Hvsock.read: An established connection was aborted by the software in your host machine.

With Beta 39, which has an updated version of VPNKit, I can also cause the issue, albeit a slightly different one:

[13:49:44.377][VpnKit         ][Error  ] Process died
[13:49:49.381][VpnKit         ][Info   ] Starting C:\Program Files\Docker\Docker\Resources\com.docker.slirp.exe --ethernet hyperv-connect://acaef826-d7a2-41db-9685-1315c13f9a40 --port hyperv-connect://acaef826-d7a2-41db-9685-1315c13f9a40 --db \\.\pipe\dockerDataBase --debug --diagnostics \\.\pipe\dockerVpnKitDiagnostics
[13:49:49.383][VpnKit         ][Info   ] Started
[13:49:49.418][VpnKit         ][Info   ] com.docker.slirp.exe: Logging to stdout (stdout:true DEBUG:false)
[13:49:49.418][VpnKit         ][Info   ] com.docker.slirp.exe: Setting handler to ignore all SIGPIPE signals
[13:49:49.418][VpnKit         ][Info   ] com.docker.slirp.exe: vpnkit version %VERSION% with hostnet version  %HOSTNET_PINNED% uwt version 0.0.3 hvsock version 0.13.0 %HVSOCK_PINNED%
[13:49:49.418][VpnKit         ][Info   ] com.docker.slirp.exe: starting port forwarding server on port_control_url:hyperv-connect://acaef826-d7a2-41db-9685-1315c13f9a40 vsock_path:
[13:49:49.418][VpnKit         ][Info   ] com.docker.slirp.exe: connecting to acaef826-d7a2-41db-9685-1315c13f9a40:0B95756A-9985-48AD-9470-78E060895BE7
[13:49:49.421][VpnKit         ][Info   ] com.docker.slirp.exe: connecting to acaef826-d7a2-41db-9685-1315c13f9a40:30D48B34-7D27-4B0B-AAAF-BBBED334DD59
[13:49:49.421][VpnKit         ][Error  ] com.docker.slirp.exe: While watching /etc/resolv.conf: ENOENT
[13:49:49.421][VpnKit         ][Info   ] com.docker.slirp.exe: hosts file has bindings for 
[13:49:49.422][VpnKit         ][Info   ] com.docker.slirp.exe: hvsock connected successfully
[13:49:49.422][VpnKit         ][Info   ] com.docker.slirp.exe: hvsock connected successfully
[13:49:49.422][VpnKit         ][Info   ] com.docker.slirp.exe: attempting to reconnect to database
[13:49:49.422][DataKit        ][Info   ] com.docker.db.exe: accepted a new connection on \\.\pipe\dockerDataBase
[13:49:49.422][DataKit        ][Info   ] com.docker.db.exe: Using protocol TwoThousandU msize 8215
[13:49:49.426][VpnKit         ][Info   ] com.docker.slirp.exe: reconnected transport layer
[13:49:49.426][VpnKit         ][Info   ] com.docker.slirp.exe: remove connection limit
[13:49:49.432][VpnKit         ][Info   ] com.docker.slirp.exe: allowing binds to any IP addresses
[13:49:49.433][VpnKit         ][Info   ] com.docker.slirp.exe: updating resolvers to nameserver 8.8.8.8#53
[13:49:49.433][VpnKit         ][Info   ] order 0
[13:49:49.433][VpnKit         ][Info   ] nameserver 8.8.4.4#53
[13:49:49.433][VpnKit         ][Info   ] order 0
[13:49:49.433][VpnKit         ][Info   ] com.docker.slirp.exe: Add(3): DNS configuration changed to: nameserver 8.8.8.8#53
[13:49:49.433][VpnKit         ][Info   ] order 0
[13:49:49.433][VpnKit         ][Info   ] nameserver 8.8.4.4#53
[13:49:49.433][VpnKit         ][Info   ] order 0
[13:49:49.435][VpnKit         ][Info   ] com.docker.slirp.exe: Creating slirp server pcap_settings:disabled peer_ip:192.168.65.2 local_ip:192.168.65.1 domain_search: mtu:8000
[13:49:49.435][VpnKit         ][Info   ] com.docker.slirp.exe: PPP.negotiate: received ((magic VMN3T)(version 13)(commit"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"))
[13:49:49.436][VpnKit         ][Info   ] com.docker.slirp.exe: TCP/IP ready
[13:49:49.436][VpnKit         ][Info   ] com.docker.slirp.exe: stack connected
[13:49:49.436][VpnKit         ][Info   ] com.docker.slirp.exe: no introspection server requested. See the --introspection argument
[13:49:49.436][VpnKit         ][Info   ] com.docker.slirp.exe: starting diagnostics server on: \\.\pipe\dockerVpnKitDiagnostics

So, this actually crashes the process and the restart does not seem to work.

On a subsequent run (after restart) I got:

[13:58:45.123][VpnKit         ][Error  ] com.docker.slirp.exe: Hvsock.read: Lwt_stream.Closed
[13:58:48.312][VpnKit         ][Error  ] com.docker.slirp.exe: Hvsock.read: Lwt_stream.Closed
[13:58:48.312][VpnKit         ][Error  ] com.docker.slirp.exe: Socket.Stream: caught Lwt_stream.Closed
[13:59:30.685][VpnKit         ][Info   ] Thread 2950 killed on uncaught exception Assert_failure("src/core/lwt.ml", 497, 9)
[13:59:30.685][VpnKit         ][Info   ] Raised at file "uwt_preemptive.ml", line 384, characters 23-26
[13:59:30.685][VpnKit         ][Info   ] Called from file "lwt_unix/lwt_hvsock_main_thread.ml", line 49, characters 6-206
[13:59:30.685][VpnKit         ][Info   ] Called from file "thread.ml", line 39, characters 8-14

With beta38, I got a different crash:

Version: 1.13.0-beta38 (9805)
Channel: Beta
Sha1: 9c31a154d11ccf6a29f009610453eab4921bc6e8
[...]
[13:23:50.020][VpnKit         ][Error  ] com.docker.slirp.exe: Lwt.async failure "Assert_failure src/core/lwt.ml:497:9": Raised at file "src/core/lwt.ml", line 497, characters 9-21
[13:23:50.020][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 201, characters 8-15
[13:23:50.020][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 201, characters 8-15
[13:23:50.020][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 201, characters 8-15
[13:23:50.020][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 201, characters 8-15
[13:23:50.020][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 305, characters 2-34
[13:23:50.020][VpnKit         ][Info   ] Called from file "uwt.ml", line 1669, characters 23-44
[13:23:50.020][VpnKit         ][Info   ] 
micdah commented 7 years ago

@rneugeba Ah yes, I forgot to mention that you needed .net core tools, that is my bad. :-) As this is such an intermittent issue, I figured a consistent way of reproducing it would be useful in diagnosing and identifying the issue. Also I was curious whether the issue was specific to one of the services we are developing (normally these don't run under Docker, just for development at the moment), or if I could reproduce it with the most basic of application code.

Very interesting findings you have made regarding the probable component - curious that the logs show different parts failing in different versions.

rn commented 7 years ago

The VPNKit code has received significant updates since 1.13.0 stable, both in the DNS handling code and on the main data path (including some performance improvements), so I'm not surprised that it fails in different ways.

Your test code seems to put considerable stress on it... If I understand the code, it tries to continuously open up to 500 connections (depending on the number entered) to the container running redis and then continuously issues requests to the container.

As mentioned, we have some similar-style tests in our regression suite, but they are slightly different.

micdah commented 7 years ago

Yes that is correct, my first instinct was that it was related to either the number of concurrent connections and/or their activity/throughput.

So my first attempt to reproduce the issue, was having a configurable number of continuously open connections with high activity.

In my example code, I make no attempt to re-connect if a connection fails - so the connections seem to be stable, until all connectivity fails at once for all open connections.

My finding is that the number of connections doesn't seem to determine whether the issue can occur, but it does seem to affect how long it takes before the issue occurs. So my expectation would be that a very specific timing between two or more concurrent connections triggers the issue; thus, the more concurrent connections and the more concurrent activity there are, the more likely the issue is to occur.

But to me, of course, this is just one big black box - so I might be totally off track. 😄

atmorell commented 7 years ago

I am seeing exactly the same behavior. VPNKit crashes under heavy load and all containers stop responding. Restarting Docker makes everything great again ;)

dmesg on the docker VM in Hyper-V shows the following error:
docker run --rm --privileged debian:jessie dmesg
[ 90.463682] device veth719b2e2 entered promiscuous mode
[ 90.463788] IPv6: ADDRCONF(NETDEV_UP): veth719b2e2: link is not ready
[ 90.463788] docker0: port 3(veth719b2e2) entered blocking state
[ 90.463788] docker0: port 3(veth719b2e2) entered forwarding state
[ 90.464119] docker0: port 3(veth719b2e2) entered disabled state
[ 90.539736] IPVS: Creating netns size=2104 id=12
[ 90.539739] IPVS: ftp: loaded support on port[0] = 21
[ 90.630341] eth0: renamed from vethc8fadb9
[ 90.680467] IPv6: ADDRCONF(NETDEV_CHANGE): veth719b2e2: link becomes ready
[ 90.680498] docker0: port 3(veth719b2e2) entered blocking state
[ 90.680498] docker0: port 3(veth719b2e2) entered forwarding state
[ 167.381158] docker0: port 4(vethb7059d0) entered blocking state
[ 167.381160] docker0: port 4(vethb7059d0) entered disabled state
dgageot commented 7 years ago

@djs55 Could you please take a look at the diagnostics?

fc0712 commented 7 years ago

I'm experiencing the same

Will include log when I'm back at my Pc

Log: 0D785797-BD32-437E-9A2D-B634D46D26BE/2017-02-03_09-12-29

Check around 6AM-6:10

Example:

Et forsøg på at oprette forbindelse mislykkedes, fordi den part, der havde oprettet forbindelse, ikke svarede korrekt efter en periode, eller en oprettet forbindelse blev afbrudt, fordi værten ikke svarede.

English:

A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

[06:06:29.373][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 497810748
[06:06:29.568][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 496860497
[06:06:29.632][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 499463544
[06:06:29.893][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 499447550
[06:06:29.995][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 496118189
[06:06:31.686][VpnKit ][Error ] com.docker.slirp.exe: Socket.Stream: caught Et forsøg på at oprette forbindelse mislykkedes, fordi den part, der havde oprettet forbindelse, ikke svarede korrekt efter en periode, eller en oprettet forbindelse blev afbrudt, fordi værten ikke svarede.
[06:06:31.686][VpnKit ][Info ]
[06:06:31.686][VpnKit ][Info ]
[06:06:32.568][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 497074625
[06:06:32.632][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 499626264
[06:06:32.995][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 496319177
[06:06:33.039][VpnKit ][Error ] com.docker.slirp.exe: Socket.Stream: caught Et forsøg på at oprette forbindelse mislykkedes, fordi den part, der havde oprettet forbindelse, ikke svarede korrekt efter en periode, eller en oprettet forbindelse blev afbrudt, fordi værten ikke svarede.
[06:06:33.039][VpnKit ][Info ]
[06:06:33.039][VpnKit ][Info ]
[06:06:33.172][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 498814677
[06:06:34.052][VpnKit ][Error ] com.docker.slirp.exe: Socket.Stream: caught Et forsøg på at oprette forbindelse mislykkedes, fordi den part, der havde oprettet forbindelse, ikke svarede korrekt efter en periode, eller en oprettet forbindelse blev afbrudt, fordi værten ikke svarede.
[06:06:34.052][VpnKit ][Info ]
[06:06:34.052][VpnKit ][Info ]
[06:06:34.318][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 498858170
[06:06:34.355][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 498820699
[06:06:34.373][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 497927573
[06:06:34.568][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 497074625
[06:06:34.632][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 499626264
[06:06:34.996][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 496319177
[06:06:36.322][VpnKit ][Error ] com.docker.slirp.exe: Socket.Stream: caught Et forsøg på at oprette forbindelse mislykkedes, fordi den part, der havde oprettet forbindelse, ikke svarede korrekt efter en periode, eller en oprettet forbindelse blev afbrudt, fordi værten ikke svarede.
[06:06:36.322][VpnKit ][Info ]
[06:06:36.322][VpnKit ][Info ]
[06:06:37.172][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 498827741
[06:06:37.427][VpnKit ][Error ] com.docker.slirp.exe: Socket.Stream: caught Et forsøg på at oprette forbindelse mislykkedes, fordi den part, der havde oprettet forbindelse, ikke svarede korrekt efter en periode, eller en oprettet forbindelse blev afbrudt, fordi værten ikke svarede.
[06:06:37.427][VpnKit ][Info ]
[06:06:37.427][VpnKit ][Info ]
[06:06:38.319][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 498869850
[06:06:38.355][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 498833839
[06:06:38.375][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 497941981
[06:06:38.449][VpnKit ][Error ] com.docker.slirp.exe: Socket.Stream: caught Et forsøg på at oprette forbindelse mislykkedes, fordi den part, der havde oprettet forbindelse, ikke svarede korrekt efter en periode, eller en oprettet forbindelse blev afbrudt, fordi værten ikke svarede.
[06:06:38.449][VpnKit ][Info ]
[06:06:38.449][VpnKit ][Info ]
[06:06:47.319][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 499062078
[06:06:47.356][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 499021687

fc0712 commented 7 years ago

Any status on this issue?

rn commented 7 years ago

we are still debugging the issue

atmorell commented 7 years ago

Is there any workaround for this crash?

micdah commented 7 years ago

Not to be pushy, but any updates on this issue?

If there is anything I can do to help (such as testing beta/development versions), I'll be happy to oblige.

atmorell commented 7 years ago

Almost thought it was solved by the latest update, but my containers became unresponsive after 1-2 hours of high load.

rn commented 7 years ago

Unfortunately, the main developer of the component in question, VPNKit, was busy with some other work, but he will take a look this week. @micdah your reproduction is very useful for debugging this. Again, thanks for the effort in putting it together.

djs55 commented 7 years ago

Sorry for the delay -- I'm planning to investigate this issue this week.

I notice in @rneugeba's logs quoted above an assertion failure in lwt/uwt (the libraries we use on top of libuv for asynchronous I/O on Windows and Mac). There is a newer version of uwt containing various bug fixes -- I'll see if that makes a difference first, since I was planning an upgrade anyway.

The Mac and Windows versions of the com.docker.slirp.exe process differ slightly in how they talk to the VM -- on the Mac we use a Unix domain socket, while on Windows we use Hyper-V sockets. The Hyper-V socket code happens to stress the Uwt_preemptive module quite heavily, which I notice is mentioned in @rneugeba's quoted traceback. It may be possible to redesign that code to avoid that module if it turns out to be buggy.

I'll keep you informed of progress!

djs55 commented 7 years ago

I've been running the repro case on beta build number 10183 which was briefly released this week and then recalled (due to discovering a serious bug elsewhere). The test has been running for several hours with no problems. I'll leave it over the weekend to see what happens. The main significant change in this build is the libuv/uwt upgrade.

micdah commented 7 years ago

How did it go with stress testing beta build 10183? Very excited to hear news on this issue. 😄

djs55 commented 7 years ago

I left the test running for a few more days but then my machine rebooted to install patches. Could you try again with the latest version of docker? Note the version numbering scheme is now date based -- version 17.03.0-ce was released today: https://store.docker.com/editions/community/docker-ce-desktop-windows?tab=description

micdah commented 7 years ago

I have just updated my docker installation and tried to run my repro again, sadly it still seems to trigger.

This time though, at first I was able to restart the docker container without error (previously I got an exception when it tried to bind to the same port when starting the container again), but restarting the container doesn't fix the connectivity - it remains unreachable after the crash.

Output from docker -v:

Docker version 17.03.0-ce, build 60ccb22

After restarting the container and re-running my program to check whether the container had become reachable (and verifying that it remained unreachable), I tried restarting the container once more (using docker-compose restart each time) and received a new error I haven't seen before:

Error response from daemon: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.

It seems like the upgrade of VPNKit has had an effect on the issue, but it doesn't seem to have fixed it.

Any docker command (other than asking for the version) issued after the above error message appeared results in the same message, even a simple docker ps.

I have uploaded a diagnostics dump just after reproducing the error, restarting the container once, re-running and confirming still unreachable, and restarting a second time and encountering the above new error:

30667474-C49F-4185-B957-3A7AE1F38393/2017-03-02_19-03-47
micdah commented 7 years ago

Just tried to reproduce it once more to verify; this time it happened again (both times it seemed to trigger more quickly than previously, but that might just be imagination or random luck).

This time, after re-running the program to verify the container is unreachable, and then trying to restart the container (docker-compose restart), I received the same error message as originally:

$ docker-compose restart
Restarting dockerbomb_redis_1 ... error

ERROR: for dockerbomb_redis_1  Cannot restart container 96579f13ce0f4322817342186728a646e3a33af3e7b6c91ead73a6becafb0742: driver failed programming external connectivity on endpoint dockerbomb_redis_1 (64d3c0dada115d7f483974802d85f3a8ff5851e4469b9c61db63c4cd0c4ae646): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:6379:tcp:172.19.0.2:6379: input/output error

Odd that it let me restart the container the first time around but not this time. Might just be due to the randomness of the undefined state it ends up in when the bug triggers.

Here is a diagnostics dump from my second reproduction:

30667474-C49F-4185-B957-3A7AE1F38393/2017-03-02_19-13-05
rn commented 7 years ago

@micdah thanks for testing. Is there anything in the logs? Look for the lines prefixed with vpnkit.

I was able to reproduce it easily with earlier versions, but I'm currently travelling so I can't retest at the moment.

micdah commented 7 years ago

@rneugeba Just re-ran the test again, so I could pinpoint the exact part of the log relevant to the point when the error occurs.

This is what was output, just when the error triggered:

[19:36:07.862][VpnKit         ][Error  ] Process died
[19:36:12.862][VpnKit         ][Info   ] Starting C:\Program Files\Docker\Docker\Resources\com.docker.slirp.exe --ethernet hyperv-connect://f08ee3f6-1270-41b8-8bd4-1874485c066a --port hyperv-connect://f08ee3f6-1270-41b8-8bd4-1874485c066a --db \\.\pipe\dockerDataBase --debug --diagnostics \\.\pipe\dockerVpnKitDiagnostics
[19:36:12.863][VpnKit         ][Info   ] Started
[19:36:12.883][VpnKit         ][Info   ] com.docker.slirp.exe: Logging to stdout (stdout:true DEBUG:false)
[19:36:12.883][VpnKit         ][Info   ] com.docker.slirp.exe: Setting handler to ignore all SIGPIPE signals
[19:36:12.883][VpnKit         ][Info   ] com.docker.slirp.exe: vpnkit version c41f7c8589352c95b14de636c895e8fbd72222e5 with hostnet version   uwt version 0.0.3 hvsock version 0.13.0 
[19:36:12.883][VpnKit         ][Info   ] com.docker.slirp.exe: starting port forwarding server on port_control_url:hyperv-connect://f08ee3f6-1270-41b8-8bd4-1874485c066a vsock_path:
[19:36:12.883][VpnKit         ][Info   ] com.docker.slirp.exe: connecting to f08ee3f6-1270-41b8-8bd4-1874485c066a:0B95756A-9985-48AD-9470-78E060895BE7
[19:36:12.885][VpnKit         ][Info   ] com.docker.slirp.exe: connecting to f08ee3f6-1270-41b8-8bd4-1874485c066a:30D48B34-7D27-4B0B-AAAF-BBBED334DD59
[19:36:12.885][VpnKit         ][Error  ] com.docker.slirp.exe: While watching /etc/resolv.conf: ENOENT
[19:36:12.885][VpnKit         ][Info   ] com.docker.slirp.exe: hosts file has bindings for client.openvpn.net ##URL-REMOVED-BY-MICDAH##
[19:36:12.886][VpnKit         ][Info   ] com.docker.slirp.exe: hvsock connected successfully
[19:36:12.886][VpnKit         ][Info   ] com.docker.slirp.exe: hvsock connected successfully
[19:36:12.886][VpnKit         ][Info   ] com.docker.slirp.exe: attempting to reconnect to database
[19:36:12.886][DataKit        ][Info   ] com.docker.db.exe: accepted a new connection on \\.\pipe\dockerDataBase
[19:36:12.886][DataKit        ][Info   ] com.docker.db.exe: Using protocol TwoThousandU msize 8215
[19:36:12.888][VpnKit         ][Info   ] com.docker.slirp.exe: reconnected transport layer
[19:36:12.888][VpnKit         ][Info   ] com.docker.slirp.exe: remove connection limit
[19:36:12.898][VpnKit         ][Info   ] com.docker.slirp.exe: allowing binds to any IP addresses
[19:36:12.900][VpnKit         ][Info   ] com.docker.slirp.exe: updating resolvers to nameserver 192.168.1.1#53
[19:36:12.900][VpnKit         ][Info   ] order 0
[19:36:12.900][VpnKit         ][Info   ] com.docker.slirp.exe: Add(3): DNS configuration changed to: nameserver 192.168.1.1#53
[19:36:12.900][VpnKit         ][Info   ] order 0
[19:36:12.902][VpnKit         ][Info   ] com.docker.slirp.exe: Creating slirp server peer_ip:192.168.65.2 local_ip:192.168.65.1 domain_search: mtu:8000
[19:36:12.902][VpnKit         ][Info   ] com.docker.slirp.exe: PPP.negotiate: received ((magic VMN3T)(version 13)(commit"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"))
[19:36:12.902][VpnKit         ][Info   ] com.docker.slirp.exe: TCP/IP ready
[19:36:12.902][VpnKit         ][Info   ] com.docker.slirp.exe: stack connected
[19:36:12.902][VpnKit         ][Info   ] com.docker.slirp.exe: no introspection server requested. See the --introspection argument
[19:36:12.902][VpnKit         ][Info   ] com.docker.slirp.exe: starting diagnostics server on: \\.\pipe\dockerVpnKitDiagnostics

There are no logs prior to this; I noted where the output ended before starting my program, so this is the entirety of the log output from when the error triggered.

Is there any way to increase the verbosity of the output? It seems odd that the vpnkit process would just die without any error or exception.

atmorell commented 7 years ago

Crashing after a few minutes with the new build. Trying to start anything gives the following error:

Error response from daemon: driver failed programming external connectivity on endpoint sonarr (06921e1a5b924edb2347fde1b1e0ed18707e2508cb0052e090dbcaae930fa05b): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:8989:tcp:172.17.0.4:8989: input/output error Error: failed to start containers: sonarr

micdah commented 7 years ago

Yeah, today, while trying to demonstrate to a few colleagues a development environment I have been trying out on my machine for the past few months, Docker simply wouldn't start - it sat for a few minutes before showing an error message, and the icon turned red.

Sadly I was in a hurry, so I didn't have time to look for anything in the logs - I just tried starting Docker again and now it works (the machine has been restarted).

Of course, these things will happen, but it is the first time I have ever experienced that failure.

bluevulpine commented 7 years ago

I've got a 'me, too!' on this.

I'm running Minecraft in a docker container, so there's a lot of network activity out to the four people connected, and this is happening 2-3 times a night when we're all connected. When it happens everyone gets booted, so it's easy to know when to go check the logs. The docker logs look the same as micdah's - VpnKit posts a "Process died" message, then proceeds through its restart. I can usually restore things to working condition by triggering a complete Docker restart.

This is occurring with Docker Community Edition, Version 17.03.0-ce-win1 (10296), Channel: stable, e5a07a1, running on Windows 10 version 1607 build 14393.693.

Diagnostic:

0F9BB5CA-F560-4D34-B4A4-881C5409B32C/2017-03-05_03-41-01

atmorell commented 7 years ago

Any progress @djs55?

micdah commented 7 years ago

Hey, it's been a couple of weeks now - don't want to pressure anybody, but an update on this issue would be greatly appreciated. :-)

friism commented 7 years ago

@micdah As a workaround until this is fixed, can you run your integration tests on a stand-alone Linux machine or on a VM created with docker-machine? These won't exhibit the problem.

docker-machine create -d hyperv --hyperv-virtual-switch <some-virtual-switch-you-created>
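
In full, the workaround amounts to creating a Hyper-V based docker-machine VM and pointing the docker CLI at it. A rough PowerShell sequence (the machine and switch names here are placeholders; the env step follows the docker-machine documentation):

docker-machine create -d hyperv --hyperv-virtual-switch "ExternalSwitch" issue426-vm
docker-machine env --shell powershell issue426-vm | Invoke-Expression   # point the docker CLI at the new VM
docker run -d -p 6379:6379 redis                                        # ports are now forwarded by the VM's own network stack

Because the containers then run inside a plain Linux VM, port forwarding no longer goes through VPNKit, which is why this setup should not exhibit the problem.
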
SC7639 commented 7 years ago

I am having the same issue: my nginx container stops accepting connections (it seems to have frozen), and when I try to restart the container, either using docker start vin_web_1 or docker-compose up, I get the following error:

Error response from daemon: driver failed programming external connectivity on endpoint vin_web_1 (013d29d8164ed5655c4b193c0198a8e9c4cafb163912326f2c19887f1d51b79e): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:80:tcp:172.18.0.3:80: input/output error
Error: failed to start containers: vin_web_1

docker version info:

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 10:40:59 2017
 OS/Arch:      windows/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   3a232c8
 Built:        Tue Feb 28 07:52:04 2017
 OS/Arch:      linux/amd64
 Experimental: true

I've also uploaded a diagnostic with id: 7CE00045-6608-4DA4-9E8C-4479902B87FA/2017-03-21_11-50-37

If any other information is required to help with this issue let me know and I will provide it.

Restarting Docker for Windows sometimes solved the issue. Just this afternoon I had to completely re-install Docker for Windows. Another fix I've found is closing a bunch of programs so that I'm using under 50% of the CPU.

Naragato commented 7 years ago

Myself and several colleagues are experiencing the same issue. Restarting docker then the containers temporarily resolves the issue.

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 10:40:59 2017
 OS/Arch:      windows/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   3a232c8
 Built:        Tue Feb 28 07:52:04 2017
 OS/Arch:      linux/amd64
 Experimental: true

A diagnostic was uploaded with id: D5E01A54-0F2C-479B-8E3B-F5CD022D8D92/2017-03-22_14-47-30

AngryBeaver commented 7 years ago

same here

mittork commented 7 years ago

Any progress @djs55 ?

tparikka commented 7 years ago

I have also been encountering this problem. Restarting Docker and the containers has worked for me as well. I am using Docker for end-to-end web app UI tests, with a single container running selenium/hub:3.2.0 and ten containers running selenium/node-chrome:3.2.0, spun up with docker-compose. The hub container seems to be the one that bails out, and it tends to do so between 5 and 15 minutes from the time I start running the test fixtures.
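
For reference, a grid of that shape can be described with a compose file roughly like the following (a hypothetical sketch, not the poster's actual file; it assumes the HUB_HOST/HUB_PORT environment variables supported by the 3.x node images):

version: '2'
services:
  hub:
    image: selenium/hub:3.2.0
    ports:
      - "4444:4444"   # the exposed port the test fixtures talk to
  chrome:
    image: selenium/node-chrome:3.2.0
    environment:
      - HUB_HOST=hub
      - HUB_PORT=4444
    depends_on:
      - hub

The ten Chrome node containers would then be started with something like docker-compose scale chrome=10.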

Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Tue Mar 28 00:40:02 2017
 OS/Arch:      windows/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Fri Mar 24 00:00:50 2017
 OS/Arch:      linux/amd64
 Experimental: true

I've uploaded diagnostic 7EBC5B9F-3979-420B-B185-E677DE90285E/2017-03-31_17-22-24.

Naragato commented 7 years ago

Still no one assigned to the bug; it's pretty bad. The issue has existed for quite a while now...

bvitale commented 7 years ago

We were unable to adopt Docker for Windows due to this issue. Had to revert to a VM based approach.

markopadjen commented 7 years ago

Fix this pls.

tparikka commented 7 years ago

@rneugeba @djs55 Is it possible to provide an update on progress, even if it's just that it is stalled waiting on resources to become available to investigate? The last real update I see was a suggested workaround by @friism 27 days ago, but there hasn't been any response on the submitted diagnostics since March 2. Thank you for your time!

jeanlaurent commented 7 years ago

Hi everyone,

Sorry for the lack of response; we don't have a clear answer yet. @djs55 tried to reproduce the example, which ran for several days without the problem appearing.

We're going to re-run it on a spare system set up specifically for this issue, for a longer time, and see what comes out of it.

We will publish the results here.

Thanks for your patience.

micdah commented 7 years ago

@jeanlaurent From your last comment, it sounded like the reproduction didn't work anymore (being able to run for days without encountering the issue).

So to check whether the reproduction code was still relevant, I tried re-running it with my current (latest) version of Docker for Windows. I tried a total of three times, and the issue occurred each time.

I noticed that the point at which it hits the issue seemed more "random": the first run went for about a minute before dying, the second took well over 10 minutes, and the third failed even before all the threads were up and running.

Each time I ran with 500 threads.

mattjslack commented 7 years ago

The same issue occurs for me, running a single Cloudera QuickStart docker container: under medium network load, all network ports die within half an hour to a few hours. The error messages in the log files are the same as those already reported.

djs55 commented 7 years ago

@micdah sorry for the delay getting back to you. I've been running the example program with 500 threads for a little over 24 hours now with no sign of a problem. I'm using the latest edge version:

Version 17.05.0-ce-rc1-win8 (11189)
Channel: edge
73d01bb

on Windows version 1607 (OS Build 14393.953).

Could you upgrade to the latest edge version, reproduce the problem and upload a fresh diagnostic?

SC7639 commented 7 years ago

I've found that it only happens when my CPU is at 100% for more than a couple of hours. I found what was causing the massive CPU usage and I've not had the issue since.

tparikka commented 7 years ago

@djs55 I installed the edge release and ran through my full suite of Selenium tests and did not encounter failures, so it seems good so far on my end.

EDIT: Appears I spoke too early, I just ran into the issue again this morning.

micdah commented 7 years ago

@djs55 I've just updated my docker to edge

# docker -v
Docker version 17.05.0-ce-rc1, build 2878a85

Running my example code, after approximately 10 minutes all the connections died. Here is a diagnostic ID:

30667474-C49F-4185-B957-3A7AE1F38393/2017-04-21_07-52-01

Restarting the container doesn't do anything either. A restart of Docker is still needed.

djs55 commented 7 years ago

I still see an error like this in the logs:

[07:49:21.959][VpnKit         ][Error  ] vpnkit.exe: Lwt.async failure "Assert_failure src/core/lwt.ml:500:9": Raised at file "src/core/lwt.ml", line 500, characters 9-21
[07:49:21.960][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 202, characters 8-15
[07:49:21.960][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 202, characters 8-15
[07:49:21.960][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 202, characters 8-15
[07:49:21.960][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 202, characters 8-15
[07:49:21.960][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 207, characters 8-15
[07:49:21.960][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 290, characters 6-42
[07:49:21.960][VpnKit         ][Info   ] Called from file "uwt.ml", line 1798, characters 23-44

Do you know if all network traffic stops, or is it only ports? I notice these errors too:

[07:51:35.026][VpnKit         ][Error  ] vpnkit.exe: Upstream DNS server 8.8.4.4:53 has dropped 6 packets in a row: assuming it's offline
[07:51:35.026][VpnKit         ][Error  ] vpnkit.exe: Upstream DNS server 8.8.8.8:53 has dropped 6 packets in a row: assuming it's offline
[07:51:35.026][VpnKit         ][Error  ] vpnkit.exe: Upstream DNS server 192.168.0.2:53 has dropped 6 packets in a row: assuming it's offline

-- this could be real or it could be a cascade failure.

micdah commented 7 years ago

It was only the ports that stopped; I was working on the machine while running the test and didn't notice any network outage.

I did, however, have an active VPN connection to our Amazon environment, but only traffic to/from our servers is routed through that; the rest goes straight out of the house.

Bazmcl commented 7 years ago

Seem to be suffering from the same issue - not opening a lot of ports, but a lot of traffic.

Uploaded diagnostic EB0E3EF6-6538-4B43-ADB9-8CFE688B9BBF/2017-05-04_09-26-08

tparikka commented 7 years ago

@djs55 Has there been any new information unearthed since @micdah confirmed it was only the ports that stopped in his environment?

bvitale commented 7 years ago

A diagnostic was uploaded with id: 0BFBE273-80BD-45FB-AD1A-36352590E93D/2017-05-16_13-44-49

[13:41:50.640][VpnKit         ][Warning] vpnkit.exe: ARP table has no entry for 172.18.0.2
[13:41:50.640][VpnKit         ][Error  ] vpnkit.exe: PPP.listen callback caught Ipv4.Make(Ethif)(Arpv4).Routing.No_route_to_destination_address(_)
[13:42:02.762][VpnKit         ][Error  ] vpnkit.exe: Hvsock.shutdown_write: got Eof
[13:42:02.762][VpnKit         ][Error  ] vpnkit.exe: tcp:0.0.0.0:5000:tcp:172.18.0.15:3000 proxy failed with flow proxy b: write failed with Eof
[13:42:02.970][VpnKit         ][Error  ] vpnkit.exe: Socket.Stream: caught An existing connection was forcibly closed by the remote host.

Windows 10 14393.1066

My scenario is sending a large number of requests to the same port.

tparikka commented 7 years ago

@djs55 or @jeanlaurent Has there been any progress on this issue, and any more tests/diagnostics we can provide that would assist at this time? Thank you for your help!