docker / machine

Machine management for a container-centric world
https://docs.docker.com/machine/
Apache License 2.0
6.63k stars 1.97k forks source link

virtualbox driver in timeout state after several minutes of VM use #3599

Open zdexter opened 8 years ago

zdexter commented 8 years ago

Environment

docker docker-machine osx virtualbox
1.11.2, build b9f10c9 0.7.0, build a650a40 El Capitan 10.11.5 5.0.20 r106931

Network Information

interface subnet netmask
en0 192.168.99.0/24 0xffffff00
vboxnet0 192.168.128.0/24 0xffffff00

I have various other network interfaces, but am not aware that any others are relevant. None overlap with the 192.168.99.0/24 subnet.

Reproduction steps

  1. docker-machine start my-machine
  2. interact with the machine via the osx-based Docker client for 20-30 minutes

... then I can observe the machine in a Timeout state during docker-machine ls. active docker exec sessions print something like:

read tcp 192.168.99.1:52922->192.168.99.100:2376: read: no route to host

Other possible causes

Sleeping the host seems to make this happen more often, and more quickly than 20-30min. Sometimes I use the VM after returning from a sleep before it enters a Timeout state, and sometimes it's already in a Timeout state.

Past reports

This PR has a unit test that asserts that docker-machine should raise an error in a variety of subnet-overlap cases, but I don't have any other subnets besides vboxnet0 that use 192.168.99.0/24. https://github.com/docker/machine/pull/3165

Known temporary fixes

Restarting the osx host results in a fix that lasts 20min-1hr.

rmb938 commented 8 years ago

I am having the same issues except using docker 1.12 and machine 0.8.0 build b85aac1

rmb938 commented 8 years ago

After some testing it seems like it has something to do with the NAT network within virtualbox failing. It seems to be triggered at least in my case by "large" downloads and no amount of restarting my machine or the vm seems to be able to get around the issue.

zdexter commented 8 years ago

in my case, restarting the vm's host fixes the issue. no observed correlation with downloads.

rmb938 commented 8 years ago

Well restarting the vm works until I do something more complex then ping inside a container. Then 85-90% of the time anything that has to go over the NAT network dies.

It seems like it is a issue with the virtualbox NAT network but I don't really know a good way to test.

rmb938 commented 8 years ago

I downgraded to use the 1.11.2 boot2docker image and the issue is fixed. It seems like there is a network driver issue in the newer kernels. @zdexter can you confirm this on your end?