coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/
146 stars 30 forks source link

Hang writing to NFS share on OSX + VirtualBox + vagrant #2069

Open btalbot opened 7 years ago

btalbot commented 7 years ago

Issue Report

Bug

In development environments using VirtualBox and Vagrant on OSX, writing to an NFS share hangs under some write loads. The hang occurs with the current alpha (1478.0.0), and beta (1465.2.0) but NOT with the current stable (1409.7.0).

The procedure shown below is just one way to hang the process and works 100% of the time for me. Anything that writes to the NFS share can hang: untar a tarball, using ruby bundle install, etc. The hang happens with no containers and when writing from docker containers.

Container Linux Version

$ cat /etc/os-release
ID=coreos
VERSION=1478.0.0
VERSION_ID=1478.0.0
BUILD_ID=2017-07-19-0038
PRETTY_NAME="Container Linux by CoreOS 1478.0.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

$ vagrant --version
Vagrant 1.9.7
$ VBoxManage --version
5.1.24r117012
$ uname -a
Darwin btalbot-lt 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64

Expected Behavior

The write completes successfully.

Actual Behavior

The write process hangs and cannot be killed. If the write is from a container process, the container cannot be stopped.

Reproduction Steps

  1. From someplace in $HOME: git clone https://github.com/coreos/coreos-vagrant
  2. cd coreos-vagrant
  3. echo '$share_home=true' > config.rb
  4. vagrant up
  5. vagrant ssh -c "cd $PWD && mkdir hangme && cp -rv /usr/share/ hangme/"

Other Information

If the NFS mount options are tuned to be "soft" and timeout, I've seen the writing process timeout trying to close the file. Other work-arounds for similar osx-vagrant nfs permissions issues seem to have no effect -- ls -alR > /dev/null makes no difference.

The hang happens with older versions of VirtualBox and vagrant as well, but seemingly not with older versions of CoreOS.

The same steps above when run using the current stable branch completes successfully. This can be tried by adding a "3.5" step to the above as sed -i '' -e 's/alpha/stable/' Vagrantfile.

crawford commented 7 years ago

Thanks for the report. We could use some help narrowing down which version of Container Linux actually broke (that should make it easier to pinpoint what we changed that may have broken this). Are you able to walk through the Alpha releases and figure out which was the latest working and oldest broken release?

btalbot commented 7 years ago

Are you able to walk through the Alpha releases and figure out which was the latest working and oldest broken release?

Yes. On these images, the reproducer completes successfully.

What doesn't work? Anything after 1465 which seems to include both kernel 4.12 and ignition 0.17. The reproducer hangs during coping files to the share.

bgilbert commented 7 years ago

Tentatively assuming this is a kernel problem.

btalbot commented 7 years ago

This hang still occurs using 1520.1.0 (with kernel 4.13) released today.