Docker for Mac hangs constantly

ryanfb commented 7 years ago

Expected behavior

Docker for Mac doesn't hang.

Actual behavior

Docker for Mac hangs.

Information

Full output of the diagnostics from "Diagnose & Feedback" in the menu

Docker for Mac: version: 17.06.0-ce-mac17 (4cdec4294a50b2233146b09469b49937dabdebdd)
macOS: version 10.11.6 (build: 15G1421)
logs: /tmp/BED37CCE-B2F9-41B9-B9E6-72EFEBE30091/20170707-134513.tar.gz
failure: docker ps failed: (Failure "docker ps: timeout after 10.00s")
[OK]     db.git
[OK]     vmnetd
[OK]     dns
[OK]     driver.amd64-linux
[OK]     virtualization VT-X
[OK]     app
[OK]     moby
[OK]     system
[OK]     moby-syslog
[OK]     db
[OK]     env
[OK]     virtualization kern.hv_support
[OK]     slirp
[OK]     osxfs
[OK]     moby-console
[OK]     logs
[ERROR]  docker-cli
         docker ps failed
[OK]     menubar
[OK]     disk

Diagnostic ID: BED37CCE-B2F9-41B9-B9E6-72EFEBE30091

Steps to reproduce the behavior

Clone https://github.com/dcthree/dclp-docker
Run docker-compose up --force-recreate

Get output:

Creating network "dockercompose_default" with the default driver
Creating volume "dockercompose_repo" with default driver
Creating volume "dockercompose_maven" with default driver
Creating dockercompose_fuseki_1 ...
Creating dockercompose_xsugar_1 ...
Creating dockercompose_repo_clone_1 ...
Creating dockercompose_repo_clone_1
Creating dockercompose_xsugar_1
Creating dockercompose_xsugar_1 ... done
Creating dockercompose_navigator_1 ...
Creating dockercompose_sosol_1 ...
Creating dockercompose_navigator_1
Creating dockercompose_sosol_1 ... done
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

Docker for Mac hangs until I quit/restart it. I can't run e.g. docker ps or any other command that interacts with Docker for Mac.

I've tried using docker system prune, using "Reset" in the Docker for Mac GUI, increasing RAM/CPU allocations (currently 16GB/6 CPU), and upgrading Docker for Mac to edge. I still encounter this problem multiple times per day, and now regularly/reliably when trying to start this Compose file.
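To tell a hung daemon from a merely slow one, `docker ps` can be bounded with a time limit, much like the "timeout after 10.00s" check in the diagnostics above. macOS does not ship a `timeout` command by default, so this is a minimal sketch built on plain shell job control. The helper name `run_with_timeout` and the 10-second limit are assumptions for illustration, not part of Docker for Mac.

```shell
# Hypothetical helper (not part of Docker): run a command, but send it
# SIGTERM if it has not finished after the given number of seconds.
run_with_timeout() {
  local secs=$1
  shift
  "$@" &
  local cmd_pid=$!
  # Watchdog: sleeps, then terminates the command if it is still running.
  ( sleep "$secs"; kill -TERM "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
  local watchdog_pid=$!
  wait "$cmd_pid"
  local rc=$?
  # Stop the watchdog so it cannot fire after the command has exited.
  kill "$watchdog_pid" 2>/dev/null
  wait "$watchdog_pid" 2>/dev/null
  return "$rc"
}
```

Usage: `run_with_timeout 10 docker ps >/dev/null || echo "docker ps failed or did not answer within 10s"`. The helper returns the command's own exit status, or a non-zero status if the command had to be terminated.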

simonbh commented 7 years ago

I am also having this same issue. Diag 11EED9F3-181B-41BC-A99D-BEF7DDC1580E

bcully commented 7 years ago

Same issue. I have 4 CPUs/6GB RAM allocated.

ryanfb commented 7 years ago

Still seeing this non-stop. It's sufficient to run docker-compose up indexer after cloning/updating https://github.com/dcthree/dclp-docker to make the crash happen (since there are some unversioned config files for some of the other services).

mickaelperrin commented 7 years ago

Sadly, I can confirm that for the past few weeks Docker for Mac has been far less stable than it used to be.

It randomly crashes and only a mac reboot helps to bring back the service. Manually restarting the service doesn't work.

westover commented 7 years ago

Also having this issue. However I have noticed that starting a new Diagnose process tends to unblock the process. However I am using a custom container I am building. Restarting the app does help but you have to kill both the hyperkit process and the qcow-tool process. A53027AC-01D5-42AC-BBEB-2B7C58218846
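The manual recovery described above (killing both leftover processes before restarting the app) could be scripted roughly as follows. This is a sketch only: the process names `com.docker.hyperkit` and `qcow-tool` are taken from comments in this thread, the helper name and the `DRY_RUN` switch are made up for illustration, and `pkill -f` matches against the full command line, so it can hit unrelated processes whose arguments contain those strings.

```shell
# Process names as reported in this thread; adjust if yours differ.
DOCKER_VM_PROCS="com.docker.hyperkit qcow-tool"

# Hypothetical helper: terminate the leftover VM processes.
# With DRY_RUN=1 it only prints what it would do.
force_quit_docker_vm() {
  local name
  for name in $DOCKER_VM_PROCS; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would run: pkill -f $name"
    elif pkill -f "$name"; then
      echo "sent SIGTERM to $name"
    else
      echo "no process matched $name (or pkill failed)"
    fi
  done
}
```

Running `DRY_RUN=1 force_quit_docker_vm` first shows what would be signalled without touching anything.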

sudhagarc commented 6 years ago

Installed docker on new MBP and after mac woke up from sleep, I noticed this issue. I have not noticed this on my previous MBP.

Docker restart got hung.
Activity monitor showed hyperkit taking about 100% of cpu
Had to force quit both Docker and hyperkit to recover the situation

Docker for Mac: version: 17.09.0-ce-mac35 (69202b202f497d4b6e627c3370781b9e4b51ec78)
macOS: version 10.12.6 (build: 16G1036)
logs: /tmp/D06DD1FC-D953-476E-9381-80A47EB055D7/20171207-115540.tar.gz
failure: docker ps failed: (Failure "docker ps: timeout after 10.00s")
[OK]     db.git
[OK]     vmnetd
[OK]     dns
[OK]     driver.amd64-linux
[OK]     virtualization VT-X
[OK]     app
[OK]     moby
[OK]     system
[OK]     moby-syslog
[OK]     db
[OK]     env
[OK]     virtualization kern.hv_support
[OK]     slirp
[OK]     osxfs
[OK]     moby-console
[OK]     logs
[ERROR]  docker-cli
         docker ps failed
[OK]     menubar
[OK]     disk

Diagnostic ID: D06DD1FC-D953-476E-9381-80A47EB055D7

akimd commented 6 years ago

Please, try a more recent version of Docker for Mac.

ryanfb commented 6 years ago

@akimd I just tried again with the latest edge, 18.01.0-ce-mac48 (220004). The exact same thing still happens.

Docker for Mac: version: 18.01.0-ce-mac48 (d1778b704353fa5b79142a2055a2c11c8b48a653)
macOS: version 10.12.6 (build: 16G1114)
logs: /tmp/230E6503-3092-4DE1-BC76-47C03F92A4D5/20180119-121833.tar.gz
failure: docker ps failed: (Failure "docker ps: timeout after 10.00s")
[OK]     db.git
[OK]     vmnetd
[OK]     dns
[OK]     driver.amd64-linux
[OK]     virtualization VT-X
[OK]     app
[OK]     moby
[OK]     system
[OK]     moby-syslog
[OK]     kubernetes
[OK]     env
[OK]     virtualization kern.hv_support
[OK]     slirp
[OK]     osxfs
[OK]     moby-console
[OK]     logs
[ERROR]  docker-cli
         docker ps failed
[OK]     menubar
[OK]     disk

Diagnostic ID: 230E6503-3092-4DE1-BC76-47C03F92A4D5

bailaohe commented 6 years ago

I had the same issue with Docker 18.02-ce. After running for a while, some subcommands such as 'docker rmi' hung forever.

brymon68 commented 6 years ago

Having this same issue. Do we need to open another issue?

akimd commented 6 years ago

Hi guys,

No, there's no need for another issue, thanks! We need to understand what is going on here and fix it. Currently we're busy preparing the next releases, we will be back to this issue as soon as possible.

Thanks for your help!

One question though: are you running qcow or raw images? What does

$ ls -l ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.*

give?
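For anyone who wants a one-word answer instead of reading the `ls -l` output, here is a small sketch that reports which disk image formats are present. The helper name `docker_disk_backend` is hypothetical; the default directory is the one from the command above, and the file names `Docker.qcow2` and `Docker.raw` are the ones discussed in this thread.

```shell
# Hypothetical helper: report which disk image formats exist in a directory.
# Defaults to the Docker for Mac data directory quoted above.
docker_disk_backend() {
  local dir=${1:-"$HOME/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux"}
  local found=0 ext
  for ext in qcow2 raw; do
    if [ -e "$dir/Docker.$ext" ]; then
      echo "$ext"
      found=1
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "none"
    return 1
  fi
}
```

Calling `docker_disk_backend` with no argument checks the default location; passing a directory checks a relocated disk image instead.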

ryanfb commented 6 years ago

Thanks for reopening this issue!

For me, that command shows a Docker.qcow2 file. Should I try switching to raw images, and would that happen automatically (if and only if) I upgrade to High Sierra + APFS for my ~/Library volume?

akimd commented 6 years ago

You don't need to update to raw. As a matter of fact I was asking because we found raw disks to be less reliable so far, and I was wondering if it could be related to your problems.

No, we don't migrate from qcow2 to raw, it's only when starting anew that raw might be chosen.

iainbryson commented 6 years ago

+1 ... I'm having the same issue. Docker for Mac is a better experience than virtual box was, what without the intermediate VM and all, but this makes it torture. docker rmi has about a 50% chance of hanging, so when I run out of space it's an endless cycle of rmi, HANG, stop docker, start docker, rmi, rmi, HANG...

Of course, when it's hanging diagnostics doesn't work. But this is what it says when everything's working:

Docker for Mac: version: 17.12.0-ce-mac55 (18467c0ae7afb7a736e304f991ccc1a61d67a4ab)
macOS: version 10.13.3 (build: 17D102)
logs: /tmp/A25F9E34-E0DC-43CA-A70F-CFD8467AF87C/20180228-081116.tar.gz
[OK]     vpnkit
[OK]     vmnetd
[OK]     dns
[OK]     driver.amd64-linux
[OK]     app
[OK]     virtualization VT-X
[OK]     moby
[OK]     system
[OK]     moby-syslog
[OK]     kubernetes
[OK]     env
[OK]     virtualization kern.hv_support
[OK]     moby-console
[OK]     osxfs
[OK]     logs
[OK]     docker-cli
[OK]     disk

Diagnostic ID: A25F9E34-E0DC-43CA-A70F-CFD8467AF87C

(incidentally, rmi and rm do seem to be the main triggers; otherwise it's pretty solid.)

The ls command above gives:

iainbryson@Iains-MacBook-Pro (devel) $ ls -l ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.*
-rw-r--r--@ 1 iainbryson  staff  47927853056 Feb 28 08:15 /Users/iainbryson/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2

And this is an APFS volume, if that matters.

ryanfb commented 6 years ago

Still seeing this in 18.03.0-ce-rc1-mac54 (23022).

Usually when this happens, trying to restart or quit-then-start Docker via the Docker for Mac GUI will hang in the "Docker is starting" state, and I have to force quit the com.docker.hyperkit process then open Docker for Mac to get Docker into a usable state again.

Looking around for other solutions, I came across docker/compose#3633, which points (at the end) to moby/moby#35933. This may not be the same issue as what I'm experiencing since people there report that rolling back to 17.09 fixes the issue for them, while I was already experiencing this problem in 17.06. I don't believe that I'm seeing this due to tty, resource, or network issues either.

akimd commented 6 years ago

@ryanfb Thanks for the pointers, these issues are interesting.

I can't reproduce the behaviour. Could you submit a Diagnostic using the latest Edge (mac54 is good)?

ryanfb commented 6 years ago

Diagnostic ID: 230E6503-3092-4DE1-BC76-47C03F92A4D5

I've been trying to dig deeper into this over the last couple days - one thing I've been checking is what's going on inside Docker by using screen ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty and watching the kernel log messages when this behavior occurs. The first time I did this I was seeing NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [containerd:906]. Googling took me to #1950 where I tried disabling trim as suggested, but I still kept getting hangs, of the form:

[ 1660.441275] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1660.441744]  2-...: (1 GPs behind) idle=467/140000000000000/0 softirq=35675/35675 fqs=12882
[ 1660.442492]  (detected by 0, t=60142 jiffies, g=24859, c=24858, q=266)
[ 1660.443055] Task dump for CPU 2:
[ 1660.443329] scsi_eh_0       R  running task        0   284      2 0x00000008
[ 1660.443976]  0000000000000000 ffffffff955754cd 0000000000000000 ffff9b719da95000
[ 1660.444684]  ffffaf78c091be00 ffff9b719db68000 ffffaf78c091be78 ffff9b719dda44c0
[ 1660.445518]  0000000000000246 ffffffff9557588c ffff9b7193df1218 ffff9b719d02a658
[ 1660.446342] Call Trace:
[ 1660.446548]  [<ffffffff955754cd>] ? ata_scsi_port_error_handler+0x228/0x544
[ 1660.447211]  [<ffffffff9557588c>] ? ata_scsi_error+0xa3/0xdb
[ 1660.447678]  [<ffffffff9554526d>] ? scsi_error_handler+0xaf/0x472
[ 1660.448198]  [<ffffffff950fe5b3>] ? finish_task_switch+0x115/0x18b
[ 1660.448813]  [<ffffffff957f53e3>] ? __schedule+0x36c/0x465
[ 1660.449383]  [<ffffffff955451be>] ? scsi_eh_get_sense+0xdd/0xdd
[ 1660.449974]  [<ffffffff950f7b56>] ? kthread+0xb4/0xbc
[ 1660.450446]  [<ffffffff950f7aa2>] ? init_completion+0x1d/0x1d
[ 1660.450937]  [<ffffffff957f8261>] ? ret_from_fork+0x41/0x50

Since then I've done a few factory resets - after the first of these, I could no longer disable trim per the instructions in this comment, since ~/Library/Containers/com.docker.docker/Data/database/ no longer existed. When Docker hangs I'm still seeing INFO: rcu_sched detected stalls on CPUs/tasks with a similar backtrace, and/or e.g. NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [containerd:906], as in this screenshot:

[screenshot: igi2em83]

febeling commented 6 years ago

On my machine docker --version reliably takes about 30s.

$ time docker --version
Docker version 17.12.0-ce, build c97c6d6
docker --version  0.01s user 0.01s system 0% cpu 30.035 total

eexit commented 6 years ago

Hello,

I'm also having many issues with the latest Docker version. It takes about 2 min to start although only 1 small dnsmasq service is configured to start along with Docker.

When waking up the Mac, docker-compose times out when stopping or restarting a stack. The only "fix" is to restart Docker, which is another 3 min wait...

ryanfb commented 6 years ago

Almost by accident, I think I've discovered a workaround for my particular case. I decided to try running the same docker-compose setup on my (relatively under-resourced) MacBook Pro instead of my iMac, to see what happened with the latest Docker edge (previously I had encountered the same behavior on both machines). Since the internal (SSD) drive on the MBP was running out of space, I decided to try moving the Docker disk image location (the Docker.qcow2 file) to an external USB drive, using the Docker for Mac UI. Miraculously, I was able to do everything I needed to do without Docker crashing or becoming completely unresponsive.

This gave me the idea to try the same thing on my iMac today - and after moving the Docker disk image off the internal 3TB Fusion Drive to an external USB drive, I seem to be able to do everything I need to do without having Docker crash or become completely unresponsive.

All drives (internal and external) are formatted Mac OS Extended Journaled (case-insensitive), and the internal drives report a S.M.A.R.T. status of "Verified" with no other programs appearing to have issues using them.

Perhaps this is consistent with the ATA/SCSI errors causing a CPU/task stall in my logs above, though I'm not sure what the root cause or error is. The thing consistent across both machines is that the problematic drive for Docker is an internal SSD or Fusion Drive.

rn commented 6 years ago

@ryanfb thanks for the logs above. I extracted the logs from the diagnostics and, for now, just add them here for completeness as there actually were some relevant error messages before the hung task messages:

[  446.797918] br-69957a06e2c6: port 4(veth8a39e2d) entered blocking state
[  446.798583] br-69957a06e2c6: port 4(veth8a39e2d) entered forwarding state
[ 1051.734506] ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
[ 1051.735271] ata1.00: cmd 61/00:00:00:04:51/01:00:01:00:00/40 tag 0 ncq dma 131072 out
[ 1051.735271]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.736914] ata1.00: cmd 61/00:08:00:05:51/01:00:01:00:00/40 tag 1 ncq dma 131072 out
[ 1051.736914]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.738570] ata1.00: cmd 61/00:10:00:06:51/01:00:01:00:00/40 tag 2 ncq dma 131072 out
[ 1051.738570]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.740175] ata1.00: cmd 61/00:18:00:07:51/01:00:01:00:00/40 tag 3 ncq dma 131072 out
[ 1051.740175]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.741829] ata1.00: cmd 61/00:20:00:08:51/01:00:01:00:00/40 tag 4 ncq dma 131072 out
[ 1051.741829]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.743430] ata1.00: cmd 61/00:28:00:09:51/01:00:01:00:00/40 tag 5 ncq dma 131072 out
[ 1051.743430]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.745077] ata1.00: cmd 61/00:30:00:0a:51/01:00:01:00:00/40 tag 6 ncq dma 131072 out
[ 1051.745077]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.746778] ata1.00: cmd 61/00:38:00:0b:51/01:00:01:00:00/40 tag 7 ncq dma 131072 out
[ 1051.746778]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.748412] ata1.00: cmd 61/00:40:00:0c:51/01:00:01:00:00/40 tag 8 ncq dma 131072 out
[ 1051.748412]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.750128] ata1.00: cmd 61/00:48:00:0d:51/01:00:01:00:00/40 tag 9 ncq dma 131072 out
[ 1051.750128]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.751764] ata1.00: cmd 61/00:50:00:0e:51/01:00:01:00:00/40 tag 10 ncq dma 131072 out
[ 1051.751764]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.753277] ata1.00: cmd 61/00:58:00:0f:51/01:00:01:00:00/40 tag 11 ncq dma 131072 out
[ 1051.753277]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.755023] ata1.00: cmd 61/00:60:00:10:51/01:00:01:00:00/40 tag 12 ncq dma 131072 out
[ 1051.755023]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.756799] ata1.00: cmd 61/00:68:00:11:51/01:00:01:00:00/40 tag 13 ncq dma 131072 out
[ 1051.756799]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.758374] ata1.00: cmd 61/00:70:00:f3:50/01:00:01:00:00/40 tag 14 ncq dma 131072 out
[ 1051.758374]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.760058] ata1.00: cmd 61/00:78:00:f4:50/01:00:01:00:00/40 tag 15 ncq dma 131072 out
[ 1051.760058]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.761975] ata1.00: cmd 61/00:80:00:f5:50/01:00:01:00:00/40 tag 16 ncq dma 131072 out
[ 1051.761975]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.763722] ata1.00: cmd 61/00:88:00:f6:50/01:00:01:00:00/40 tag 17 ncq dma 131072 out
[ 1051.763722]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.765457] ata1.00: cmd 61/00:90:00:f7:50/01:00:01:00:00/40 tag 18 ncq dma 131072 out
[ 1051.765457]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.767301] ata1.00: cmd 61/00:98:00:f8:50/01:00:01:00:00/40 tag 19 ncq dma 131072 out
[ 1051.767301]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.769089] ata1.00: cmd 61/00:a0:00:f9:50/01:00:01:00:00/40 tag 20 ncq dma 131072 out
[ 1051.769089]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.770706] ata1.00: cmd 61/00:a8:00:fa:50/01:00:01:00:00/40 tag 21 ncq dma 131072 out
[ 1051.770706]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.772230] ata1.00: cmd 61/00:b0:00:fb:50/01:00:01:00:00/40 tag 22 ncq dma 131072 out
[ 1051.772230]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.773659] ata1.00: cmd 61/00:b8:00:fc:50/01:00:01:00:00/40 tag 23 ncq dma 131072 out
[ 1051.773659]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.775117] ata1.00: cmd 61/00:c0:00:fd:50/01:00:01:00:00/40 tag 24 ncq dma 131072 out
[ 1051.775117]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.776791] ata1.00: cmd 61/00:c8:00:fe:50/01:00:01:00:00/40 tag 25 ncq dma 131072 out
[ 1051.776791]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.778550] ata1.00: cmd 61/00:d0:00:ff:50/01:00:01:00:00/40 tag 26 ncq dma 131072 out
[ 1051.778550]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.780251] ata1.00: cmd 61/00:d8:00:00:51/01:00:01:00:00/40 tag 27 ncq dma 131072 out
[ 1051.780251]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.781892] ata1.00: cmd 61/00:e0:00:01:51/01:00:01:00:00/40 tag 28 ncq dma 131072 out
[ 1051.781892]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.783683] ata1.00: cmd 61/00:e8:00:02:51/01:00:01:00:00/40 tag 29 ncq dma 131072 out
[ 1051.783683]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.785207] ata1.00: cmd 61/00:f0:00:03:51/01:00:01:00:00/40 tag 30 ncq dma 131072 out
[ 1051.785207]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1051.786715] ata1: hard resetting link
[ 1119.329182] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1119.329707]  2-...: (1 GPs behind) idle=467/140000000000000/0 softirq=35675/35675 fqs=1572
[ 1119.330499]  (detected by 1, t=6003 jiffies, g=24859, c=24858, q=45)
[ 1119.331166] Task dump for CPU 2:
[ 1119.331517] scsi_eh_0       R  running task        0   284      2 0x00000008
[ 1119.332222]  0000000000000000 ffffffff955754cd 0000000000000000 ffff9b719da95000
[ 1119.333034]  ffffaf78c091be00 ffff9b719db68000 ffffaf78c091be78 ffff9b719dda44c0
[ 1119.333808]  0000000000000246 ffffffff9557588c ffff9b7193df1218 ffff9b719d02a658
[ 1119.334685] Call Trace:
[ 1119.334915]  [<ffffffff955754cd>] ? ata_scsi_port_error_handler+0x228/0x544
[ 1119.335638]  [<ffffffff9557588c>] ? ata_scsi_error+0xa3/0xdb
[ 1119.336229]  [<ffffffff9554526d>] ? scsi_error_handler+0xaf/0x472
[ 1119.336844]  [<ffffffff950fe5b3>] ? finish_task_switch+0x115/0x18b
[ 1119.337453]  [<ffffffff957f53e3>] ? __schedule+0x36c/0x465
[ 1119.338027]  [<ffffffff955451be>] ? scsi_eh_get_sense+0xdd/0xdd
[ 1119.338591]  [<ffffffff950f7b56>] ? kthread+0xb4/0xbc
[ 1119.339057]  [<ffffffff950f7aa2>] ? init_completion+0x1d/0x1d
[ 1119.339734]  [<ffffffff957f8261>] ? ret_from_fork+0x41/0x50
[ 1300.536129] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1300.536665]  2-...: (1 GPs behind) idle=467/140000000000000/0 softirq=35675/35675 fqs=8009
[ 1300.537446]  (detected by 1, t=24133 jiffies, g=24859, c=24858, q=122)
[ 1300.538090] Task dump for CPU 2:
[ 1300.538404] scsi_eh_0       R  running task        0   284      2 0x00000008
[ 1300.539118]  0000000000000000 ffffffff955754cd 0000000000000000 ffff9b719da95000
[ 1300.540048]  ffffaf78c091be00 ffff9b719db68000 ffffaf78c091be78 ffff9b719dda44c0
[ 1300.540880]  0000000000000246 ffffffff9557588c ffff9b7193df1218 ffff9b719d02a658
[ 1300.541617] Call Trace:
[ 1300.541863]  [<ffffffff955754cd>] ? ata_scsi_port_error_handler+0x228/0x544
[ 1300.542524]  [<ffffffff9557588c>] ? ata_scsi_error+0xa3/0xdb
[ 1300.543058]  [<ffffffff9554526d>] ? scsi_error_handler+0xaf/0x472
[ 1300.543643]  [<ffffffff950fe5b3>] ? finish_task_switch+0x115/0x18b
[ 1300.544317]  [<ffffffff957f53e3>] ? __schedule+0x36c/0x465
[ 1300.544813]  [<ffffffff955451be>] ? scsi_eh_get_sense+0xdd/0xdd
[ 1300.545376]  [<ffffffff950f7b56>] ? kthread+0xb4/0xbc
[ 1300.545888]  [<ffffffff950f7aa2>] ? init_completion+0x1d/0x1d
[ 1300.546431]  [<ffffffff957f8261>] ? ret_from_fork+0x41/0x50

So this indicates some hang/timeout in the hyperkit blockdevice layer (or further down in the qcow2 code). We had related issues in the past (see https://github.com/moby/hyperkit/issues/94).
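To check whether another machine shows the same pattern, the symptom lines from the log above can be counted in a saved copy of the VM kernel log. This is a rough sketch: the helper name `summarize_vm_log` is made up, and it assumes the log has already been saved to a plain-text file (for example, extracted from a diagnostics tarball). The search strings are copied from the messages quoted in this thread.

```shell
# Hypothetical helper: count the symptom lines discussed in this thread
# in a saved copy of the VM console or kernel log.
summarize_vm_log() {
  local log=$1
  printf 'ata_timeouts=%s\n' "$(grep -cF 'Emask 0x4 (timeout)' "$log")"
  printf 'link_resets=%s\n'  "$(grep -cF 'hard resetting link' "$log")"
  printf 'rcu_stalls=%s\n'   "$(grep -cF 'rcu_sched detected stalls' "$log")"
  printf 'soft_lockups=%s\n' "$(grep -cF 'BUG: soft lockup' "$log")"
}
```

ATA timeouts followed by a link reset and then RCU stalls would match the sequence seen here; soft lockups with no ATA errors may point to a different problem.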

@ryanfb you mentioned that your MBP is relatively under-resourced and may have been close to running out of disk space, but you also said your iMac has a 3TB drive and you had the same issue there. Is the drive in the iMac also close to full?

I've not been able to repro this locally with the two cases you mentioned above.

ryanfb commented 6 years ago

The 3TB drive in the iMac has/had about 400GB+ free when I was encountering this problem. In the Docker for Mac UI the disk image was sized to ~200GB with ~20GB used on disk. The MBP is under-resourced in terms of memory/CPU - only 8GB (vs. 32GB in the iMac) and a slower/older CPU.

raliste commented 6 years ago

+1 Couldn't even get a diagnostic.

akimd commented 6 years ago

Can someone reproduce it with 18.03 (stable or edge) and post diagnostics please?

rclarkburns commented 6 years ago

@akimd Diagnostic ID: 3AE6953B-DEE7-4441-A4C9-11E944ECB248

Docker for Mac: version: 18.03.0-ce-mac59 (dd2831d4b7421cf559a0881cc7a5fdebeb8c2b98)
macOS: version 10.12.6 (build: 16G1212)
logs: /tmp/3AE6953B-DEE7-4441-A4C9-11E944ECB248/20180328-122732.tar.gz
failure: docker ps failed: (Failure "docker ps: timeout after 10.00s")
[OK]     vpnkit
[OK]     vmnetd
[OK]     dns
[OK]     driver.amd64-linux
[OK]     virtualization VT-X
[OK]     app
[OK]     moby
[OK]     system
[OK]     moby-syslog
[OK]     kubernetes
[OK]     files
[OK]     env
[OK]     virtualization kern.hv_support
[OK]     osxfs
[OK]     moby-console
[OK]     logs
[ERROR]  docker-cli
         docker ps failed
[OK]     disk

febeling commented 6 years ago

FWIW, here's another one:

Docker for Mac: version: 18.03.0-ce-mac59 (dd2831d4b7421cf559a0881cc7a5fdebeb8c2b98)
macOS: version 10.13.3 (build: 17D102)
logs: /tmp/CF6CE828-01B4-4174-ADBE-A88C9730F835/20180328-193859.tar.gz
[OK]     vpnkit
[OK]     vmnetd
[OK]     dns
[OK]     driver.amd64-linux
[OK]     virtualization VT-X
[OK]     app
[OK]     moby
[OK]     system
[OK]     moby-syslog
[OK]     kubernetes
[OK]     files
[OK]     env
[OK]     virtualization kern.hv_support
[OK]     osxfs
[OK]     moby-console
[OK]     logs
[OK]     docker-cli
[OK]     disk

Diagnostic ID: CF6CE828-01B4-4174-ADBE-A88C9730F835

One thing I noticed is that the delay is always just a few milliseconds over 30s. My hunch is there's some timeout of 30s which blocks execution, before the actual command is performed.
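One way to test the fixed-timeout hunch is to time the same command several times and see whether the result stays pinned near 30 seconds. A minimal sketch, assuming whole-second resolution is enough; the helper name `elapsed_seconds` is hypothetical.

```shell
# Hypothetical helper: print how many whole seconds a command took.
# The command's own output and exit status are discarded.
elapsed_seconds() {
  local start end
  start=$(date +%s)
  "$@" >/dev/null 2>&1 || true
  end=$(date +%s)
  echo $((end - start))
}
```

Usage: `for i in 1 2 3; do elapsed_seconds docker --version; done`. Three results that all sit at 30 would be consistent with a fixed timeout rather than a variable load problem.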

rclarkburns commented 6 years ago

@ryanfb Curious to know if your disk is encrypted.

I also tried switching to an external (Mac OS Extended (Journaled)) volume on a USB drive and although it is much slower I'm no longer having the same issue as before. The main volume on my Macbook Pro is Mac OS Extended (Journaled, Encrypted). Perhaps using an encrypted volume has something to do with this?

eexit commented 6 years ago

My disk is encrypted as well... Putting the storage onto another drive is definitely not an option for me or my company because we'd lose the interest of having our HDs encrypted 😞

rclarkburns commented 6 years ago

I'm going to create another partition (without encryption) on my internal SSD and see if that works as well.

rclarkburns commented 6 years ago

@eexit Do you really need encryption? I'm sure your company will be just fine. 😆 If it is related to encryption hopefully identifying this as a common factor will expedite a resolution.

eexit commented 6 years ago

@rclarkburns Unfortunately, I don't have to choose... it's a company-wide policy =/

rclarkburns commented 6 years ago

Sorry @eexit if my comment came across as serious. It was just a silly attempt at a joke. Encryption is important!

ryanfb commented 6 years ago

@rclarkburns Great question! It turns out, in both cases the external drive wasn't encrypted, but the internal drive was. For some reason I had originally thought the external drive on the iMac was encrypted, so I hadn't thought about that as a potential common factor.

rclarkburns commented 6 years ago

I have since set up another partition on my internal drive that is not encrypted but experienced the same issue. At this point I'm not convinced it's related to the encryption. I guess to fully rule it out it would be interesting to test an external drive that is encrypted. For now I'm using docker-machine with Virtualbox to move forward on the project I'm currently working on.

ryanfb commented 6 years ago

I'm going to try testing with an encrypted external drive on one or both machines - I'll update later with the results.

ryanfb commented 6 years ago

Update: I'm still able to run my problematic docker-compose from scratch (i.e. all volumes/containers removed before trying) on both machines after moving the Docker.qcow2 file to an encrypted external drive.

sergiusnick commented 6 years ago

Docker for Mac: version: 18.03.0-ce-mac59 (dd2831d4b7421cf559a0881cc7a5fdebeb8c2b98)
macOS: version 10.12.6 (build: 16G1212)
logs: /tmp/5204E20A-6678-43C5-871E-F4D5CA77CF8C/20180402-151111.tar.gz
failure: docker ps failed: (Failure "docker ps: timeout after 10.00s")
[OK]     vpnkit
[OK]     vmnetd
[OK]     dns
[OK]     driver.amd64-linux
[OK]     virtualization VT-X
[OK]     app
[OK]     moby
[OK]     system
[OK]     moby-syslog
[OK]     kubernetes
[OK]     files
[OK]     env
[OK]     virtualization kern.hv_support
[OK]     osxfs
[OK]     moby-console
[OK]     logs
[ERROR]  docker-cli
         docker ps failed
[OK]     disk

One more DIAGNOSTIC ID: 5204E20A-6678-43C5-871E-F4D5CA77CF8C

rclarkburns commented 6 years ago

Thanks for the update @ryanfb. Sounds like encryption is not the issue, but using external volumes consistently resolves the issue regardless of whether they are encrypted. I should also note that when I tested with a separate internal partition that wasn't encrypted, I used several different formats (Mac OS Extended (Journaled), Mac OS Extended (Case-sensitive, Journaled), and MS-DOS (FAT)) but it didn't resolve the issue.

stringfellow commented 6 years ago

This seems to happen to me after deleting old images, everything just stops working. Even after a reboot, Docker for Mac fails to start up and then on quitting, the com.docker.hyperkit process hangs around consuming CPU. [Edit: after a few attempts at rebooting and stopping all docker processes in Activity Monitor, I managed to replace the app with the latest from the docker shop]

scottfc commented 6 years ago

Just experienced what @stringfellow mentions after docker system prune (Total reclaimed space: 22.64GB, unsure of the amount of images...) with Experimental features on (not sure if it matters). Docker seems to be rebuilding something, or checking something? It eventually gets past whatever it is doing and starts normally.

ndevenish commented 6 years ago

I'm also having hanging issues since updating to 18.03.0-ce (from 17.something). Mainly seems to happen at the end of longer builds, though sometimes just when coming back to use docker after a while. Quitting/restarting docker seems to be the only solution, though it happens quickly again. I've run a system prune which cleared up 30gb or so, but this keeps reoccurring.

I don't have an encrypted drive. MacBookPro14,2 (i7 Macbook Pro 2017, 13 inch).

Diagnostics aren't very useful; "docker ps failed" just tells me what I already know:

Docker for Mac: version: 18.03.0-ce-mac60 (dd2831d4b7421cf559a0881cc7a5fdebeb8c2b98)
macOS: version 10.13.1 (build: 17B1003)
logs: /tmp/8A3BB2B3-3CCA-403F-86B5-EEB38CCD73DE/20180414-145206.tar.gz
failure: docker ps failed: (Failure "docker ps: timeout after 10.00s")
[OK]     vpnkit
[OK]     vmnetd
[OK]     dns
[OK]     driver.amd64-linux
[OK]     virtualization VT-X
[OK]     app
[OK]     moby
[OK]     system
[OK]     moby-syslog
[OK]     kubernetes
[OK]     files
[OK]     env
[OK]     virtualization kern.hv_support
[OK]     osxfs
[OK]     moby-console
[OK]     logs
[ERROR]  docker-cli
         docker ps failed
[OK]     disk

rn commented 6 years ago

@ryanfb and possibly also @ndevenish: I see disk-related kernel crashes/timeouts in your diagnostics. If you are on macOS 10.13.4, could you try switching to raw disks? You can do that by editing:

~/Library/Group\ Containers/group.com.docker/settings.json

If you change the setting for diskPath from Docker.qcow2 to Docker.raw and then restart, you should be using a different disk backend. The Docker.raw disk must be on an APFS-backed disk as it uses sparse files, and you must be on macOS 10.13.4, as 10.13.3 seems to have an APFS bug with sparse files resulting in occasional disk corruption.

Note: if you change the disks you will "lose" all your containers in your current Docker instance, so you'll have to download them again.
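The settings edit described above can also be done from a terminal. The following is a sketch only: the helper name `switch_disk_to_raw` is hypothetical, and it assumes `settings.json` stores `"diskPath"` as a string ending in `Docker.qcow2`, which is inferred from the description rather than checked against a real file. It keeps a `.bak` copy so the change can be reverted.

```shell
# Hypothetical helper: point diskPath at Docker.raw instead of Docker.qcow2.
# Keeps a backup copy (settings.json.bak) next to the original file.
switch_disk_to_raw() {
  local settings=${1:-"$HOME/Library/Group Containers/group.com.docker/settings.json"}
  cp "$settings" "$settings.bak" || return 1
  # Only rewrite the line that holds the diskPath key.
  sed '/"diskPath"/ s/Docker\.qcow2/Docker.raw/' "$settings.bak" > "$settings"
}
```

After running it, restart Docker as described above. To undo the change, copy `settings.json.bak` back over `settings.json`.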

ndevenish commented 6 years ago

@rn I've actually just noticed that I'm behind OS version, on 10.13.1. I'll update and see if I can reproduce, then see if stepping to raw helps. Presumably I also need to backup my volumes for this...

rn commented 6 years ago

I don't think the OS version matters for the qcow2 backend. The reason to test the raw backend on APFS is that, at least from @ryanfb's kernel logs, the kernel reports some ATA timeouts. These could happen when we trim (i.e. try to reclaim space), and the QCOW2 code for that is obviously radically different from the raw backend on APFS with sparse files.

So trying a different backend may help us to isolate if the issue is on HyperKit, the VM or the QCOW2 storage backend.

zan-xhipe commented 6 years ago

I'm also experiencing this. My Docker version is 18.03.0-ce, macOS version 10.13.4. I have tried both qcow2 and raw with the same result. I have also tried resetting to factory defaults. Here is the Diagnostic ID while using the raw backend: 61FF7D88-DFA5-46D4-9F1A-200951534E77. Everything was working fine until today.

morkov commented 6 years ago

Same issue last 3 days. Version 18.03.1-ce-mac64 (24245), Mac OS 10.12.6.

rn commented 6 years ago

@ryanfb since you saw the SATA issues I pasted above, could you try a new build of hyperkit from https://565-55985023-gh.circle-artifacts.com/0/hyperkit? You need to make this executable and copy it to /Applications/Docker.app/Contents/Resources/bin/com.docker.hyperkit (I would save the original first, though) and then restart.
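The swap described above (save the original, install the new build, make it executable) might look like this as a shell function. It is a sketch under assumptions: the helper name `install_hyperkit_build` and the `.orig` backup suffix are made up, the downloaded file is assumed to be on disk already, and writing into the application bundle may require suitable permissions.

```shell
# Hypothetical helper: back up the bundled hyperkit binary once, then
# install a replacement build in its place and mark it executable.
install_hyperkit_build() {
  local new=$1
  local target=${2:-/Applications/Docker.app/Contents/Resources/bin/com.docker.hyperkit}
  [ -f "$new" ] || { echo "no such file: $new" >&2; return 1; }
  # Save the original only on the first run so it is never overwritten.
  if [ ! -e "$target.orig" ]; then
    cp -p "$target" "$target.orig" || return 1
  fi
  cp "$new" "$target" && chmod +x "$target"
}
```

Usage: `install_hyperkit_build ~/Downloads/hyperkit`, then restart Docker. Copying `com.docker.hyperkit.orig` back over the target restores the shipped binary.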

This version of HyperKit has a patch to the AHCI controller from upstream bhyve backported, so may be related.

The encrypted drive thing may also be a clue, as it might cause things to take longer than the ATA driver in Linux is willing to tolerate (not an ATA expert...).

ryanfb commented 6 years ago

@rn Thanks, I will give the new hyperkit build a try and report back. Trying with the raw backend isn't as easy for me as I haven't upgraded any of my machines to High Sierra yet and have some qualms about migrating to APFS.

rn commented 6 years ago

@ryanfb thanks, and understood re upgrade. We certainly had some filesystem corruption issues with sparse files on APFS prior to 10.13.4. Apple seem to have fixed all the known issues with that though (of course without any mention in the changelog).

In any case, qcow2 would also be interesting, though it has more moving parts, hence harder to diagnose.

sparecycles commented 6 years ago

So, not sure if my problem is anyone else's problem, but the docker daemon appears to hang if there are recursive (?) folder symlinks in the build context. (In my case, from bootstrapped lerna packages).

Making sure to exclude those paths in my .dockerignore file helped.
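To find candidates for `.dockerignore`, the build context can be scanned for symlinks that point at directories. A minimal sketch, assuming the context is a local directory; the helper name `list_symlinked_dirs` is hypothetical. `find -type l` does not follow the links it lists, so the scan itself does not recurse through them.

```shell
# Hypothetical helper: print symlinks inside a build context that resolve
# to directories, as paths relative to the context root.
list_symlinked_dirs() {
  local ctx=${1:-.}
  local link
  find "$ctx" -type l | while IFS= read -r link; do
    # -d follows the link, so this is true only for links to directories.
    if [ -d "$link" ]; then
      printf '%s\n' "${link#"$ctx"/}"
    fi
  done
}
```

Usage: review the output of `list_symlinked_dirs .` and add the relevant paths to `.dockerignore` by hand, since not every symlinked directory needs to be excluded.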

Docker version 18.03.1-ce, build 9ee9f40 / OSX 10.12.6
