docker / for-mac

Bug reports for Docker Desktop for Mac
https://www.docker.com/products/docker#/mac

Docker's large default NFS rsize/wsize hanging container when accessing >~200kb files in Ventura #6544

Open karlshea opened 2 years ago

karlshea commented 2 years ago

Expected behavior

Everything works normally.

Actual behavior

The container hangs. If caught within a second or two, ^C can quit the container when it's running in the foreground (docker compose up). Otherwise Docker itself can hang to the point where the process needs to be killed. nfsd send error 40 also appears in the macOS Console.

Information

Output of /Applications/Docker.app/Contents/MacOS/com.docker.diagnose check

Starting diagnostics

[PASS] DD0027: is there available disk space on the host?
[PASS] DD0028: is there available VM disk space?
[PASS] DD0018: does the host support virtualization?
[PASS] DD0001: is the application running?
[PASS] DD0017: can a VM be started?
[PASS] DD0016: is the LinuxKit VM running?
[PASS] DD0011: are the LinuxKit services running?
[PASS] DD0004: is the Docker engine running?
[PASS] DD0015: are the binary symlinks installed?
[PASS] DD0031: does the Docker API work?
[PASS] DD0013: is the $PATH ok?
[PASS] DD0003: is the Docker CLI working?
[PASS] DD0014: are the backend processes running?
[PASS] DD0007: is the backend responding?
[PASS] DD0008: is the native API responding?
[PASS] DD0009: is the vpnkit API responding?
[PASS] DD0010: is the Docker API proxy responding?
[PASS] DD0012: is the VM networking working?
[SKIP] DD0030: is the image access management authorized?
[PASS] DD0019: is the com.docker.vmnetd process responding?
[PASS] DD0033: does the host have Internet access?
[PASS] DD0018: does the host support virtualization?
[PASS] DD0001: is the application running?
[PASS] DD0017: can a VM be started?
[PASS] DD0016: is the LinuxKit VM running?
[PASS] DD0011: are the LinuxKit services running?
[PASS] DD0004: is the Docker engine running?
[PASS] DD0015: are the binary symlinks installed?
[PASS] DD0031: does the Docker API work?
[PASS] DD0032: do Docker networks overlap with host IPs?
segment 2022/10/28 11:24:45 ERROR: sending request - Post "https://api.segment.io/v1/batch": dial tcp: lookup api.segment.io: no such host
segment 2022/10/28 11:24:45 ERROR: 1 messages dropped because they failed to be sent and the client was closed
No fatal errors detected.

(Pi-hole is blocking api.segment.io)

Steps to reproduce the behavior

  1. Clone https://github.com/karlshea/docker-nfs
  2. Fix your path in docker-compose.yaml
  3. Follow the tests in that repo's README.md
  4. Another user reports a hang when iterating over a directory with a large number of files, or when using cat/md5sum: https://github.com/drud/ddev/issues/4122#issuecomment-1294862469

Plain NFS access from another Mac seems to work normally.

Related issue: drud/ddev#4122

tsrivishnu commented 1 year ago

Facing the same issue with a Ruby on Rails project. Moving away from the NFS mount removes the problem, but we need NFS for the project to work efficiently.

noud-github commented 1 year ago

FYI: adding ,wsize=32768,rsize=3276 to our docker NFS mount options seems to fix this issue.


nfsmount_xdebug:
    driver: local
    driver_opts:
      type: nfs
      o: addr=host.docker.internal,rw,nolock,hard,nointr,nfsvers=3,wsize=32768,rsize=3276
      device: ":${PWD}/xdebug"

edit:
Unexpectedly, the protocol maximum also seems to work: wsize=65536,rsize=65536

ryanchapman commented 1 year ago

Following @noud-github's suggestion, adding wsize and rsize options fixes the issue for me as well.

Although I went with ,wsize=32768,rsize=32768 (versus 3276 for rsize).
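
For anyone not using Compose, the same volume can probably be created straight from the CLI. This is only a sketch: the option string is copied from the Compose snippet above, the volume name and host path are placeholders, and the function names nfs_opts and create_nfs_volume are made up for illustration. Both sizes come from one argument so they cannot drift apart.

```shell
# Build the NFS option string, using one size for both rsize and wsize.
# The options mirror the Compose example in this thread; 32768 is just the
# default picked for this sketch.
nfs_opts() {
  size=${1:-32768}
  printf 'addr=host.docker.internal,rw,nolock,hard,nointr,nfsvers=3,rsize=%s,wsize=%s\n' "$size" "$size"
}

# Create a named local volume backed by NFS.
# $1 = volume name (placeholder), $2 = exported host path (placeholder),
# $3 = optional rsize/wsize value.
create_nfs_volume() {
  name=$1
  path=$2
  size=${3:-32768}
  docker volume create --driver local \
    --opt type=nfs \
    --opt "o=$(nfs_opts "$size")" \
    --opt "device=:$path" \
    "$name"
}
```

Example call, with a made-up name and path: create_nfs_volume nfsmount_xdebug "$PWD/xdebug" 65536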

karlshea commented 1 year ago

I think wsize/rsize might just be masking the problem. Trying an md5sum on a 100MB zip file through Docker still hangs, while it succeeds on a normal NFS mount on another Mac.
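
A rough way to run this check without locking up the terminal is to wrap md5sum in a time limit. This sketch assumes the container has the timeout utility (GNU coreutils or BusyBox). The function name read_check and the 30 second default are arbitrary. A reader that is stuck inside the kernel may not react to the default signal, so treat a timeout as a hint, not proof.

```shell
# Return 0 if the file can be read in full within the time limit,
# non-zero if md5sum fails or the limit is hit.
# $1 = file on the NFS mount, $2 = limit in seconds (default 30, arbitrary).
read_check() {
  file=$1
  secs=${2:-30}
  timeout "$secs" md5sum "$file" >/dev/null
}
```

Example, inside the container: read_check /path/on/nfs/100MB.zip 60 && echo "read ok" || echo "read failed or timed out"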

noud-github commented 1 year ago

@karlshea Using wsize=65536,rsize=65536 I can run md5sum 100MB.zip on that NFS share in Docker. Did you remove the NFS volume (shown by docker volume list) after editing the options? If you don't (or forget, like I did the first time), the new settings are not applied.

karlshea commented 1 year ago

@noud-github You're right, I didn't! wsize=32768,rsize=32768 does indeed fix it for me. It looks like 32768 is the default for recent distros, so I'm curious what the Docker driver is using instead.

I still believe this is covering up a deeper issue (why are smaller values breaking?), but at least it's fixing the immediate problem.

noud-github commented 1 year ago

@karlshea I think you are right that this is just covering up a deeper issue. The fact that this is not an issue between two systems is what pointed me toward this solution in the first place. When you connect to another system, you go through two "real" network stacks, including their buffer sizes and so on. When you run everything on your local system, you use the "local interface", and this is not the first time I have seen "unexpected" behavior when only the local interface is involved. So my guess is that Ventura has some bug (or feature) in the local interface stack that is triggered when wsize and rsize are not set for NFS.

karlshea commented 1 year ago

macOS defaults

macOS defaults (from man mount_nfs) are 8192 for UDP mounts and 32768 for TCP mounts.

Additional notes from wsize param: "Note that both the rsize and wsize options should only be used as a last ditch effort at improving performance when mounting servers that do not support TCP mounts."

nfsstat -m for a Mac-to-Mac NFS mount using all default options (mount -t nfs server-mac:/server-path directory):

General mount flags: 0x4000018 nodev,nosuid,multilabel
NFS parameters: vers=3,tcp,port=2049,nomntudp,hard,nointr,noresvport,negnamecache,callumnt,locks,quota,rsize=32768,wsize=32768,readahead=16,dsize=32768,rdirplus,nodumbtimer,timeo=10,maxgroups=16,acregmin=5,acregmax=60,acdirmin=5,acdirmax=60,nomutejukebox,nonfc,sec=sys

Docker defaults

I tried to find defaults by mounting with no options other than addr:

volumes:
  nfsmount-repo:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=host.docker.internal"
      device: ":/Users/karl/Sites/nfs-test"

Then got into the Docker VM using justincormack/nsenter1 and ran mount:

/Users/karl/Sites/nfs-test on /var/lib/docker/volumes/nfs-test_nfsmount-repo/_data type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.65.2,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.65.2)

Which is pretty strange, since our fixes actually seem to be setting the values lower. man mount_nfs says

The default read and write sizes are 8K when using UDP, and 32K when using TCP. Values over 16K are only supported for TCP, where 2M is the maximum.

Any value over 32K is unlikely to get you more performance, unless you have a very fast network.

If the network interface cannot handle larger packet sizes or a long train of back to back packets, you may see low performance figures or even temporary hangups during NFS activity.

This seems to possibly point to the root cause.
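
To pull the effective sizes out of a mount line like the one above without reading it by eye, something like the following might help. It is a small sketch: the helper name nfs_opt is made up, and it assumes the usual "(opt,opt=value,...)" layout that mount prints on Linux.

```shell
# Print the value of one mount option (for example rsize) from mount output
# read on stdin. Only the first match is printed.
nfs_opt() {
  # Split on parentheses, commas and spaces so each option is on its own line,
  # then keep the line that starts with "<name>=" and strip that prefix.
  tr '(), ' '\n\n\n\n' | sed -n "s/^$1=//p" | head -n 1
}
```

Inside the Docker VM this could be used as: mount | grep nfs-test | nfs_opt rsize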

noud-github commented 1 year ago

It looks like Docker is trying to use a pretty high default packet size. That, in combination with:

If the network interface cannot handle larger packet sizes or a long train of back to back packets, you may see low performance figures or even temporary hangups during NFS activity.

makes a solid case for using a lower packet size, since Ventura's "local interface" appears to behave exactly that way. Makes you wonder what they changed there ;-)

karlshea commented 1 year ago

Those sizes are supposed to be powers of 2 according to the man page. I tested up to 262144 (md5sum hangs), the biggest that worked was 131072.
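
A quick sanity check for the power-of-2 rule, plus the list of values worth trying, could look like this. It is a sketch with made-up function names. The range runs from 8192 (the macOS UDP default) to 1048576 (what the Docker VM negotiated). The check would also have flagged the rsize=3276 typo earlier in this thread.

```shell
# Succeed only if the argument is a positive integer and a power of two.
is_pow2() {
  n=$1
  case $n in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$n" -gt 0 ] && [ $(( n & (n - 1) )) -eq 0 ]
}

# Print every power of two from 8192 up to 1048576, one per line.
candidate_sizes() {
  size=8192
  while [ "$size" -le 1048576 ]; do
    echo "$size"
    size=$(( size * 2 ))
  done
}
```

Example: is_pow2 32768 && echo "ok" || echo "not a power of two"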

rfay commented 1 year ago

Oh I forgot - I had tested this problem with Colima in https://github.com/drud/ddev/issues/4122#issuecomment-1272648461 - So this is not strictly a Docker Desktop issue I don't think.

noud-github commented 1 year ago

So this is not strictly a Docker Desktop issue I don't think.

I am using Rancher Desktop, so I can concur on that.

shinde-rahul commented 1 year ago

Setting wsize=8192,rsize=8192 fixes my issue too.

Carpenter0100 commented 1 year ago

@noud-github

Fuck this! I spent the whole day trying to find a solution to this problem. Thank you very much.

Could you explain how you came up with the solution and how you investigated the problem? Thank you.

karlshea commented 1 year ago

Could you explain how you came up with the solution and how you investigated the problem? Thank you.

All of the investigation is in this issue and drud/ddev#4122. I believe all of us looking into it thought we were raising the Docker defaults to fix the problem, but it turns out we were lowering them.

notinaboat commented 1 year ago

FWIW: I'm seeing nfsd send error 40 and md5sum bigfile hanging with Ventura NFS server and Raspberry Pi clients over WiFi (no Docker involved). Was reliable before Ventura. So this looks like a macOS problem.

wsize=65536,rsize=65536 works for me.

Krilo89 commented 1 year ago

Exactly what @Carpenter0100 says ;). How did you come up with those params @noud-github ?

freef4ll commented 1 year ago

The recently released Ventura 13.1 now has issues with wsize=32768,rsize=32768 and fails to use them:

nfssvc_addsock: socket buffer setting error(s) 22

Error code 22 is EINVAL.

Bumping the NFS socket buffer to 65536 solves the issue for me.

herveguetin commented 1 year ago

Same issue with a Magento 2 project using a huge number of Composer dependencies. Using wsize=65536,rsize=65536 fixed the issue. Once the YAML file is updated, do not forget to:

  1. docker compose down
  2. docker volume rm [NFSMOUNT_VOLUME_NAME]
  3. docker-compose up -d
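
The three steps above can be wrapped in one small helper so the volume is never left with stale options. This is a sketch: the function names are made up, it assumes the Compose v2 plugin (docker compose) and that it is run from the project directory, and the volume name passed in must be the full name shown by docker volume ls.

```shell
# Tear the project down, drop the NFS volume so the new mount options are
# picked up, then bring the project back up. Stops at the first failing step.
# $1 = full volume name as shown by "docker volume ls".
recreate_nfs_volume() {
  volume=$1
  docker compose down &&
    docker volume rm "$volume" &&
    docker compose up -d
}

# Print the option string Docker has stored for the volume, to confirm the
# new rsize/wsize values were applied.
show_volume_opts() {
  docker volume inspect --format '{{ .Options.o }}' "$1"
}
```

Example, with a placeholder volume name: recreate_nfs_volume myproject_nfsmount && show_volume_opts myproject_nfsmount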

noud-github commented 1 year ago

@Krilo89

Actually, it was down to more than two decades of DevOps experience. I cannot find the Git issue now, but there was a Ventura/Docker/NFS issue where someone mentioned that running NFS on one laptop and Docker on another did not have the problem. That reminded me of a two-decade-old issue on Windows where the MTU size was not respected/applied by the local interface (lo), breaking iSCSI on a local system. So I really only googled NFS and packet size to find the solution.

[edit] The working cross-Mac setup came from this thread: https://github.com/drud/ddev/issues/4122#issuecomment-1294862469

noud-github commented 1 year ago

@Carpenter0100 see above

docker-robott commented 1 year ago

There hasn't been any activity on this issue for a long time. If the problem is still relevant, mark the issue as fresh with a /remove-lifecycle stale comment. If not, this issue will be closed in 30 days.

Prevent issues from auto-closing with a /lifecycle frozen comment.

/lifecycle stale

karlshea commented 1 year ago

/remove-lifecycle stale /lifecycle frozen

There are workarounds, but with out of the box defaults it's broken. If anyone from the Docker org bothered replying to shed any light on this situation maybe it could move more towards "fixed".

colradec commented 1 year ago

Has anybody encountered a drastic decrease in performance (both r/w) of nfs mounts in 13.4? Installing our CI takes over ~40 minutes instead of ~5 (both M1 and M1 Pro)

I was debugging if the docker version caused this, however with another Mac with 13.3.1 no performance issues were noticeable.

To all of you, do NOT update to 13.4! The VirtIO FS does not have any performance issues but this problem #6820 seems to be present.