containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Running docker compose leads to Docker socket unavailable and WSL stuck #20379

Open jeffmaury opened 1 year ago

jeffmaury commented 1 year ago

Issue Description

On my Windows laptop, I ran docker compose up (having forgotten to set DOCKER_BUILDKIT=0) on https://github.com/docker/awesome-compose/tree/master/nginx-golang-postgres

After a while, compose seems to be stuck on: => => sending tarball

The only container running (docker.io/moby/buildkit:buildx-stable-1) has the following log:


time="2023-10-17T07:36:30Z" level=info msg="auto snapshotter: using overlayfs"
time="2023-10-17T07:36:30Z" level=warning msg="using host network as the default"
time="2023-10-17T07:36:30Z" level=info msg="found worker \"tmz81n0vbj5n2unpohhw1dyhk\", labels=map[org.mobyproject.buildkit.worker.executor:oci org.mobyproject.buildkit.worker.hostname:e64f09d4e167 org.mobyproject.buildkit.worker.network:host org.mobyproject.buildkit.worker.oci.process-mode:sandbox org.mobyproject.buildkit.worker.selinux.enabled:false org.mobyproject.buildkit.worker.snapshotter:overlayfs], platforms=[linux/amd64 linux/amd64/v2 linux/amd64/v3 linux/386]"
time="2023-10-17T07:36:30Z" level=warning msg="skipping containerd worker, as \"/run/containerd/containerd.sock\" does not exist"
time="2023-10-17T07:36:30Z" level=info msg="found 1 workers, default=\"tmz81n0vbj5n2unpohhw1dyhk\""
time="2023-10-17T07:36:30Z" level=warning msg="currently, only the default worker can be used."
time="2023-10-17T07:36:30Z" level=info msg="running server on /run/buildkit/buildkitd.sock"
time="2023-10-17T07:37:36Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = failed to copy to tar: rpc error: code = Canceled desc = grpc: the client connection is closing"
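
For anyone triaging similar BuildKit logs, here is a minimal sketch of pulling out only the error entries. It assumes the key=value layout shown above (quoted values, backslash-escaped inner quotes); the function names `parse_logfmt` and `error_messages` are hypothetical helpers, not part of Podman or BuildKit.

```python
import shlex


def parse_logfmt(line):
    """Split one key=value log line into a dict.

    shlex handles the double-quoted values and the backslash-escaped
    quotes that appear inside msg="..." fields.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that carry no '=' at all
            fields[key] = value
    return fields


def error_messages(lines):
    """Return the msg field of every line logged at level=error."""
    parsed = (parse_logfmt(line) for line in lines)
    return [f["msg"] for f in parsed if f.get("level") == "error" and "msg" in f]
```

Fed the log above, `error_messages` would return only the final Solve failure ("failed to copy to tar ... the client connection is closing"), which is the line that matters for this report.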

Once this state is reached, the Docker socket for the Podman machine is no longer available and WSL itself appears to be stuck. After I pressed Ctrl+C to interrupt the compose process, then stopped and removed the container, restarting the Podman machine gives:


Starting machine "podman-machine-default"
API forwarding for Docker API clients is not available due to the following startup failures.
        CreateFile \\.\pipe\docker_engine: All pipe instances are busy.

Podman clients are still able to connect.
Machine "podman-machine-default" started successfully

If I then try wsl --shutdown after the machine has been stopped, it seems to get stuck and never returns.

Steps to reproduce the issue

  1. podman machine init --rootful --now
  2. docker compose up
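
The compose step, together with the DOCKER_BUILDKIT=0 setting mentioned in the description, can be sketched as below. This is only an illustration: it assumes the docker CLI is on PATH and pointed at the Podman machine, the helper names `compose_env` and `compose_up` are hypothetical, and whether disabling BuildKit actually avoids the hang is implied by the report rather than verified here.

```python
import os
import subprocess


def compose_env(base_env=None, use_buildkit=False):
    """Copy the environment and set DOCKER_BUILDKIT explicitly.

    DOCKER_BUILDKIT=0 asks the docker CLI to use the classic builder
    instead of BuildKit.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DOCKER_BUILDKIT"] = "1" if use_buildkit else "0"
    return env


def compose_up(project_dir, use_buildkit=False):
    """Run `docker compose up` in project_dir with the chosen builder."""
    return subprocess.run(
        ["docker", "compose", "up"],
        cwd=project_dir,
        env=compose_env(use_buildkit=use_buildkit),
        check=False,  # leave it to the caller to inspect returncode
    )
```

Calling `compose_up(path, use_buildkit=True)` on a checkout of the nginx-golang-postgres sample corresponds to the failing case in this report; the default (`use_buildkit=False`) corresponds to the setting the reporter forgot.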

Describe the results you received

WSL seems to be stuck and the Docker socket is lost.

Describe the results you expected

WSL should not get stuck.

podman info output

host:
  arch: amd64
  buildahVersion: 1.32.0
  cgroupControllers:
  - cpuset
  - cpu
  - cpuacct
  - blkio
  - memory
  - devices
  - freezer
  - net_cls
  - perf_event
  - net_prio
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: conmon-2.1.7-2.fc38.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 98.93
    systemPercent: 0.36
    userPercent: 0.71
  cpus: 12
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: container
    version: "38"
  eventLogger: journald
  freeLocks: 2046
  hostname: DESKTOP-JEFF
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.15.90.1-microsoft-standard-WSL2
  linkmode: dynamic
  logDriver: journald
  memFree: 15473422336
  memTotal: 16646250496
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.8.0-1.fc38.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.8.0
    package: netavark-1.8.0-2.fc38.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.8.0
  ociRuntime:
    name: crun
    package: crun-1.9.2-1.fc38.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.9.2
      commit: 35274d346d2e9ffeacb22cc11590b0266a23d634
      rundir: /run/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20231004.gf851084-1.fc38.x86_64
    version: |
      pasta 0^20231004.gf851084-1.fc38.x86_64
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.1-1.fc38.x86_64
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 4294967296
  swapTotal: 4294967296
  uptime: 0h 4m 40.00s
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 1081101176832
  graphRootUsed: 2376384512
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 14
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.7.0
  Built: 1695839078
  BuiltTime: Wed Sep 27 20:24:38 2023
  GitCommit: ""
  GoVersion: go1.20.8
  Os: linux
  OsArch: linux/amd64
  Version: 4.7.0

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

Yes

Additional environment details

Win11Pro

wsl -v

WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.2428

baude commented 1 year ago

@n1hility any ideas on what might be going on?

n1hility commented 1 year ago

Looking into this one.

/assign

n1hility commented 1 year ago

I found some time to dig into this further, and the underlying issue is a problem with WSL networking[1]. I tracked down the root cause in their implementation and have opened an issue with analysis and suggestions on how to fix it. I will continue to discuss it with them. We could bypass their relay and use a direct connection to the host, but that would only address container control communication; port publishing from the container (e.g. run -p XXXX) would still be susceptible. First I want to see if they plan to take this up soon. It's severe enough that they might prioritize it, and waiting for the fix would be the best solution. If not, I will post a patch for the partial workaround.

[1] https://github.com/microsoft/WSL/issues/10688

github-actions[bot] commented 11 months ago

A friendly reminder that this issue had no activity for 30 days.

blubberdiblub commented 9 months ago

FWIW, if this is caused by the WSL networking problem, it is likely also responsible for the infinite hangs I see on podman pull for some images (apparently bigger ones with a higher number of layers, which would support the hypothesis; e.g. try the python:* images).

Luap99 commented 7 months ago

Can we close this one given the bug seems to be in WSL and not podman?

Fydon commented 7 months ago

Keeping it open would enable people to more easily find it rather than opening a new issue. I'm barely able to use podman on Windows at the moment with this problem, so maybe there will be more like me.

Luap99 commented 7 months ago

Yes, agreed, given that it seems to affect more people.

Fydon commented 6 months ago

Given that using mirrored networking mode is a suggested workaround for the problem we are experiencing, does that impact multi-architecture builds?
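
For readers who have not tried that workaround: mirrored mode is enabled through the `networkingMode` key in the `[wsl2]` section of the `.wslconfig` file in the Windows user profile. Below is a minimal sketch that adds the key to existing file contents. The function name `enable_mirrored_networking` is hypothetical, and the sketch assumes the file is plain INI that Python's `configparser` can read (comments in the file are not preserved on rewrite).

```python
import configparser
import io


def enable_mirrored_networking(wslconfig_text):
    """Return .wslconfig contents with networkingMode=mirrored under [wsl2]."""
    # interpolation=None keeps '%' in values untouched; optionxform=str
    # preserves the camelCase key names that .wslconfig uses.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str
    parser.read_string(wslconfig_text)
    if not parser.has_section("wsl2"):
        parser.add_section("wsl2")
    parser.set("wsl2", "networkingMode", "mirrored")
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()
```

The new setting only takes effect after WSL is restarted (for example with wsl --shutdown), and as the later comments in this thread show, mirrored mode did not work for everyone.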

My builder on Windows 11 (x64) now only shows amd64 platforms, whereas I think it used to show arm64 as an option (or at least it did with Docker Desktop). Builds with --platform linux/arm64 still run, but docker run with --platform linux/arm64 fails with exec container process `/bin/bash`: Exec format error.

> docker builder ls
NAME/NODE        DRIVER/ENDPOINT                      STATUS     BUILDKIT   PLATFORMS
mybuilder*       docker-container
 \_ mybuilder0    \_ npipe:////./pipe/docker_engine   running    v0.13.1    linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386
default          docker-container
 \_ default       \_ default                          inactive
desktop-linux                                         error

Varriount commented 6 months ago

@Fydon I actually tried enabling mirrored networking as a workaround, and found that it caused Podman/Docker Compose to stop working. Were you able to get it to work?

Fydon commented 6 months ago

I'm able to load images, which I wasn't able to with NAT networking. However I'm not able to run the ARM64 images I'm trying to test, as stated above, even if I switch back to NAT networking.

POnakS commented 5 months ago

Btw, this happens only when containers are built as part of docker-compose; otherwise it works.

POnakS commented 5 months ago

The workaround with mirrored mode does not work for me due to this issue: https://github.com/containers/podman/issues/22975

Fydon commented 4 months ago

With Podman version 5.1.2, I see that exporting images no longer leads to the Docker socket becoming unavailable and WSL getting stuck. Instead, the build fails after a short period, which is better:

 => ERROR exporting to docker image format                                                                   23.6s 
 => => exporting layers                                                                                      23.5s 
 => => exporting manifest sha256:9be2485ea3d100a0f45db0aa85e8cc1f106123261fc8d4d8ecca3c7a22602ff9            0.0s 
 => => exporting config sha256:ce465544cdca9147515410aaf735a1067246a4d0f98aaa0534be55997e0f6c11              0.0s 
 => => sending tarball                                                                                       0.0s 
------
 > exporting to docker image format:
------
ERROR: failed to solve: failed to copy to tar: rpc error: code = Unknown desc = io: read/write on closed pipe

View build details: docker-desktop://dashboard/build/foo/foo0/gx2ckyaj3uvfq5uu8kp8xobyk

It also fails immediately if the same command is run again. WSL continues to function, and other podman/docker commands keep working as long as they don't result in exporting images.