containers / netavark

Container network stack
Apache License 2.0

DHCP proxy - no available capacity / crash when using external DHCP service #1024

Open jjzazuet opened 2 months ago

jjzazuet commented 2 months ago

Hi, I have a few standalone containers running under podman, using a macvlan network to make them available on an internal LAN. I'm observing the following:

1: Every time the external router assigns a DHCP host configuration to a container, netavark logs this message:

dhcp-proxy: [ERROR netavark::commands::dhcp_proxy] no available capacity

2: Every week or so netavark hard-crashes with the following logs, and the containers are no longer reachable at their static DHCP lease addresses:

Jul 11 21:29:36 build-00 user.notice dhcp-proxy: thread '<unnamed>' panicked at library/std/src/sys/pal/unix/stack_overflow.rs:158:13:
Jul 11 21:29:36 build-00 user.notice dhcp-proxy: failed to set up alternative stack guard page: Out of memory (os error 12)
Jul 11 21:29:36 build-00 user.notice dhcp-proxy: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Jul 11 21:29:36 build-00 user.notice dhcp-proxy: thread '<unnamed>' panicked at library/std/src/sys/pal/unix/stack_overflow.rs:154:13:
Jul 11 21:29:36 build-00 user.notice dhcp-proxy: failed to allocate an alternative stack: Out of memory (os error 12)
Jul 11 21:29:36 build-00 daemon.err /etc/init.d/netavark-dhcp-proxy[38874]: start-stop-daemon: failed to start `/usr/libexec/podman/netavark'
Jul 11 21:29:36 build-00 user.notice dhcp-proxy:  * start-stop-daemon: failed to start `/usr/libexec/podman/netavark'

I just tried building and upgrading to the latest upstream version of netavark, but I'm seeing the same log messages. I'll wait to see if I get a crash with this version.
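To gather more data before the next crash, one option is to sample the proxy's thread count and memory use from `/proc/<pid>/status` and check whether either grows steadily. The following is a minimal sketch, not part of netavark: the function names `parse_status`, `sample`, and `watch` are hypothetical, and the PID of the running dhcp-proxy process has to be looked up separately (for example with `pidof`).

```python
import time
from pathlib import Path

# Fields of /proc/<pid>/status worth tracking for a suspected leak.
WATCHED_FIELDS = ("Threads", "VmRSS", "VmSize")


def parse_status(text: str) -> dict:
    """Extract the watched fields from the text of /proc/<pid>/status.

    Threads is a plain count; VmRSS and VmSize are reported in kB.
    """
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key in WATCHED_FIELDS and value.split():
            fields[key] = int(value.split()[0])
    return fields


def sample(pid: int) -> dict:
    """Read and parse the status file of one process (Linux only)."""
    return parse_status(Path(f"/proc/{pid}/status").read_text())


def watch(pid: int, interval: float = 60.0, count: int = 10) -> list:
    """Take `count` samples, `interval` seconds apart, printing each one."""
    samples = []
    for i in range(count):
        snapshot = sample(pid)
        print(time.strftime("%Y-%m-%d %H:%M:%S"), snapshot)
        samples.append(snapshot)
        if i < count - 1:
            time.sleep(interval)
    return samples
```

Calling `watch(pid, interval=3600, count=168)` would log one sample per hour for a week. A `Threads` or `VmRSS` value that only ever rises would support the resource-exhaustion reading of the panics above, though it would not by itself identify the cause.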

# /opt/netavark version
{
  "version": "1.12.0-dev",
  "commit": "e182147b6aea964f572a4ca981bc000698d59539",
  "build_time": "2024-07-12T03:39:23.741229481+00:00",
  "target": "x86_64-alpine-linux-musl",
  "default_fw_driver": "iptables"
}
# podman network inspect podman30
[
     {
          "name": "podman30",
          "id": "71d03f55fc0de8a074d5e5de88759269e32da004c568f99bb94a420f2e7f31a2",
          "driver": "macvlan",
          "network_interface": "br30",
          "created": "2024-06-23T04:40:07.724065801Z",
          "ipv6_enabled": false,
          "internal": false,
          "dns_enabled": false,
          "ipam_options": {
               "driver": "dhcp"
          },
          "containers": {
               "2852850d2afb74e1e27aa46c2ab25a887f3abb9989d41c8e9699b4fea60d3f51": {
                    "name": "excalidraw",
                    "interfaces": {
                         "eth0": {
                              "subnets": [
                                   {
                                        "ipnet": "172.16.30.115/24",
                                        "gateway": "172.16.30.1"
                                   }
                              ],
                              "mac_address": "ee:b7:eb:0b:b1:2b"
                         }
                    }
               }
          }
     }
]
/home/gopher # podman info
host:
  arch: amd64
  buildahVersion: 1.35.4
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  cgroupManager: cgroupfs
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.12-r0
    path: /usr/bin/conmon
    version: 'conmon version 2.1.12, commit: unknown'
  cpuUtilization:
    idlePercent: 97.52
    systemPercent: 2.08
    userPercent: 0.4
  cpus: 24
  databaseBackend: sqlite
  distribution:
    distribution: alpine
    version: 3.20.1
  eventLogger: file
  freeLocks: 2031
  hostname: build-00
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.6.34-1-lts
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 29579984896
  memTotal: 33634459648
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.10.0-r0
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.10.0
    package: netavark-1.10.3-r0
    path: /usr/libexec/podman/netavark
    version: netavark 1.10.3
  ociRuntime:
    name: crun
    package: crun-1.15-r0
    path: /usr/bin/crun
    version: |-
      crun version 1.15
      commit: e6eacaf4034e84185fd8780ac9262bbf57082278
      rundir: /run/crun
      spec: 1.0.0
      +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-2024.06.07-r0
    version: |
      pasta unknown version
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 0
  swapTotal: 0
  uptime: 191h 35m 22.00s (Approximately 7.96 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 11
    paused: 0
    running: 11
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev
  graphRoot: /media/md1/containers/storage
  graphRootAllocated: 146415128576
  graphRootUsed: 4874047488
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "true"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 11
  runRoot: /media/md1/containers-runroot/storage
  transientStore: false
  volumePath: /media/md1/containers/storage/volumes
version:
  APIVersion: 5.0.3
  Built: 1717594599
  BuiltTime: Wed Jun  5 13:36:39 2024
  GitCommit: ""
  GoVersion: go1.22.4
  Os: linux
  OsArch: linux/amd64
  Version: 5.0.3

Let me know if more info is needed. Thanks!

Luap99 commented 2 months ago

It is likely the same cause as #811, but given that your error is different we cannot be sure. When #811 is fixed you should definitely retest.

jjzazuet commented 2 months ago

One more crash occurrence today with latest netavark build in case it helps:

Jul 15 22:16:10 build-00 user.notice dhcp-proxy: thread 'thread 'tokio-runtime-worker<unnamed>' panicked at ' panicked at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/thread/mod.rslibrary/std/src/sys/pal/unix/stack_overflow.rs::683168::2913:
Jul 15 22:16:10 build-00 user.notice dhcp-proxy: :
Jul 15 22:16:10 build-00 user.notice dhcp-proxy: failed to spawn thread: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }failed to set up alternative stack guard page: Out of memory (os error 12)
Jul 15 22:16:10 build-00 user.notice dhcp-proxy: stack backtrace:
Jul 15 22:16:10 build-00 user.notice dhcp-proxy: memory allocation of 3072 bytes failed
Jul 15 22:16:11 build-00 daemon.err /etc/init.d/netavark-dhcp-proxy[17725]: start-stop-daemon: failed to start `/opt/netavark'
Jul 15 22:16:11 build-00 user.notice dhcp-proxy:  * start-stop-daemon: failed to start `/opt/netavark'

Thanks!