containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Container start failure: pasta fails to handle OSPF routes #22960

Closed by cyqsimon 2 weeks ago

cyqsimon commented 3 weeks ago

Issue Description

On a Linux system with OSPF routes, podman start/run fails with the following error:

ERRO[0000] Starting some container dependencies
ERRO[0000] "setting up Pasta: pasta failed with exit code 1:\nCouldn't set IPv4 route(s) in guest: Invalid argument\n"
Error: unable to start container "42706af169e8170399cab233d97d3b052407769123c788d59a8bb4d4acbe4010": starting some containers: internal libpod error
Error: unable to start container "b54f47f2fe46cf14337e53c6a608fc0b3d04a2f1c3d465e5db9bd90a62dcb7b3": setting up Pasta: pasta failed with exit code 1:
Couldn't set IPv4 route(s) in guest: Invalid argument

Steps to reproduce the issue

This requires setting up OSPF, which may be quite a lot of work, but here is the procedure anyway. Some useful references: Practical OSPF, FRR OSPFv2 user guide.

  1. Connect two Linux machines (virtual or physical) to the same subnet. For this example, the subnet is 10.20.30.0/24.
  2. Machine A (any Linux with FRR, e.g. Fedora 40) is our helper machine. On it, install frr and configure it to distribute a route. For this example, its IP on the subnet is 10.20.30.100 and the distributed route is 10.40.50.0/24.
  3. Machine B (any Linux with FRR and podman 5, e.g. Fedora 40) is the one running podman. On it, install frr and configure it to receive the route. For this example, its IP on the subnet is 10.20.30.101.
  4. Verify that the routing table on machine B now looks something like this. Note the last route, which was added via OSPF:
    default via 10.20.30.1 dev enp1s0 proto static metric 100
    10.20.30.0/24 dev enp1s0 proto kernel scope link src 10.20.30.101 metric 100
    10.40.50.0/24 nhid 37 via 10.20.30.100 dev enp1s0 proto ospf metric 20
  5. Try to start a container on machine B: podman run quay.io/podman/hello and observe the error.
  6. On machine B, run sudo systemctl stop frr.service && sudo ip route del 10.40.50.0/24 to remove the OSPF route.
  7. Try to start a container again and observe that it succeeds.

Describe the results you received

N/A

Describe the results you expected

N/A

podman info output

host:
  arch: amd64
  buildahVersion: 1.36.0
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.10-1.fc40.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.10, commit: '
  cpuUtilization:
    idlePercent: 94.85
    systemPercent: 1.01
    userPercent: 4.14
  cpus: 4
  databaseBackend: sqlite
  distribution:
    distribution: fedora
    variant: kde
    version: "40"
  eventLogger: journald
  freeLocks: 2044
  hostname: foo.bar.baz
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
  kernel: 6.8.10-300.fc40.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 891052032
  memTotal: 8299536384
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.10.0-1.fc40.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.10.0
    package: netavark-1.10.3-3.fc40.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.10.3
  ociRuntime:
    name: crun
    package: crun-1.15-1.fc40.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.15
      commit: e6eacaf4034e84185fd8780ac9262bbf57082278
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20240510.g7288448-1.fc40.x86_64
    version: |
      pasta 0^20240510.g7288448-1.fc40.x86_64
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: false
    path: /run/user/1000/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.2-2.fc40.x86_64
    version: |-
      slirp4netns version 1.2.2
      commit: 0ee2d87523e906518d34a6b423271e4826f71faf
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 6668939264
  swapTotal: 8589930496
  uptime: 311h 45m 51.00s (Approximately 12.96 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
store:
  configFile: /home/cyq/.config/containers/storage.conf
  containerStore:
    number: 2
    paused: 0
    running: 0
    stopped: 2
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/cyq/.local/share/containers/storage
  graphRootAllocated: 124817354752
  graphRootUsed: 24726548480
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 2
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/cyq/.local/share/containers/storage/volumes
version:
  APIVersion: 5.1.0
  Built: 1716940800
  BuiltTime: Wed May 29 08:00:00 2024
  GitCommit: ""
  GoVersion: go1.22.3
  Os: linux
  OsArch: linux/amd64
  Version: 5.1.0

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

No

Additional environment details

No response

Additional information

This seems like a very similar issue to #22192, so it likely needs to be fixed in pasta. But I'm not sure how to report directly to that project, so here's this issue.

Luap99 commented 3 weeks ago

> This seems like a very similar issue to https://github.com/containers/podman/issues/22192, so it likely needs to be fixed in pasta. But I'm not sure how to report directly to that project, so here's this issue.

https://passt.top/passt/about/#contribute

Regardless, the maintainers are active here, so this issue is good enough.

cc @sbrivio-rh @dgibson

sbrivio-rh commented 3 weeks ago

> 10.40.50.0/24 nhid 37 via 10.20.30.100 dev enp1s0 proto ospf metric 20

Weird, this is the first time I've seen a route with a nexthop identifier but only a single nexthop (i.e. not multipath). On the other hand, it's been a while since I last played with OSPF.

@cyqsimon, would you be so kind as to run strace -e recvmsg ip -4 route show and report the RTM_NEWROUTE part here? I need to see what the netlink message looks like in that case.

cyqsimon commented 3 weeks ago

> @cyqsimon, would you be so kind as to run strace -e recvmsg ip -4 route show and report the RTM_NEWROUTE part here? I need to see what the netlink message looks like in that case.

Of course. Here it is. I've replaced the addresses in the message to match my given example. Hope you find this okay.

recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=1718111824, nlmsg_pid=87924}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_STATIC, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("10.20.30.1")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("enp1s0")]]], [{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=1718111824, nlmsg_pid=87924}, {rtm_family=AF_INET, rtm_dst_len=24, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("10.20.30.0")], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("10.20.30.101")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("enp1s0")]]], [{nlmsg_len=76, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=1718111824, nlmsg_pid=87924}, {rtm_family=AF_INET, rtm_dst_len=24, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_OSPF, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("10.40.50.0")], [{nla_len=8, nla_type=RTA_PRIORITY}, 20], [{nla_len=8, nla_type=RTA_NH_ID}, "\x31\x00\x00\x00"], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("10.20.30.100")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("enp1s0")]]]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 204

sbrivio-rh commented 2 weeks ago

@cyqsimon it took me a bit, but thanks to your recvmsg() dump I kind of reproduced this. I didn't really set up FRR with OSPF, but I hardcoded a similar route and it looks like the kernel rejects it because the nexthop identifier (RTA_NH_ID) is not valid in the target namespace. We should strip those attributes, because the target namespace will not have matching identifiers, in general.

Could you please try this patch:

diff --git a/netlink.c b/netlink.c
index 4dbddb2..58822e9 100644
--- a/netlink.c
+++ b/netlink.c
@@ -608,6 +608,15 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
                 * route invalid in the namespace.  Strip off
                 * RTA_PREFSRC attributes to avoid that. */
                rta->rta_type = RTA_UNSPEC;
+           } else if (rta->rta_type == RTA_NH_ID) {
+               /* Host routes set up via routing protocols
+                * (e.g. OSPF) might contain a nexthop ID (and
+                * not nexthop objects, which are taken care of
+                * in the RTA_MULTIPATH case above) that's not
+                * valid in the target namespace. Strip those as
+                * well.
+                */
+               rta->rta_type = RTA_UNSPEC;
            }
        }

and see if it fixes the issue for you? You don't have to use Podman or even install a new build. You can just git clone git://passt.top/passt, feed the diff above to patch -p1 (or apply it manually), and build with make.

Then ./pasta --config-net (when an OSPF-derived route is present on the host) will show you if copying routes to the container now succeeds or not.

cyqsimon commented 2 weeks ago

@sbrivio-rh Yeah, your patch fixes it. The self-compiled binary now sets up the namespace successfully with the correct route. Thanks for the great work!

I'll leave this issue open for now; please close when you see fit. Thanks again!

Luap99 commented 2 weeks ago

https://passt.top/passt/commit/?id=62de6140d949795ff2595f0652b9c37929a3ce2f

sbrivio-rh commented 2 weeks ago

I'm preparing a release including this fix at the moment, by the way.

sbrivio-rh commented 1 week ago

Fixed in 2024_06_24.1ee2eca and in the matching Fedora 40 update.