containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

[rootless] netlink broadcast not working #23904

Open colinmarc opened 1 week ago

colinmarc commented 1 week ago

Issue Description

I'm working on a hack (ahem, a sophisticated bit of software) that broadcasts udev events via an AF_NETLINK, SOCK_RAW socket. I've attached some very rough/hacky Rust code at the bottom, which should be easy enough to compile and run.

The test program works fine on the host machine (as root), as well as in a normal netns, for example with the following steps:

$ sudo ip netns add foo
$ sudo ip netns exec foo udevadm monitor

(and then in another shell)

$ sudo ip netns exec foo ./hack

However, it doesn't seem to work in rootless Podman network namespaces. I tried both --net=pasta and --net=slirp4netns.

For background, uevents are used to provide hotplug information to running processes. For example, if you plug in a keyboard, it will generate a bunch of them. As of recent-ish kernels, those events are "namespaced" in the sense that they are only broadcast to listening sockets in the same network namespace.
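Since delivery depends on both sockets living in the same network namespace, it can help to confirm which namespaces each process is actually in. The following is a minimal sketch, not part of the original report: it reads the `/proc/<pid>/ns/<kind>` symlinks, and the helper names `namespace_id` and `same_namespace` are made up for illustration. Reading another process's entries may require sufficient privileges.

```rust
use std::fs;
use std::io;

/// Returns the namespace identifier for `pid`, e.g. "net:[4026531840]".
/// `pid` may be a numeric PID or "self"; `kind` is "net", "user", and so on.
fn namespace_id(pid: &str, kind: &str) -> io::Result<String> {
    let link = fs::read_link(format!("/proc/{pid}/ns/{kind}"))?;
    Ok(link.to_string_lossy().into_owned())
}

/// True when both processes are in the same namespace of the given kind.
fn same_namespace(a: &str, b: &str, kind: &str) -> io::Result<bool> {
    Ok(namespace_id(a, kind)? == namespace_id(b, kind)?)
}

fn main() -> io::Result<()> {
    // The PID to compare against is the first argument; defaults to "self".
    let other = std::env::args().nth(1).unwrap_or_else(|| "self".to_string());
    for kind in ["net", "user"] {
        println!(
            "{kind}: self={} other={} same={}",
            namespace_id("self", kind)?,
            namespace_id(&other, kind)?,
            same_namespace("self", &other, kind)?
        );
    }
    Ok(())
}
```

Running it once from the listener's shell and once from the sender's shell, each time passing the other process's PID, shows whether the two ends share both the network and the user namespace.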

Steps to reproduce the issue

Start a podman container:

$ podman run --privileged --cap-add=NET_ADMIN -v ~/path/to/hack:/root/hack -it ubuntu

In the container, install a few tools and then run udevadm monitor:

# apt update && apt install udev
# udevadm monitor

And in another shell, run the test:

$ podman exec -it --latest bash
# /root/hack

Describe the results you received

udevadm didn't pick up any events.

Describe the results you expected

On the host, or in a normal network namespace, this produces the following:

monitor will print the received events for:
UDEV - the event which udev sends out after rule processing
KERNEL - the kernel uevent

UDEV  [52346.869853] add      /tmp/eventxxx (input)

podman info output

host:
  arch: amd64
  buildahVersion: 1.37.2
  cgroupControllers:
  - cpu
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-1:2.1.12-1
    path: /usr/bin/conmon
    version: 'conmon version 2.1.12, commit: e8896631295ccb0bfdda4284f1751be19b483264'
  cpuUtilization:
    idlePercent: 98.19
    systemPercent: 0.36
    userPercent: 1.45
  cpus: 16
  databaseBackend: sqlite
  distribution:
    distribution: manjaro
    version: unknown
  eventLogger: journald
  freeLocks: 2025
  hostname: baldanders
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.8.12-3-MANJARO
  linkmode: dynamic
  logDriver: journald
  memFree: 1313931264
  memTotal: 33569259520
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.12.1-1
      path: /usr/lib/podman/aardvark-dns
      version: aardvark-dns 1.12.1
    package: netavark-1.12.2-1
    path: /usr/lib/podman/netavark
    version: netavark 1.12.2
  ociRuntime:
    name: crun
    package: crun-1.16.1-1
    path: /usr/bin/crun
    version: |-
      crun version 1.16.1
      commit: afa829ca0122bd5e1d67f1f38e6cc348027e3c32
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-2024_08_21.1d6142f-1
    version: |
      pasta 2024_08_21.1d6142f
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: false
    path: /run/user/1000/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.3.1-1
    version: |-
      slirp4netns version 1.3.1
      commit: e5e368c4f5db6ae75c2fce786e31eef9da6bf236
      libslirp: 4.8.0
      SLIRP_CONFIG_VERSION_MAX: 5
      libseccomp: 2.5.5
  swapFree: 0
  swapTotal: 0
  uptime: 35h 39m 53.00s (Approximately 1.46 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
  configFile: /home/colinmarc/.config/containers/storage.conf
  containerStore:
    number: 23
    paused: 0
    running: 0
    stopped: 23
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/colinmarc/.local/share/containers/storage
  graphRootAllocated: 983038173184
  graphRootUsed: 523265712128
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 6
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/colinmarc/.local/share/containers/storage/volumes
version:
  APIVersion: 5.2.2
  Built: 1724352649
  BuiltTime: Thu Aug 22 20:50:49 2024
  GitCommit: fcee48106a12dd531702d729d17f40f6e152027f
  GoVersion: go1.23.0
  Os: linux
  OsArch: linux/amd64
  Version: 5.2.2

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

No response

Additional information

Here's the Rust code:

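use rustix::fd::{AsRawFd as _, FromRawFd as _};
// Explicit import so the code also builds on toolchains where these are not in the prelude.
use std::mem::{size_of, size_of_val};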

const UDEV_EVENT_MODE: u32 = 2;

#[repr(C)]
struct Header {
    prefix: [u8; 8],
    magic: [u8; 4],
    header_size: u32,
    properties_off: u32,
    properties_len: u32,
    filter_subsystem_hash: [u8; 4],
    filter_devtype_hash: [u8; 4],
    filter_tag_bloom_hi: u32,
    filter_tag_bloom_lo: u32,
}

const PREFIX: [u8; 8] = [b'l', b'i', b'b', b'u', b'd', b'e', b'v', 0];
const MAGIC: [u8; 4] = 0xfeedcafe_u32.to_be_bytes();

fn main() -> std::io::Result<()> {
    let mut addr = unsafe {
        let mut sa: libc::sockaddr_nl = std::mem::zeroed();
        sa.nl_family = libc::AF_NETLINK as u16;
        sa.nl_groups = UDEV_EVENT_MODE;

        sa
    };

    let sock = unsafe {
        let sock = libc::socket(
            libc::AF_NETLINK,
            libc::SOCK_RAW,
            netlink_sys::protocols::NETLINK_KOBJECT_UEVENT as i32,
        );

        if sock < 0 {
            return Err(std::io::Error::last_os_error());
        }

        let mut sa: libc::sockaddr_nl = std::mem::zeroed();
        sa.nl_family = libc::AF_NETLINK as u16;
        sa.nl_groups = UDEV_EVENT_MODE;

        let res = libc::bind(
            sock,
            &sa as *const libc::sockaddr_nl as *const _,
            size_of_val(&sa) as u32,
        );

        if res < 0 {
            return Err(std::io::Error::last_os_error());
        }

        std::os::fd::OwnedFd::from_raw_fd(sock)
    };

    let msg =
        "ACTION=add\0DEVNAME=input/foobar\0DEVPATH=/tmp/eventxxx\0SEQNUM=1234\0SUBSYSTEM=input\0";
    let msg_bytes = msg.as_bytes();

    let header = Header {
        prefix: PREFIX,
        magic: MAGIC,
        header_size: size_of::<Header>() as u32,
        properties_off: size_of::<Header>() as u32,
        properties_len: msg.len() as u32,
        filter_subsystem_hash: murmur2::murmur2ne("input\0".as_bytes(), 0).to_be_bytes(),
        filter_devtype_hash: 0_u32.to_be_bytes(),
        filter_tag_bloom_hi: 0,
        filter_tag_bloom_lo: 0,
    };

    let mut out = vec![0_u8; size_of::<Header>() + msg_bytes.len()];
    out[size_of::<Header>()..].copy_from_slice(msg_bytes);
    unsafe {
        std::ptr::write(out.as_mut_ptr() as _, header);
    }

    let out_iovec = rustix::io::IoSlice::new(&out);
    let out_iovecs = &mut [out_iovec];
    unsafe {
        let mut hdr: libc::msghdr = std::mem::zeroed();
        hdr.msg_name = &mut addr as *mut libc::sockaddr_nl as *mut _;
        hdr.msg_namelen = size_of_val(&addr) as u32;
        hdr.msg_iov = out_iovecs.as_mut_ptr() as *mut _;
        hdr.msg_iovlen = out_iovecs.len();
        let res = libc::sendmsg(sock.as_raw_fd(), &hdr, 0);
        if res < 0 {
            return Err(std::io::Error::last_os_error());
        }
    }

    Ok(())
}
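
For readers who want to inspect the packet without the `unsafe` pointer write, here is a rough sketch that serializes the same 40-byte header layout field by field using only the standard library. It assumes the `#[repr(C)]` layout of the `Header` struct above (no padding), and the names `HEADER_LEN` and `build_packet` are hypothetical. The subsystem hash is passed in as bytes rather than computed, since the murmur2 crate is not used here.

```rust
/// Size of the header as laid out by the `Header` struct above:
/// 8 (prefix) + 4 (magic) + 7 * 4 (remaining 4-byte fields) = 40 bytes.
const HEADER_LEN: usize = 40;

/// Builds header + properties into one buffer, mirroring the original field order.
fn build_packet(properties: &[u8], subsystem_hash: [u8; 4]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + properties.len());
    out.extend_from_slice(b"libudev\0"); // prefix
    out.extend_from_slice(&0xfeedcafe_u32.to_be_bytes()); // magic, big-endian
    out.extend_from_slice(&(HEADER_LEN as u32).to_ne_bytes()); // header_size
    out.extend_from_slice(&(HEADER_LEN as u32).to_ne_bytes()); // properties_off
    out.extend_from_slice(&(properties.len() as u32).to_ne_bytes()); // properties_len
    out.extend_from_slice(&subsystem_hash); // filter_subsystem_hash
    out.extend_from_slice(&[0_u8; 4]); // filter_devtype_hash
    out.extend_from_slice(&0_u32.to_ne_bytes()); // filter_tag_bloom_hi
    out.extend_from_slice(&0_u32.to_ne_bytes()); // filter_tag_bloom_lo
    debug_assert_eq!(out.len(), HEADER_LEN);
    out.extend_from_slice(properties);
    out
}

fn main() {
    // The hash bytes here are a placeholder, not a real murmur2 value.
    let packet = build_packet(b"ACTION=add\0SUBSYSTEM=input\0", [0; 4]);
    println!("packet is {} bytes, header is {} bytes", packet.len(), HEADER_LEN);
}
```

The integer fields use native byte order because the original code writes the struct to memory as-is; only the magic and the hashes are explicitly big-endian there.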

colinmarc commented 1 week ago

More debugging. If I run both hack and udevadm monitor via sudo nsenter -t $PID -n ..., then the events are propagated. Neither running udevadm monitor in the container shell and hack via nsenter, nor the inverse, seems to work.

sbrivio-rh commented 1 week ago

> If I run both hack and udevadm monitor via sudo nsenter -t $PID -n ..., then the events are propagated.

Does $PID there represent the network namespace associated with a Podman container, or is it the one from ip netns add?

To me it looks like the difference is whether you detach the network namespace as root, or together with a user namespace as non-root. In the second case, I guess, the kernel will not accept your crafted netlink message.

colinmarc commented 1 week ago

> Does $PID there represent the network namespace associated to a Podman container, or it's the one from ip netns add?

Yes, the podman container namespace, fetched via podman inspect --latest --format "{{.State.Pid}}".

I get the same result running completely rootless with nsenter -t $PID -U -n ... (no sudo). If I run both the listener and the broadcaster that way, it works fine. Running either end inside the container shell doesn't. That's despite supposedly having the right caps in the container shell:

root@c8564d96e73b:/# grep Cap /proc/$BASHPID/status
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000

Edit: also, just to add, running this without caps (no --privileged in the container, or without sudo on the host) results in a clear EPERM from the kernel. With caps in the container shell, it just fails silently.
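
To double-check what a mask like the CapEff value above contains, a small decoder can help. This is an illustrative sketch only: `parse_cap_mask` and `has_cap` are hypothetical helper names, and the bit numbers for CAP_NET_ADMIN (12) and CAP_SYS_ADMIN (21) are taken from the Linux capability numbering. Note that a full mask only describes capabilities relative to the process's own user namespace, so it does not by itself show how the kernel treats the sender here.

```rust
const CAP_NET_ADMIN: u32 = 12;
const CAP_SYS_ADMIN: u32 = 21;

/// Parses a line such as "CapEff: 000001ffffffffff" from /proc/<pid>/status.
fn parse_cap_mask(line: &str) -> Option<u64> {
    let hex = line.split(':').nth(1)?.trim();
    u64::from_str_radix(hex, 16).ok()
}

/// True when the capability with bit number `cap` is set in `mask`.
fn has_cap(mask: u64, cap: u32) -> bool {
    mask & (1_u64 << cap) != 0
}

fn main() {
    let mask = parse_cap_mask("CapEff: 000001ffffffffff").expect("valid hex mask");
    println!("CAP_NET_ADMIN set: {}", has_cap(mask, CAP_NET_ADMIN));
    println!("CAP_SYS_ADMIN set: {}", has_cap(mask, CAP_SYS_ADMIN));
    println!("capabilities set: {}", mask.count_ones());
}
```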