nerdctl run with --cap-add NET_BIND_SERVICE not working

Description

I have several linuxserver-based containers whose unprivileged services bind to port 80 inside the container, so I can access them through a VPN without having to add port numbers to my URLs. This setup has been working without issue on Docker.

Now I'm moving to containerd (Docker support is being dropped on TrueNAS SCALE) and most of my containers fail to bind to port 80.

I modified my run commands to use --cap-add NET_BIND_SERVICE as instructed in the containerd github page, but the containers still fail to bind.

I can use docker inspect on the old containers to confirm that NET_BIND_SERVICE is present, but nerdctl inspect does not return any CapAdd field.

Steps to reproduce the issue

Configure a container with an unprivileged service that runs on port 80 internally
Launch the container using nerdctl run --cap-add NET_BIND_SERVICE
Watch the initialization logs of the container

Describe the results you received and expected

I expected the unprivileged service to bind to port 80 / 443, but it doesn't.

What version of nerdctl are you using?

1.5.0

Are you using a variant of nerdctl? (e.g., Rancher Desktop)

None

Host information

Client:
 Namespace:     default
 Debug Mode:    false

Server:
 Server Version: 1.6.8
 Storage Driver: overlayfs
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Log: fluentd journald json-file syslog
  Storage: native overlayfs zfs
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.107+truenas
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.41GiB

Hi @Caian

The nerdctl capability flags work correctly in my environment:

root@kay201:~# nerdctl run --cap-add NET_BIND_SERVICE,CHOWN,DAC_OVERRIDE,SETGID,SETUID --cap-drop ALL -d --name haha -p 80:80 docker.m.daocloud.io/nginx:alpine
0b016f6b89ec031c880fcc0c6aaf5deb6538dd736aa94381218a85adc3defe11
root@kay201:~# nerdctl ps
CONTAINER ID    IMAGE                                COMMAND                   CREATED          STATUS    PORTS                 NAMES
0b016f6b89ec    docker.m.daocloud.io/nginx:alpine    "/docker-entrypoint.…"    4 seconds ago    Up        0.0.0.0:80->80/tcp    haha
root@kay201:~# curl 127.0.0.1:80
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }

Note that nerdctl inspect does not show the capability information, but ctr c info does.

Could you give us more detail on the steps to reproduce the issue? :-)

To see the capability information with nerdctl, use nerdctl inspect --mode=native
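A small helper along these lines could pull the capability section out of the native output. This is only a sketch: `show_caps` is a name made up here, and it assumes the native output is pretty-printed JSON that embeds the OCI runtime spec with a `capabilities` object, which may differ between nerdctl versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nerdctl): print the lines following the
# "capabilities" key in the native inspect output for one container.
# Assumes pretty-printed JSON; adjust the -A line count as needed.
show_caps() {
    nerdctl inspect --mode=native "$1" | grep -i -A 12 '"capabilities"'
}

# Usage, with the container name from the example above:
#   show_caps haha
```

If `jq` is available, filtering on the JSON path would be more robust than `grep`, but the exact path should be checked against the output first.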

I modified my run commands to use --cap-add NET_BIND_SERVICE as instructed in the containerd github page, but the containers still fail to bind.

I expected the unprivileged service to bind to port 80 / 443, but it doesn't.

Since there has been no further feedback, it appears @Caian did not add the capability when previously using Docker (since it was not necessary).

With Docker, the sysctl net.ipv4.ip_unprivileged_port_start (default 1024) is lowered to 0 - allowing any unprivileged process to bind the typical privileged ports without being granted CAP_NET_BIND_SERVICE.
With their containerd attempt, I assume the sysctl was left at the default 1024 and the capability they added was only granted to root. Their image runs the binding process as a non-root user, so the capability will not be in that process's effective set.
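To check which of these applies, something like the following could be run inside the container as the same user that runs the service. It is only a sketch: `has_cap` is a helper name made up here, and 10 is the number of CAP_NET_BIND_SERVICE in the kernel's capability bitmask.

```shell
#!/usr/bin/env bash
# has_cap MASK BIT: succeed if bit BIT is set in the hex capability MASK.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Lowest port an unprivileged process may bind (1024 unless overridden).
start=$(cat /proc/sys/net/ipv4/ip_unprivileged_port_start 2>/dev/null)
echo "ip_unprivileged_port_start: ${start:-unknown}"

# Effective capability mask of this shell, as a hex string.
cap_eff=$(awk '/^CapEff:/ {print $2}' "/proc/$$/status" 2>/dev/null)

# CAP_NET_BIND_SERVICE is capability number 10.
if has_cap "${cap_eff:-0}" 10; then
    echo "CAP_NET_BIND_SERVICE is effective"
else
    echo "CAP_NET_BIND_SERVICE is NOT effective"
fi
```

If the first line reports 1024 and the capability is not effective, a bind to port 80 from that user is expected to fail.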

Solutions

Set the sysctl option --sysctl net.ipv4.ip_unprivileged_port_start=0 to bypass the need for the capability (all processes within the container can then bind to these ports, much as if ambient capabilities were supported).
If the services link to libc and the images are not scratch / distroless style, they could probably use authbind (https://stackoverflow.com/questions/413807/is-there-a-way-for-non-root-processes-to-bind-to-privileged-ports-on-linux/27989419#27989419), which is useful when the service that binds is script based like Python, JS, shell, etc.
Use setcap cap_net_bind_service=ep file_name to grant the capability to the Permitted and Effective sets on the executable (useful for software built as a static binary without libc, common with Rust/Go). This is considered a "capability-dumb" approach (see https://man7.org/linux/man-pages/man7/capabilities.7.html), for when the software cannot be made capability aware. The drawback is that the kernel checks that the permitted capability can become effective for the process before the executable runs, so the program fails to start when the capability is unavailable, even if it would not actually need it (for example, a user binds to an unprivileged port and drops all capabilities as a security measure).
Use setcap cap_net_bind_service=p file_name when the program is capable of observing its Permitted set and raising the needed capability into the Effective set. This is ideal when Ambient capabilities cannot be used (commonly not supported within containers, nor do you necessarily want to grant ambient capabilities process-wide).
Run as root with all capabilities dropped except for those needed. As with Ambient support, this is less viable with most base images, but may be acceptable for scratch or certain distroless variants. The majority of container vulnerabilities that motivate users to adopt a non-root user rely on adequate capabilities being granted, and those can still be exploited from a non-root user 🤷‍♂️
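As a rough sketch of the first and last options above, the host-side commands might look like the following. The image name `example/app:latest` and the container names are placeholders, not taken from this thread, and the `run` wrapper is a made-up helper that only prints each command unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the command, execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

IMAGE="example/app:latest"   # placeholder image name (assumption)

# First option: lower the unprivileged port floor inside this container's
# network namespace, so a non-root service can bind port 80 without the capability.
run nerdctl run -d --name web-sysctl \
    --sysctl net.ipv4.ip_unprivileged_port_start=0 \
    -p 80:80 "$IMAGE"

# Last option: run as root, but keep only the capability that is needed.
# A real image may need a few more (the nginx example earlier in the thread
# also added CHOWN, DAC_OVERRIDE, SETGID and SETUID).
run nerdctl run -d --name web-root \
    --user 0 --cap-drop ALL --cap-add NET_BIND_SERVICE \
    -p 8080:80 "$IMAGE"
```

Printing by default makes it easy to review the flags before running them with `DRY_RUN=0`.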

NOTE: The setcap approach for file-based capabilities:

Will remove LD_PRELOAD and LD_LIBRARY_PATH environment variables on binaries linked to libc (verify with ldd file_name), which depending on the software may introduce a regression.
On some systems (like a Synology NAS, see https://github.com/caddyserver/caddy-docker/issues/290#issuecomment-1504845336) setcap is not able to be used in an image build, likely due to AUFS + kernel.
User-namespaced containers require kernel 4.14

Ambient capabilities require at least kernel 4.3, and the sysctl requires at least kernel 4.11.

Sorry, I forgot to reply to the thread. Yes, I ended up adding --sysctl net.ipv4.ip_unprivileged_port_start=0 instead of using --cap-add, which solved the issue.

@AkihiroSuda there is no bug here, we should close.

@Caian was able to get what they wanted with ip_unprivileged_port_start (which is the Docker behavior) and @polarathene provided great details about why just using the cap on a random image will not work.
