[seccomp] update defaults to current docker rules (Started as: block unshare by default)

containers / common

Location for shared common files in github.com/containers repos.

Apache License 2.0

174 stars 181 forks source link

[seccomp] update defaults to current docker rules (Started as: block unshare by default) #1988

Open martinetd opened 2 months ago

martinetd commented 2 months ago

Hi,

I've noticed that on my systems (fedora, debian, alpine) it's possible to get network admin privileges in a user namespace within a container:

$ podman run --rm -ti docker.io/alpine
/ # apk add iptables
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/community/x86_64/APKINDEX.tar.gz
(1/4) Installing libmnl (1.0.5-r2)
(2/4) Installing libnftnl (1.2.6-r0)
(3/4) Installing libxtables (1.8.10-r3)
(4/4) Installing iptables (1.8.10-r3)
Executing busybox-1.36.1-r15.trigger
OK: 10 MiB in 19 packages
/ # unshare -Urn
526a5598f862:/# iptables -S
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT

I'd have expected this to be blocked, and looking at the git history there was some attempt at making unshare only allowed for containers with CAP_SYS_ADMIN but for some reason it was also duplicated and allowed in the general case, and that got cleaned up "the wrong way" in bf297c191ea46d0c0b252a1549dc4cfe6733dfe0

I've checked the latest docker rules (running docker), and they block unshare properly by default, so it looks like a case of its the Right Thing to Do (it's not blocked for sys admin containers, as we originally had)

At this point I checked their seccomp rules and there are quite a few other changes -- I believe https://github.com/containers/common/blob/main/pkg/seccomp/default_linux.go was originally based on https://github.com/mody/moby/blob/main/profiles/seccomp/default_linux.go , but the docker one is quite more strict and has more syscalls allowed only when some caps are given.

I've started updating the file locally, but before I spend more time on this:

what's the policy with using docker things? In this case I'm looking at what they're doing and so far it all makes sense, the syntax is a bit different but functionally I'll probably end up very close to the docker's default seccomp. Both repos are under apache license so I don't think that'll cause much problem, but I'd like to confirm that first.
Do we want bug-for-bug compatibility (e.g. just copy all their rules), or only take what looks useful? I've started cherry-picking but it might frankly be faster to just sync.

Thanks!

martinetd commented 2 months ago

@rhatdan (or anyone else) - sorry for the direct mention, could you please just tell me if there's a policy wrt using docker rules here or if I should redevelop them from scratch?

I'll do this work anyway, just cannot get started if I don't know what's acceptable/desired :)

Luap99 commented 2 months ago

The most important rule is that we cannot make it more restrictive as this would be a breaking change and most likely break a lot of existing users.

martinetd commented 2 months ago

I can understand where this comes from, but I strongly disagree we should not strive to make the default profile more secure. Saying "they're doing it" isn't a great argument, but docker has also been adding some restrictions following security recommendations e.g. blocking io_uring by default in this commit: https://github.com/moby/moby/commit/891241e7e74d4aae6de5f6125574eb994f25e169 (the commit message explains it better than I would be able to). For unshare() in particular, I was very surprised to see it work given the security implications of getting net admin rights in a user namespace means -- basically a few privilege escalations every year on netfilter bugs, for something almost no container should need (and since it's blocked in docker, people making containers should not rely on it anyway). I see no reason to allow it and the more restrictive we make it the better.

(Note it's not all black, in the differences with docker's default profile there also are some changes towards being more permissive, like allowing the bpf syscall (currently given if CAP_SYS_ADMIN is present) also if CAP_BPF is given; but I'll be honest and say blocking unshare is my main motivation here...)

If you want to prioritize compatibility, something like qemu pc models could make sense: declare a new seccomp-defaults-2024 and new installations can start with it, while systems upgrading from an older system would keep the old one unless they explicitly opt in or podman system reset... But that's a bit more work than I'm able to take on short term, so I'll probably focus on working on a new profile first anyway and patch that locally, then eventually work towards such a goal if that's the direction you prefer.

rhatdan commented 2 months ago

SECCOMP rules are probably the hardest for people to deal with, since it is very difficult to customize. In 10 years there has been very little movement in seccomp rules for this reason. When new syscalls get added to the kernel, then there are modifications to the seccomp rules. But other then that nothing. When seccomp blocks stuff normal users react in one of two ways. Either --privileged which is worse, or --security-opt seccomp=unconfined. In very rare cases they will create a customer seccomp.json file for the container.

Docker decided to block a few syscalls like unshare and mount, which are needed to run podman and other container engines within a container, a fairly common pattern. DIND pattern tends to be volume mounting in the hosts /var/run/docker.sock which does not require unshare or mount. But if you want to do PINP you need these syscalls.

I don't think giving these syscalls by default is very dangerous since we also have limited Capabilties (No SYS_ADMIN by default) and tools like SELinux to futher lock down. Also these syscalls are available to unprivileged users by default, so they are heavily monitored for vulnerabilities.

A few years ago I gave a talk about Goldi-locks and the three bears which talked about the level of security we can achieve where uniformed users (the majority) only response to permission denied is to run --privileged. (Momma Bear) While the goal is to increase security towards lock down (Papa Bear) we need to be very careful about changing defaults.

Luap99 commented 2 months ago

I can understand where this comes from, but I strongly disagree we should not strive to make the default profile more secure.

I am not against making the profile more secure, I am against breaking people and denying unshare will certainly have the effect.

Saying "they're doing it" isn't a great argument, but docker has also been adding some restrictions following security recommendations e.g. blocking io_uring by default in this commit: https://github.com/moby/moby/commit/891241e7e74d4aae6de5f6125574eb994f25e169 (the commit message explains it better than I would be able to).

Well given io_uring is rather new I don't think many applications have a hard requirement on it and I could assume that they all have a fallback to the older io model's. And well in this case we can argue we did the right thing then by never allowing them to begin with: https://github.com/containers/common/pull/1264

For unshare() in particular, I was very surprised to see it work given the security implications of getting net admin rights in a user namespace means -- basically a few privilege escalations every year on netfilter bugs, for something almost no container should need (and since it's blocked in docker, people making containers should not rely on it anyway). I see no reason to allow it and the more restrictive we make it the better.

Yes I understand it but I don't think it justifies breaking users that depend on it, i.e. nested containers.

If you want to prioritize compatibility, something like qemu pc models could make sense: declare a new seccomp-defaults-2024 and new installations can start with it, while systems upgrading from an older system would keep the old one unless they explicitly opt in or podman system reset... But that's a bit more work than I'm able to take on short term, so I'll probably focus on working on a new profile first anyway and patch that locally, then eventually work towards such a goal if that's the direction you prefer.

I think having different pre-defined profiles to choose from makes a lot of sense given most people will never edit their own profiles. The default profile is installed at /usr/share/containers/seccomp.json or /etc/containers/seccomp.json so anyone can just change it if they like without to much effort. Distros could also patch it if they want stricter defaults. I am also open to consider stricter changes for a major version, i.e. podman 6.0.

rhatdan commented 2 months ago

I think the best idea would be to have multiple seccomp.json files that users could easily choose from. But who is going to maintain them, and what should they be called and who is going to define them.

seccomp-strict.json (Drop lots of questionable seccomp rules). seccomp.json (default)

rhatdan commented 2 months ago

Bottom line is users hitting a permission denied will have a hard time understanding what is going wrong and have little recourse to get their container running.

https://www.redhat.com/sysadmin/container-permission-denied-errors

martinetd commented 2 months ago

Thank you both for the thorough answers. It's getting late here so I'll check for further replies tomorrow, but it's great to discuss this.

I don't think giving these syscalls by default is very dangerous since we also have limited Capabilties (No SYS_ADMIN by default) and tools like SELinux to futher lock down.

Unfortunately, fedora with selinux enforced or debian with apparmor enforced do not block inserting netfilter rules as a user in container with unshare -Urn, there might be room for improvement on these two sides but I would really like to at least see iptables/nft blocked somehow.

Also these syscalls are available to unprivileged users by default, so they are heavily monitored for vulnerabilities.

There are two sides of this coin:

the most vulnerable processes like chrome will run their own sandbox (bubblewrap), so even if chrome gets taken over the malicious code will not be able to unshare() further
even if these syscalls get monitored, there really are many netfilters security bugs -- I'm maintaining our kernel so I see how many netfilter security issues we're fixing and there really is one every few months: https://www.opencve.io/cve?vendor=linux&product=linux_kernel&cvss=&search=netfilter Even if we update quickly and generally have released a fix before expoits and write-ups are made public (generally 2-3 months after the fix is public), "serious" attackers monitoring the fixes get a small windows to abuse these for every new exploit, so blocking the feature is much safer if possible.

DIND pattern tends to be volume mounting in the hosts /var/run/docker.sock which does not require unshare or mount. But if you want to do PINP you need these syscalls.

That's a very good point, I didn't think of PINP... We have a podman flag to allow systemd (and try to automatically detect if it runs), perhaps we could have a similar --podman-in-podman=auto flag that allows these in this case, or just another profile for seccomp=PINP ? I'm not quite sure how the detection would work though, perhaps have images define a tag saying which security profile they want like flatpak does? so user can override profiles, but by default images would work, and images not requesting anything will get something more restrictive... But that will require some communication and take time.

I also agree we want to avoid people turning everything off (which is indeed much worse), but I really don't see so many users of unshare() that we cannot work something out.

I think having different pre-defined profiles to choose from makes a lot of sense given most people will never edit their own profiles. The default profile is installed at /usr/share/containers/seccomp.json or /etc/containers/seccomp.json so anyone can just change it if they like without to much effort. Distros could also patch it if they want stricter defaults. I am also open to consider stricter changes for a major version, i.e. podman 6.0.

I agree the hurdle to writing one's own profile is too high, shipping multiple profiles and letting users switch more easily would be a great first step.

I'll try to finish looking at what's interesting in the docker's default seccomp policy and recap the changes early of next week, so we can see how much tuning would make sense; I don't think there'll be much more than mount and unshare (basically stuff required for nested podman) but it'll be good to confirm.

sbrivio-rh commented 2 months ago

For unshare() in particular, I was very surprised to see it work given the security implications of getting net admin rights in a user namespace means -- basically a few privilege escalations every year on netfilter bugs, for something almost no container should need (and since it's blocked in docker, people making containers should not rely on it anyway). I see no reason to allow it and the more restrictive we make it the better.

Even as the one guilty for CVE-2023-4004 (sigh, sorry), I'm not convinced that the alternative (blocking unshare(CLONE_NEWUSER)) is, in general, a better option, because you might also discourage applications from sandboxing themselves by doing exactly that -- detaching namespaces (passt/pasta do that too).

And while @Luap99 already mentioned some concerns related to Docker-in-Docker (or Podman-in-Podman), I wanted to add a specific one: unshare(CLONE_NEWUSER | CLONE_NEWNET) is exactly what allows rootless (network-wise) containers to exist. If we make those less appealing (even just from a compatibility/adoption perspective), we risk creating a bigger problem.

But yes, I agree that shipping multiple profiles is probably a good idea. Side note, I'm spending some effort on-and-off on a project that should make writing/deploying those profiles a bit simpler, by the way.

martinetd commented 2 months ago

Even as the one guilty for https://github.com/advisories/GHSA-h3j8-wx8r-29j6 (sigh, sorry), I'm not convinced that the alternative (blocking unshare(CLONE_NEWUSER)) is, in general, a better option, because you might also discourage applications from sandboxing themselves by doing exactly that -- detaching namespaces (passt/pasta do that too).

I guess it's a matter of where the container stands here -- I agree that the final application can do a much better job than we can at isolating itself, but most don't even try (mostly because seccomp is so hard to use; but even if something like https://justine.lol/pledge/ was made more standard many developers just don't care so I'm not hoping much here).

If the application legitimately uses unshare to isolate themselves then blocks unshare (as in bwrap [isolation] --disable-userns) then by all mean I'll want to allow that! However reality is that most don't, so if we're considering the container technology to provide isolation/security then I still think we're better off disabling namespaces in most case. (There's also been discussions about a max namespace depth in https://www.openwall.com/lists/oss-security/2024/04/19/4 )

So it really depends on where podman wants to place itself, I still think it makes sense to be strict about these and have a --I'm-securing-myself tag for applications (not people!) who know what they do, but given containers have no way of specifying such a tag right now I'd rather be realistic and let's start with profiles first :)

rhatdan commented 2 months ago

Could you prepare a seccomp-strict.json file and then we could package it up and allow users to choose it. Perhaps even make it easy in containers.conf to switch to it by default. Or at least tell users to copy /usr/share/containers/seccomp-strict.json /etc/containers/seccomp.json.

But I would want a lot more then just unshare and mount syscalls turned off for a strict mode. Once this is done then we need it to be maintained. Perhaps no new syscalls are added, since that would be stricter.

martinetd commented 2 months ago

Sorry for the delay!

I've gone through the differences with docker first, given it'll be a new profile that'll be hard to review so recapping the differences here to decide what we want in default profile and what we want in "hardened" profile.

Unopiniated list:

new (or not so new) syscalls allowed in docker: cachestat, futex_requeue, futex_wait, futex_waitv, futex_wake, io_pgetevents, io_pgetevents_time64, map_shadow_stack, vmsplice
the pass-all sycall is allowed in our default profile -- this allows calling any syscall, rendering this seccomp filter useless afaiu?
perf_event_open is allowed for CAP_PERFMON (and CAP_SYS_ADMIN as currently done)
bpf is allowed for CAP_BPF (and CAP_SYS_ADMIN as currently done)
syslog is allowed for CAP_SYSLOG (and removed from default allow list)
mbind, get_mempolicy, set_mempolicy are allowed for CAP_SYS_NICE (and removed from default allow list)
socket with AF_VSOCK are denied for everyone
readdir is denied on docker (this syscall is obsolete and no longer used on any architecture)
timerfd is denied on docker (appears to be specific to s390, mips, ia64 and alpha which also have the timerfd_create and friends variants, not used by glibc/musl at least)
sigaction, sigpending, and sigsuspend are denied on docker (appears to be arm64-only syscalls not used by glibc/musl at least)
riscv_flush_icache is allowed for riscv arch (currently denied)
swapcontext is allowed for ppc64le (currently denied)
pidfd_getfd is allowed for CAP_SYS_PTRACE (and removed from default allow list)
process_vm_readv, process_vm_writev and ptrace apparently had a bug with linux < 4.8 so are only allowed for CAP_SYS_PTRACE or kernels newer than this -- given no such kernels are maintained currently we can probably ignore this and leave them in default list
query_module is not allowed at all (allowed for CAP_SYS_MODULE currently)
pivot_root is denied on docker (allowed by default currently)
reboot is allowed for CAP_SYS_BOOT (and removed from default allow list)
podman-in-podman-related options:
- without CAP_SYS_ADMIN, clone3 is denied and clone is only allowed without CLONE_NEWNS and various namespace options
- other namespace-related things allowed only for SYS_ADMIN: unshare, setns
- mount-related things only for SYS_ADMIN: mount, umount, umount2, fsconfig, fsmount, fsopen, fspick, mount_setattr, mount_move, open_tree

I'd say we can do this in two steps.

1/ most of the riff raff can probably go to the main profile without much ceremony:

removing obsolete syscalls, restricting reboot to CAP_SYS_BOOT etc really shouldn't break anything
on getting more permissive with new syscalls, I'll take some time to check what they do but we can probably leave closed until someone complains except perhaps for the futex* ones? I assume libcs will fallback anyway, but this one sounds good to have, it's just too recent to have people complain about performance yet... musl in particular don't use the new syscalls yet (kernel 6.7+), so this won't affect me much anyway.
I also need to check the arch specific stuff
We can discuss the specific in said PR, that'll be easier to see point by point than replying to my long list

2/ once that's in, add a new hardened profile with the last points (clone, mount stuff, ns stuff) denied; I'll look at other commonly allowed syscalls and see if there's more to be moved to SYS_ADMIN-only.

I'm not sure about "never adding more stuff", honestly that'll probably have to be done on a case by case basis, but I agree it'll have to be maintained. A very stupid test I have in some of my repos when I need to keep two files updated is to add a check that diffs the two files, and errors if that doesn't match the expected output, meaning that anyone modifying either file needs to also either update the other file or the diff -- this is a bit of a pain but these seccomp rules aren't updated often so that might work. I'll be happy to help as long as I work here, but don't really have the bandwidth to check all the containers/common patches, so if you have another way to ping me when seccomp gets updated I can just review the old way if you prefer.

What do you think?

I'm still swamped, but will try to take time to open the first PR next week-ish.

rhatdan commented 2 months ago

SGTM

martinetd commented 1 month ago

By the way, not coming back on the current short term plan to make a hardened filter for distros to ship as easier-to-use-than-rewriting-from-scatch examples, but when working on this I've come to notice again systemd's SystemCallFilter directive and I really like the grouping work they've done (e.g. @aio, @basic-io, @network-io etc)

I still think that long term something like flatpaks where the containers can request the access rights they need and a user could override this at runtime would be great.

Perhaps we could abuse labels? e.g. in Dockerfile add LABEL syscall_filter="@basic-io,@process" to get these applied by default, but a user could see that (image inspect) and override with security-opt.

I'm not familiar enough with the container world here, is there something such as "well known labels" that already work kind of like this, or would it be more appropriate to add a different category of metadata? Also, where would such an idea need to be discussed to involve not just podman but other participants of the oci container standard?

Until then I'll keep working on a hardened profile as time allows, thanks for the reviews on the first PR