containers / buildah

A tool that facilitates building OCI images.
https://buildah.io
Apache License 2.0

Error during unshare(CLONE_NEWUSER): Operation not permitted #1901

Closed nmiculinic closed 1 year ago

nmiculinic commented 4 years ago

Description

I cannot run buildah bud

Steps to reproduce the issue:

docker run --rm -it ubuntu

Within the docker container I run the following:

https://github.com/containers/buildah/blob/master/install.md#ubuntu

root@dbdb5cd66273:/rootfs/ci/dockerfiles/test# buildah bud -f Dockerfile  .
Error during unshare(CLONE_NEWUSER): Operation not permitted
ERRO[0000] error parsing PID "": strconv.Atoi: parsing "": invalid syntax 
ERRO[0000] (unable to determine exit status)            
root@dbdb5cd66273:/rootfs/ci/dockerfiles/test# buildah --version
buildah version 1.10.1 (image-spec 1.0.1, runtime-spec 1.0.1-dev)
root@dbdb5cd66273:/rootfs/ci/dockerfiles/test# cat /proc/sys/user/max_user_namespaces
62901
root@dbdb5cd66273:/rootfs/ci/dockerfiles/test# cat "/proc/sys/kernel/unprivileged_userns_clone"
1
root@dbdb5cd66273:/rootfs/ci/dockerfiles/test# 

Describe the results you expected:

I expected everything to work out and build the OCI image.

Output of rpm -q buildah or apt list buildah:

root@dbdb5cd66273:/rootfs/ci/dockerfiles/test# apt list buildah
Listing... Done
buildah/bionic,now 1.10.1-1~ubuntu18.04~ppa1 amd64 [installed]

Output of buildah version:

buildah version 1.10.1 (image-spec 1.0.1, runtime-spec 1.0.1-dev)

Output of podman version if reporting a podman build issue: not installed

Output of cat /etc/release:

root@dbdb5cd66273:/rootfs/ci/dockerfiles/test# cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.3 LTS"

Output of uname -a:

root@dbdb5cd66273:/rootfs/ci/dockerfiles/test# uname -a
Linux dbdb5cd66273 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Output of cat /etc/containers/storage.conf:

(( default one ))

root@dbdb5cd66273:/rootfs/ci/dockerfiles/test# cat /etc/containers/storage.conf
# storage.conf is the configuration file for all tools
# that share the containers/storage libraries
# See man 5 containers-storage.conf for more information

# The "container storage" table contains all of the server options.
[storage]

# Default Storage Driver
driver = "overlay"

# Temporary storage location
runroot = "/var/run/containers/storage"

# Primary read-write location of container storage
graphroot = "/var/lib/containers/storage"

[storage.options]
# AdditionalImageStores is used to pass paths to additional read-only image stores
# Must be comma separated list.
additionalimagestores = [
]

# Size is used to set a maximum size of the container image.  Only supported by
# certain container storage drivers (currently overlay, zfs, vfs, btrfs)
size = ""

# OverrideKernelCheck tells the driver to ignore kernel checks based on kernel version
override_kernel_check = "true"

rhatdan commented 4 years ago

We recommend that people running Buildah within a locked-down container use images from quay.io: https://quay.io/repository/buildah/stable. Basically, running plain Buildah within a locked-down container will fail, because the unshare syscall is blocked. We recommend using --isolation=chroot, which eliminates the unshare call.
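
For reference, a minimal sketch of that recommendation (mount paths are illustrative, not taken from this thread; the isolation mode can also be set through the BUILDAH_ISOLATION environment variable):

docker run --rm -it -v $(pwd):/src quay.io/buildah/stable
buildah bud --isolation=chroot -f /src/Dockerfile /src
# equivalently, if the environment variable is preferred:
BUILDAH_ISOLATION=chroot buildah bud -f /src/Dockerfile /src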

nmiculinic commented 4 years ago

It doesn't seem to help at all:

docker run --rm -it -v $(pwd):/rootfs quay.io/buildah/stable
[root@664c4f767a70 test]# buildah bud --isolation=chroot  -f Dockerfile  .  
Error during unshare(CLONE_NEWUSER): Operation not permitted
ERRO error parsing PID "": strconv.Atoi: parsing "": invalid syntax 
ERRO (unable to determine exit status)            

Also, chroot appears to be the default isolation in that container as well.

rhatdan commented 4 years ago

Could you try this with podman? Also could you try docker run --security-opt seccomp=/usr/share/containers/seccomp.json --rm -it -v $(pwd):/rootfs quay.io/buildah/stable

I think Docker might be blocking the unshare syscall.
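
A quick way to check whether the seccomp profile is what blocks it (a rough test, assuming util-linux's unshare tool is available inside the failing container):

unshare --user --map-root-user true && echo "user namespaces allowed" || echo "unshare blocked"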

JamesWrigley commented 4 years ago

Not sure if this is the case on Ubuntu, but on Debian the kernel itself disables the unsharing: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=808915.

I had to manually allow unprivileged users to unshare, get Docker to use Podman's seccomp profile, and then Buildah ran in the container. Using --isolation=chroot had no effect, unfortunately.

rhatdan commented 4 years ago

I don't fully understand what you are saying. Did Buildah work or not work within the container?

JamesWrigley commented 4 years ago

Yes it did (on a Debian host), once I ran:

echo 1 > /proc/sys/kernel/unprivileged_userns_clone

I'm not sure why this is necessary if --isolation=chroot eliminates the unshare call.

Then when using Podman's seccomp profile, Buildah worked in the container:

docker run --security-opt seccomp=/usr/share/containers/seccomp.json --rm -it quay.io/buildah/stable

rhatdan commented 4 years ago

@nalind @giuseppe Are we still unsharing the namespace if we are doing --isolation=chroot?

giuseppe commented 4 years ago

@nalind @giuseppe Are we still unsharing the namespace if we are doing --isolation=chroot?

yes, a new user namespace is still necessary when the user has no CAP_SYS_ADMIN in the container.

rhatdan commented 4 years ago

@giuseppe Why, what do we need this for? I guess we are still bind mounting the /proc and /sys into the chroot.

giuseppe commented 4 years ago

@giuseppe Why, what do we need this for? I guess we are still bind mounting the /proc and /sys into the chroot.

Yes, we still need to be able to create bind mounts to set up the environment used by the chroot.
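
As a rough illustration of why mount privileges (and therefore the user namespace) are still needed for chroot isolation, the setup amounts to something like this (simplified sketch; $rootfs stands in for the working container's root):

mount --bind /proc "$rootfs/proc"
mount --bind /sys "$rootfs/sys"
chroot "$rootfs" /bin/sh -c "RUN-step commands"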

rhatdan commented 4 years ago

Thanks, I had figured that out.

rhatdan commented 4 years ago

So the Docker seccomp.json file blocking unshare is the issue, and it should be changed; or, as I recommend, use Podman/CRI-O to run these containers. You can run Docker with Podman's /usr/share/containers/seccomp.json file.
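
A sketch of both ways to do that with Docker (the daemon.json key is my recollection of dockerd's options, not something confirmed in this thread):

# per container:
docker run --security-opt seccomp=/usr/share/containers/seccomp.json --rm -it quay.io/buildah/stable

# or as the daemon-wide default, in /etc/docker/daemon.json:
#   { "seccomp-profile": "/usr/share/containers/seccomp.json" }
# followed by a daemon restart, e.g. systemctl restart docker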

qhaas commented 4 years ago

Could you try this with podman?

Seeing this error in podman on a ppc64le RHEL 7.6 host with a CentOS7 container.

# whoami
root
# sestatus | grep mode
Current mode:                   permissive
# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.6 (Maipo)
# arch
ppc64le
# podman --version
podman version 1.4.4
# podman run --rm -it ppc64le/centos:7
# cat /etc/redhat-release 
CentOS Linux release 7.8.2003 (AltArch)
# yum install -y buildah
...
# buildah --version
buildah version 1.11.6 (image-spec 1.0.1-dev, runtime-spec 1.0.1-dev)
# buildah from scratch
Error during unshare(CLONE_NEWUSER): Operation not permitted
ERRO error parsing PID "": strconv.Atoi: parsing "": invalid syntax 
ERRO (unable to determine exit status)
# buildah --isolation=chroot from scratch
Error during unshare(CLONE_NEWUSER): Operation not permitted
ERRO error parsing PID "": strconv.Atoi: parsing "": invalid syntax 
ERRO (unable to determine exit status)

If one starts podman with superpowers, one gets a different error:

# podman run --cap-add ALL --privileged --rm -it ppc64le/centos:7
...
# buildah from scratch  
ERRO 'overlay' is not supported over overlayfs    
'overlay' is not supported over overlayfs: backing file system is unsupported for this graph driver
# buildah --isolation=chroot from scratch
ERRO 'overlay' is not supported over overlayfs    
'overlay' is not supported over overlayfs: backing file system is unsupported for this graph driver
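
The second error is about nesting the overlay storage driver on top of an overlayfs mount. A common workaround (an assumption on my part, not something confirmed in this thread) is to switch the inner Buildah to the vfs driver, or to configure fuse-overlayfs as the mount program in /etc/containers/storage.conf:

buildah --storage-driver vfs from scratch
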
rhatdan commented 4 years ago

If you are in a container, then you should use buildah from --isolation=chroot; there is no reason to use container technology within a container.

We do a lot of configuration to make Buildah run within a locked-down container.

https://github.com/containers/buildah/blob/master/contrib/buildahimage/stable/Dockerfile
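
If you are not using that image, roughly the same effect can be approximated by hand (a sketch based on the general approach, not a quote of that Dockerfile; STORAGE_DRIVER and BUILDAH_ISOLATION are environment variables Buildah respects):

export BUILDAH_ISOLATION=chroot
export STORAGE_DRIVER=vfs    # or configure fuse-overlayfs in /etc/containers/storage.conf
buildah bud -f Dockerfile .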

GJaminon commented 3 years ago

no reason to use container technology within a container.

Sorry, but when building an image from inside a Jenkins container agent it is useful. Since dockerd is deprecated in Kubernetes, we need an alternative. Is it possible with Buildah, or do we need to find something else?

rhatdan commented 3 years ago

The comment should have been more specific. Basically, locking down a process within a container with additional, duplicative lockdown is not worth it. So if I have dropped capabilities and am running with SELinux and seccomp rules locked down, then don't attempt to apply them again. If the container engine attempts to, it will be blocked by the existing container lockdown, and your container engine will fail.

It is possible to run Buildah and Podman within a container. The issue is how much security you lock said container down with.

Running Docker within a container has the same issues. It requires a --privileged container or a container with a docker.socket leaked from the host into the container, which is arguably less secure than just running --privileged.

GJaminon commented 3 years ago

The goal is not to have a Docker socket available in a container but to build a container image inside a CI agent running in K8s.

rhatdan commented 3 years ago

Sure, but in order to run most containers you need more than one UID within the container, and a lot of the time the process needs some Linux capabilities. Podman requires these (as does Docker).

pkit commented 1 year ago

If you are in a container, then you should use buildah from --isolation=chroot; there is no reason to use container technology within a container.

Eh? Any time users want to manipulate an OCI/Docker image they will use "container technology within a container", as there is no other way to do so.

ifaizan commented 1 year ago

The goal is not to have a Docker socket available in a container but to build a container image inside a CI agent running in K8s.

@GJaminon we can run two containers inside a pod (one Docker server using the dind image, and the other a Docker client that uses the Docker server's TCP socket to build containers). This way, we don't need to mount the Docker socket of our host (the k8s node), which is no longer available after k8s v1.18, and we can still build images inside a containerized Jenkins build agent.

rhatdan commented 1 year ago

That would require a privileged pod though.

awildturtok commented 1 year ago

Still encountering this issue on quay.io/containers/buildah:v1.28 doing

buildah build --isolation=chroot ${CI_PROJECT_DIR}/Dockerfile

The container is run inside a Gitlab CI Pipeline

nikolaseu commented 1 year ago

Still encountering this issue on quay.io/containers/buildah:v1.28 doing

buildah build --isolation=chroot ${CI_PROJECT_DIR}/Dockerfile

The container is run inside a Gitlab CI Pipeline

Same for me

giuseppe commented 1 year ago

I think the default seccomp profile blocks unshare. You need to use a different seccomp profile.
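
For a GitLab runner using the Docker executor, that means pointing the executor at a more permissive profile. Roughly, in the runner's config.toml (the option name and value format are assumptions about the runner configuration, and the profile path must be readable by the Docker daemon on the runner host):

[runners.docker]
  security_opt = ["seccomp=/usr/share/containers/seccomp.json"]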

rhatdan commented 1 year ago

Docker/containerd block unshare and mount. Podman, Buildah, and CRI-O do not.

giuseppe commented 1 year ago

CRI-O blocks unshare by default as well. The seccomp profile needs to be changed with CRI-O too.

rhatdan commented 1 year ago

OK, CRI-O should be using the same seccomp.json file as Podman and Buildah.

rpm -qf /usr/share/containers/seccomp.json
containers-common-1-89.fc38.noarch

@mrunalp @haircommander @saschagrunert WDYT?

giuseppe commented 1 year ago

That was disabled, AFAIK, because user namespaces open up a lot of new kernel features that can be abused. Many security issues in the kernel in the last years were caused by user namespaces and Docker/containerd were not affected while CRI-O was. Personally, I think it makes sense for CRI-O to be more locked down than Podman and to allow more kernel features only when strictly necessary.

haircommander commented 1 year ago

OK, CRI-O should be using the same seccomp.json file as Podman and Buildah.

We actually embed the seccomp profile by default inside the binary, but we also manually remove unshare from it (https://github.com/cri-o/cri-o/blob/main/internal/config/seccomp/seccomp.go#L45), and this was done for the reasons @giuseppe mentions.

rhatdan commented 1 year ago

I can hear Eric B. screaming from the hinterlands. How would a user add unshare back to his own seccomp.go file?

haircommander commented 1 year ago

They can either specify a separate profile inside the pod spec (or unconfined, if they feel so bold), or they can point CRI-O to a profile on the node (like the one you attached above).
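
A sketch of both options, from my recollection of the Kubernetes and CRI-O documentation (verify against your versions):

Per pod, in the pod spec:

  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: containers-seccomp.json

(localhostProfile is resolved relative to the kubelet's seccomp profile root.)

Node-wide, in /etc/crio/crio.conf:

  [crio.runtime]
  seccomp_profile = "/usr/share/containers/seccomp.json"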

pkit commented 1 year ago

Many security issues in the kernel in the last years were caused by user namespaces and Docker/containerd were not affected while CRI-O was.

But Docker (or any other pure container tech) is inherently insecure anyway. What's the point?

giuseppe commented 1 year ago

Many security issues in the kernel in the last years were caused by user namespaces and Docker/containerd were not affected while CRI-O was.

But Docker (or any other pure container tech) is inherently insecure anyway. What's the point?

What do you mean by that? The point of seccomp for containers is to try to make them safer, as much as possible, with the right trade-off between security and what programs would break. If you need a custom profile, you can provide it.

User namespaces open up a wider kernel attack surface, since more kernel features can be used (e.g. the mount APIs). So to play it safe, it is better to disable them by default, at least on a cluster, and allow them only when necessary and in a controlled way.

IMO this should not be changed for CRI-O and unshare should be left disabled by default.

pkit commented 1 year ago

What do you mean by that?

I mean that Docker is insecure. Either you fully embrace seccomp (i.e. total lockdown and a user-space kernel; see gVisor), or you fully embrace a real VM (see Firecracker). All the other half-solutions only create a false sense that something is secure.

giuseppe commented 1 year ago

Well, there are compromises. Allowing unshare would give more possibilities to a malicious agent; e.g. https://unit42.paloaltonetworks.com/cve-2022-0492-cgroups/ could be avoided with unshare blocked.

I am closing the issue since I don't think we should change the default we currently have in CRI-O.

abitrolly commented 1 year ago

@giuseppe am I right that people need to "unblock unshare" anyway to build containers in containers with Buildah? In that case the decision just makes a false claim of security; it feels like Buildah is placing the blame on users/developers without providing any secure alternative.

I also came here from GitLab, because I saw Buildah as an alternative to Docker-in-Docker. I thought it was just a simple user-space tool that takes files and packs them. It is very frustrating to spend time on yet another layer of problems with no result.

awildturtok commented 1 year ago

I also came here from GitLab, because I saw Buildah as an alternative to Docker-in-Docker. I thought it was just a simple user-space tool that takes files and packs them. It is very frustrating to spend time on yet another layer of problems with no result.

@abitrolly I had the same aspirations to use Buildah, since we're 100% on Podman anyway. I've since switched to kaniko, which was a breeze to get going.
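
For anyone landing here with the same goal, a minimal kaniko job in .gitlab-ci.yml looks roughly like this (destination and tags are placeholders; registry credential setup is omitted):

build-image:
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - /kaniko/executor
      --context "$CI_PROJECT_DIR"
      --dockerfile "$CI_PROJECT_DIR/Dockerfile"
      --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"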

abitrolly commented 1 year ago

@awildturtok thanks for the pointer. Going to try kaniko. I understand that Linux container security is hard, but I would rather see big companies spend time making Kurzgesagt-style videos so that more people could understand how to improve it. With SELinux and podman/buildah, I admit that most of the time when dealing with their errors I don't know what I am doing, and this is what frustrates me most. High respect to the people who understand all that stuff. I am just not one of you.

EDIT: https://gitlab.com/abitrolly/gitlab-elasticsearch-indexer/-/jobs/4250152765#L22 kaniko rocks. )

terinjokes commented 10 months ago

It seems a bit weird to need unshare to build a multi-arch manifest from already-built images. AFAIK, there are no users or privileged operations involved.