libpod: do not chmod bind mounts

giuseppe commented 3 months ago

with the new mount API is available, the OCI runtime doesn't require that each parent directory for a bind mount must be accessible. Instead it is opened in the initial user namespace and passed down to the container init process.

This requires that the kernel supports the new mount API and that the OCI runtime uses it.

Closes: https://github.com/containers/podman/issues/23028

Does this PR introduce a user-facing change?

Now Podman requires the new kernel mount API (available since Linux 5.2) to configure containers in a new user namespace where the current user is not part of the mapping

openshift-ci[bot] commented 3 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/containers/podman/blob/main/OWNERS)~~ [giuseppe] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

Luap99 commented 3 months ago

Looks like it might be needed after all.

giuseppe commented 3 months ago

I was over optimistic :)

I've pushed a new version that really deals only with the bind mounts, and this works well with runc too.

I've marked the PR as a Draft for now, I want to check better what we can do with the remaining occurrences

Luap99 commented 3 months ago

So thinking about this more I assume the oci runtimes try to use the new mount API by default but then silently fall back to the old one if the new one doesn't work? Doesn't this mean that we break the userns containers on old distros or even just limited envs (i.e. are there any seccomp profiles blocking the mount API but not the old mount syscall)? As such would it make sense to maybe check in podman if the syscall exists/works first only only do the chmod if it does not? The alternative would be we have to document that we only work on kernels with the new mount API (it has been around for a while so maybe this is not a problem in practise).

giuseppe commented 3 months ago

So thinking about this more I assume the oci runtimes try to use the new mount API by default but then silently fall back to the old one if the new one doesn't work? Doesn't this mean that we break the userns containers on old distros or even just limited envs (i.e. are there any seccomp profiles blocking the mount API but not the old mount syscall)? As such would it make sense to maybe check in podman if the syscall exists/works first only only do the chmod if it does not? The alternative would be we have to document that we only work on kernels with the new mount API (it has been around for a while so maybe this is not a problem in practise).

I've originally added the makeAccessible logic exactly for this reason, since the mount API was not available everywhere. fsopen(2) was added to Linux 5.2, that was released on 7 July 2019. IMO, at this point we can safely assume the new API is available everywhere we care. I'd not mind keeping the fallback, since we already have the code, but it is quite invasive as we chown different directories.

Still marked as a Draft as I need to polish the commits, pushed early only to trigger the CI

Luap99 commented 3 months ago

I am fine with not having the fallback if it is out for that long as long as we document this in the release note that we require the new mount API (kernel 5.2).

mheon commented 3 months ago

LGTM

rhatdan commented 3 months ago

/lgtm

containers / podman

libpod: do not chmod bind mounts #23032

Does this PR introduce a user-facing change?