containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Filesystem isolation layer #4794

Closed abitrolly closed 4 years ago

abitrolly commented 4 years ago

/kind feature

Description

podman + SELinux = Leaky Abstraction. In my story, containers provide an isolated environment that lets users concentrate on getting their app logic right without thinking about the low-level details and permissions required to keep their systems stable and secure.

That worked well with Docker + Ubuntu, except that the docker process itself runs as root, which made users like me feel uneasy. The appearance of podman solved exactly this problem for me.

What makes me think that podman on Fedora is a leaky abstraction is that after one year of trying to adopt podman into my workflow I know a lot of information that I don't need to know, and yet I am still not there.

Yesterday I was able to solve the problem I started with a year ago, thanks to the knowledge I acquired about SELinux labelling, the :z and :Z volume suffixes, USER and uid/gid mappings (these are not directly related to this issue, but learning them was necessary to remove unfitting pieces of a different puzzle from my head). The problem is that the Python function shutil.copystat copies extended attributes, and because the container uses the same filesystem as the host, SELinux denies this operation when you, for example, copystat files from /bin to keep them executable or read-only as before. There was a mistake in my volume mount command, which resulted in copying files from /bin instead of from the mounted /src/bin. But I could figure it out only a year later, when I made the same mistake again.
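To see which extended attributes a copy routine such as shutil.copystat would pick up from a file, something like the following sketch can be run inside the container. It is an illustration only: the helper name list_xattrs is hypothetical, and the assumption that an SELinux label appears under the name security.selinux applies to SELinux-enabled hosts.

```python
import os
import tempfile


def list_xattrs(path):
    """Return {name: value} for the extended attributes visible on `path`.

    Returns an empty dict when the platform, the filesystem, or the path
    itself does not allow listing attributes.
    """
    if not hasattr(os, "listxattr"):  # e.g. non-Linux platforms
        return {}
    try:
        names = os.listxattr(path)
    except OSError:
        return {}
    attrs = {}
    for name in names:
        try:
            attrs[name] = os.getxattr(path, name)
        except OSError:
            # The attribute is listed but not readable by this process.
            continue
    return attrs


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        # On an SELinux host one would expect a security.selinux entry here.
        for name, value in list_xattrs(tmp.name).items():
            print(name, value)
```

Any attribute printed here is one that a metadata-preserving copy will later try to write onto the destination file.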

I thought the problem was solved: I copy files for my build inside the container only from /src/bin. But today I realized that the problem is not solved, because the build system copies system-installed libs into the build subtree, and the SELinux vs copystat problem popped up again. I don't see how I can fix this, and I don't think that going down this rabbit hole is the right way anyway.

What I really want from podman is a filesystem isolation level where the filesystem in the container is completely isolated from the host, and volumes are no different from any other isolated container dir. No filesystem operation from inside the container should trigger SELinux or other additional filesystem or kernel drivers on the host, and hence no SELinux properties (or those of any other kernel drivers) should be visible in the container. If a volume on the host contains those labels, modifications to these files and labels should be done as if the files were created and modified by any user-level program, such as vim.

Steps to reproduce the issue:

  1. Build any snap with [stage-packages]
    podman run --rm -it -v /home/user/linux:/src:Z -w /src/snapcrafting/amend yakshaveinc/snapcraft:core18 snapcraft

Describe the results you received:

...
  File "/snap/snapcraft/current/usr/lib/python3.5/shutil.py", line 252, in copy2
    copystat(src, dst, follow_symlinks=follow_symlinks)
  File "/snap/snapcraft/current/usr/lib/python3.5/shutil.py", line 219, in copystat
    _copyxattr(src, dst, follow_symlinks=follow)
  File "/snap/snapcraft/current/usr/lib/python3.5/shutil.py", line 159, in _copyxattr
    os.setxattr(dst, name, value, follow_symlinks=follow_symlinks)
PermissionError: [Errno 13] Permission denied: '/src/snapcrafting/amend/parts/amend/ubuntu/download/libgssapi3-heimdal_7.5.0+dfsg-1_amd64.deb'
...

Describe the results you expected:

...
Cleaning later steps and re-staging amend ('stage' property changed)
Priming amend 
'confinement' property not specified: defaulting to 'strict'
'grade' property not specified: defaulting to 'stable'
Snapping 'amend' |                                                                                                                                         
Snapped amend_0.1.0_amd64.snap

Additional information you deem important (e.g. issue happens only occasionally):

To keep the user story short - as a user of containers, I don't want to think about SELinux attributes on my host when my unprivileged container, with a volume deep inside my home, tries to play with some files.

The solution I tested for LXD on Ubuntu is to mount the filesystem as 9p over the network through FUSE (https://github.com/yakshaveinc/linux/issues/32). I don't have the money to keep focus and make a proper solution out of it (add encryption and integrate with LXD), but as a proof of concept it works.

Output of podman version:

Version:            1.6.2
RemoteAPI Version:  1
Go Version:         go1.13.1
OS/Arch:            linux/amd64

rhatdan commented 4 years ago

Not really sure what is going on. I attempted to run your container on Fedora 31, but I am not sure what goes into /src/snapcrafting/amend

Inside the container, SELinux should be seen as disabled, so the tools running in the container should not be trying to do SELinux operations. I think the shutil.py package is grabbing all XAttrs that it sees and attempting to apply them. And the SELinux label of the container is not allowed.

shutil.py should become more aware of what it is doing or you need to disable SELinux separation.

podman run -ti --security-opt label=disable

rhatdan commented 4 years ago

It looks like shutil._copyxattr now has support for ignoring these errors, at least on Fedora 31?

/usr/lib64/python3.7/shutil.py

        """Copy extended filesystem attributes from `src` to `dst`.

        Overwrite existing attributes.

        If `follow_symlinks` is false, symlinks won't be followed.

        """

        try:
            names = os.listxattr(src, follow_symlinks=follow_symlinks)
        except OSError as e:
            if e.errno not in (errno.ENOTSUP, errno.ENODATA, errno.EINVAL):
                raise
            return
        for name in names:
            try:
                value = os.getxattr(src, name, follow_symlinks=follow_symlinks)
                os.setxattr(dst, name, value, follow_symlinks=follow_symlinks)
            except OSError as e:
                if e.errno not in (errno.EPERM, errno.ENOTSUP, errno.ENODATA,
                                   errno.EINVAL):
                    raise

abitrolly commented 4 years ago

I think the shutil.py package is grabbing all XAttrs that it sees and attempting to apply them. And the SELinux label of the container is not allowed.

I also think so. Why is shutil.py able to read those host-level XAttrs in the first place? Placing requirements on what software should and should not call is what I call a leaky abstraction of containerization for users.

Inside the container the Python used is /snap/snapcraft/current/usr/lib/python3.5, and that's not going to change soon, because it is the Python of the LTS release.

--security-opt label=disable

The help says "Turn off label separation for the container", but what exactly does it do?

rhatdan commented 4 years ago

It basically turns off SELinux labels, and runs the container with an unconfined label. Does the shutil.py inside of the container have the same code that I attached?
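For scripted builds, the option can be added conditionally when assembling the command line. The sketch below only builds the argument list and runs nothing; the helper name podman_run_argv and its parameters are hypothetical, while the flags and the image, volume, and working directory values are the ones already shown in this issue. Because label separation is off, it seems prudent to keep the mounted directory as narrow as possible.

```python
import shlex


def podman_run_argv(image, command, volume, workdir, label_disable=False):
    """Build the argument list for `podman run`; nothing is executed here."""
    argv = ["podman", "run", "--rm", "-it"]
    if label_disable:
        # Runs the container with an unconfined label (no SELinux separation).
        argv += ["--security-opt", "label=disable"]
    # The volume spec is passed through unchanged, including any :z / :Z suffix.
    argv += ["-v", volume, "-w", workdir, image]
    argv += list(command)
    return argv


if __name__ == "__main__":
    argv = podman_run_argv(
        image="yakshaveinc/snapcraft:core18",
        command=["snapcraft"],
        volume="/home/user/linux:/src:Z",
        workdir="/src/snapcrafting/amend",
        label_disable=True,
    )
    # Print a shell-quoted form that can be reviewed before running it.
    print(" ".join(shlex.quote(part) for part in argv))
```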

abitrolly commented 4 years ago

No. It was modified 7 months ago and it is not the same in Python 3.5 branch.

https://github.com/python/cpython/blame/master/Lib/shutil.py

https://github.com/python/cpython/blob/3.5/Lib/shutil.py
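For code that controls its own copies, a tolerant attribute copy can be sketched outside the standard library. This is an illustration under stated assumptions, not what any CPython version does: the function name copy_xattrs_tolerant is hypothetical, and adding EACCES to the ignored errors is this sketch's own choice. The traceback above reports Errno 13, which is EACCES on Linux, while the list in the newer code quoted above names EPERM.

```python
import errno
import os

# errno values treated as "this attribute cannot be copied here, carry on".
# Including EACCES is an assumption of this sketch (see the lead-in).
IGNORED = (errno.EPERM, errno.EACCES, errno.ENOTSUP, errno.ENODATA, errno.EINVAL)


def copy_xattrs_tolerant(src, dst):
    """Copy extended attributes from src to dst.

    Returns the list of attribute names that could not be copied.
    Errors outside IGNORED are re-raised.
    """
    skipped = []
    if not hasattr(os, "listxattr"):  # platform without xattr support
        return skipped
    try:
        names = os.listxattr(src)
    except OSError as e:
        if e.errno not in IGNORED:
            raise
        return skipped
    for name in names:
        try:
            os.setxattr(dst, name, os.getxattr(src, name))
        except OSError as e:
            if e.errno not in IGNORED:
                raise
            skipped.append(name)
    return skipped
```

Returning the skipped names, rather than silently dropping them, lets the caller log which labels were left to the destination filesystem's defaults.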

rhatdan commented 4 years ago

Well, there is not much we can do other than have you disable SELinux separation until this is fixed. If this even works with SELinux separation turned off.

abitrolly commented 4 years ago

This is the container https://hub.docker.com/layers/yakshaveinc/snapcraft/core18/images/sha256-a7f7a8395ff9e060fe073652c8f3663c03229c6a0b8c01f89282081ab83fc2e1

abitrolly commented 4 years ago

@rhatdan does --security-opt label=disable mean that the container will be able to read my SSH keys if I give it my home by mistake?

rhatdan commented 4 years ago

Yes. If it broke out to the file system, there would be nothing preventing the reading of these files if the process was running as your UID or as root.

baude commented 4 years ago

@rhatdan what gives on this one?

rhatdan commented 4 years ago

This needs to be fixed in the python3 package, I don't see anything for us to do here.

abitrolly commented 4 years ago

podman still needs a filesystem isolation layer to be adopted by the general public. I wouldn't bother writing a post this long if it were not so critical. The ability to easily share a work directory without reading a ton of info about UID/GID/SELinux/:z and :Z suffixes/ACL, like I did, is critical for mass adoption. Otherwise it turns from a public project into a corporate policy.

rhatdan commented 4 years ago

@abitrolly Not sure what you mean by this. The same issue would happen in Docker with SELinux turned on. And both it and Podman have been adopted by the "general public".

abitrolly commented 4 years ago

Part of the reason I started to use podman is to avoid problems with docker on SELinux-enabled systems. Podman can run unprivileged containers and is engineered from scratch to be awesome, but it doesn't fix the problem with sharing volumes r/w. Maybe that's not a use case for containers at all, and I had better find resources to complete my plan9 to be independent of problems with filesystems and SELinux.


rhatdan commented 4 years ago

Well, we can always push python3 to fix the issue with shutil, for which they have a partial fix. The bottom line is that the kernel prevents setting SELinux XAttrs on file systems that were mounted with the context mount option, or on file systems that do not support SELinux labels. Since we are using fuse-overlayfs for rootless containers, we end up with this restriction. You could remove fuse-overlayfs and it would be allowed on a "vfs" driver, but that has a lot of headaches. At some point in the future, overlayfs might be allowed to be used by rootless users, which could also alleviate this problem. Until then Podman has to live with the limitations of what rootless accounts provide.
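Whether a given directory is subject to this restriction can be probed from inside the container. The following is a rough sketch: the helper name probe_selinux_relabel is hypothetical, and it only rewrites the label a file already has, so a success here does not prove that changing the label to a different value would be allowed.

```python
import errno
import os
import tempfile

SELINUX_XATTR = "security.selinux"


def probe_selinux_relabel(path):
    """Try to rewrite the file's own SELinux label with the identical value.

    Returns a short status string instead of raising.
    """
    if not hasattr(os, "getxattr"):
        return "xattrs unsupported on this platform"
    try:
        value = os.getxattr(path, SELINUX_XATTR)
    except OSError as e:
        return "cannot read label: %s" % errno.errorcode.get(e.errno, e.errno)
    try:
        os.setxattr(path, SELINUX_XATTR, value)
    except OSError as e:
        return "cannot set label: %s" % errno.errorcode.get(e.errno, e.errno)
    return "label can be set"


if __name__ == "__main__":
    # Probe a scratch file in the current directory, e.g. the mounted volume.
    with tempfile.NamedTemporaryFile(dir=".") as tmp:
        print(probe_selinux_relabel(tmp.name))
```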

abitrolly commented 4 years ago

I am probably missing the point. Speaking in user stories: as a user I don't want to know how SELinux works with XAttrs, how vfs and fuse-overlayfs operate, or why overlayfs is worse or better, just to use podman for writing the results of running a container over my project dir (which is a Git checkout).

That's why I think it is easier to instruct the container to mount the volume over the network, and provide the volume through a 9p share (9pfs server). Then SELinux will prevent the 9pfs server from reading my .ssh/ files, but will not meddle with access to my project files.


abitrolly commented 4 years ago

The filesystem isolation layer with 9pfs could also fix this top voted Docker problem https://github.com/moby/moby/issues/2259

abitrolly commented 2 years ago

Looks like I managed to screw up my system with the podman :z flag (https://github.com/teemtee/tmt/issues/1179). I am certain that it is because of podman, because it is the only thing I use that interacts with SELinux labels. Any hints on how to fix that without disabling SELinux completely?

rhatdan commented 2 years ago

restorecon -v -R -F PATHTOBADDIR

abitrolly commented 2 years ago

Why doesn't it work without -F?

➜  ~ restorecon -v ~
/home/anatoli not reset as customized by admin to system_u:object_r:container_file_t:s0:c385,c528

How does podman manage to customize labels as admin if it is running rootless? Why doesn't it pay attention to the fact that mounted dirs already have labels that are not supposed to be overwritten?
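The label in the restorecon message above has the form user:role:type:level, where the level part can itself contain colons. A small parser makes it easier to check which directories ended up with the container_file_t type. This is a sketch; the names SELinuxContext and parse_context are hypothetical.

```python
from typing import NamedTuple


class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str


def parse_context(label):
    """Split an SELinux label into user, role, type and level.

    The level (e.g. s0:c385,c528) may contain colons, so split at most
    three times. Labels without a level yield an empty level string.
    """
    # Raw xattr values may carry a trailing NUL byte once decoded.
    parts = label.rstrip("\x00").strip().split(":", 3)
    if len(parts) < 3:
        raise ValueError("not an SELinux context: %r" % label)
    level = parts[3] if len(parts) == 4 else ""
    return SELinuxContext(parts[0], parts[1], parts[2], level)


if __name__ == "__main__":
    ctx = parse_context("system_u:object_r:container_file_t:s0:c385,c528")
    print(ctx.type, ctx.level)
```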

rhatdan commented 2 years ago

Shoot, I thought that overrode customizable types.

You can hack this by doing two commands.

chcon -t user_home_t -R PATHTOBADDIR
restorecon -v -R -F PATHTOBADDIR