containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Random relabelling failures #1739

Closed EmilienM closed 5 years ago

EmilienM commented 6 years ago

Is this a BUG REPORT or FEATURE REQUEST?:

kind bug

Description: Since podman-0.10.1.3-2.git6e1aeb0.el7.x86_64 landed in OpenStack CI, more than 65% of our jobs fail, and they all fail for the same reason: relabelling one directory (always the same one).

Steps to reproduce the issue:

  1. Running podman command:

    /usr/bin/podman run --user root --name docker-puppet-ironic_inspector --env PUPPET_TAGS=file,file_line,concat,augeas,cron,ironic_inspector_config --env NAME=ironic_inspector --env HOSTNAME=undercloud --env NO_ARCHIVE= --env STEP=6 --env NET_HOST=true --volume /etc/localtime:/etc/localtime:ro --volume /tmp/tmpXkPhuw:/etc/config.pp:ro,z --volume /etc/puppet/:/tmp/puppet-etc/:ro,z --volume /etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume /etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume /etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume /etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume /var/lib/config-data:/var/lib/config-data/:rw,z --volume /dev/log:/dev/log:rw --volume /var/lib/docker-puppet/docker-puppet.sh:/var/lib/docker-puppet/docker-puppet.sh:rw,z --security-opt label=disable --volume /usr/share/openstack-puppet/modules/:/usr/share/openstack-puppet/modules/:ro --volume /var/lib/ironic:/var/lib/ironic:z --volume /var/lib/ironic-inspector/dhcp-hostsdir:/var/lib/ironic-inspector/dhcp-hostsdir:z --entrypoint /var/lib/docker-puppet/docker-puppet.sh --net host --volume /etc/hosts:/etc/hosts:ro 192.168.24.1:8787/tripleomaster/centos-binary-ironic-inspector:current-tripleo-updated-20181026224149
  2. It fails randomly, and never on the same container, but always on /var/lib/config-data relabelling.

Describe the results you received:

relabel failed "/var/lib/config-data": no such file or directory

Describe the results you expected: Container should be run and relabelling should work.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

podman-0.10.1.3-2.git6e1aeb0.el7.x86_64

Output of podman info:

host:
  BuildahVersion: 1.5-dev
  Conmon:
    package: podman-0.10.1.3-2.git6e1aeb0.el7.x86_64
    path: /usr/libexec/podman/conmon
    version: 'conmon version 1.12.0, commit: 097cf71375bd18454d1c48b9fdf8ccff2ed995f8-dirty'
  Distribution:
    distribution: '"centos"'
    version: "7"
  MemFree: 401932288
  MemTotal: 8365150208
  OCIRuntime:
    package: runc-1.0.0-54.dev.git2abd837.el7.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.0'
  SwapFree: 8556113920
  SwapTotal: 8588881920
  arch: amd64
  cpus: 8
  hostname: undercloud.localdomain
  kernel: 3.10.0-862.14.4.el7.x86_64
  os: linux
  uptime: 1h 3m 44.05s (Approximately 0.04 days)
insecure registries:
  registries:
  - 192.168.24.1:8787
  - 192.168.24.3:8787
registries:
  registries:
  - docker.io
  - registry.fedoraproject.org
  - quay.io
  - registry.access.redhat.com
  - registry.centos.org
store:
  ContainerStore:
    number: 1
  GraphDriverName: overlay
  GraphOptions:
  - overlay.override_kernel_check=true
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
  ImageStore:
    number: 22
  RunRoot: /var/run/containers/storage

Additional environment details (AWS, VirtualBox, physical, etc.): The CI jobs run in VMs, with 8vcpu, 8GB of RAM and 8GB of swap.

mheon commented 6 years ago

@rhatdan PTAL - Seems to be an SELinux issue. The code in Podman seems fine (we relabel the path in question with the mount label, nothing special) - could be the go-selinux Relabel code?

mheon commented 6 years ago

For reference:

Our call out to go-selinux to relabel volume mounts: https://github.com/containers/libpod/blob/master/libpod/container_internal_linux.go#L159-L161

EmilienM commented 6 years ago

FTR we also track it in OpenStack: https://bugs.launchpad.net/tripleo/+bug/1800737/

rhatdan commented 6 years ago

It seems pretty clear that the /var/lib/config-data directory does not exist. Podman is different than docker in that it does NOT create SRC volumes when they don't exist. If you were relying on this BUG in docker for podman, you will need to do a mkdir -p /var/lib/config-data; podman ..., to make sure the directory exists before the relabel is attempted.

I believe this is a BUG in Docker, because it can lead to users creating content through typos in their commands. For example, imagine I typo'd the above command as -v /var/lib/configdata:/var/lib/config-data,rw,Z. Docker would create the typo'd directory, and you could end up with unexpected errors when other tools looked for /var/lib/config-data on the host.
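As a rough illustration of the "make sure the directory exists first" advice, here is a minimal Python sketch of a pre-flight check that could run before invoking podman. The helper name `missing_volume_sources` is hypothetical, and the sketch assumes each spec has the `src:dst[:opts]` shape used in the command above; it is not part of podman.

```python
import os


def missing_volume_sources(volume_specs):
    """Return the host-side paths of `src:dst[:opts]` specs that do not exist."""
    missing = []
    for spec in volume_specs:
        # The host path is the first colon-separated field of the spec.
        source = spec.split(":", 1)[0]
        if not os.path.lexists(source):
            missing.append(source)
    return missing


if __name__ == "__main__":
    specs = [
        "/var/lib/config-data:/var/lib/config-data/:rw,z",
        "/etc/hosts:/etc/hosts:ro",
    ]
    for path in missing_volume_sources(specs):
        print(f"missing bind-mount source: {path}")
```

A check like this only narrows the window: a path that exists when checked can still disappear before the container starts.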

EmilienM commented 6 years ago

@rhatdan I'm 99% sure that the directory does exist. It's actually managed by Ansible, and you can see its creation here:

2018-10-30 21:08:15.606 17849 WARNING tripleoclient.v1.tripleo_deploy.Deploy [  ] TASK [Create /var/lib/config-data directory] ***********************************
2018-10-30 21:08:15.803 17849 WARNING tripleoclient.v1.tripleo_deploy.Deploy [  ] changed: [undercloud]
2018-10-30 21:08:15.842 17849 WARNING tripleoclient.v1.tripleo_deploy.Deploy [  ]

Source: http://logs.openstack.org/40/613640/1/gate/tripleo-ci-centos-7-containers-multinode/e212bb8/logs/undercloud/home/zuul/install-undercloud.log.txt.gz#_2018-10-30_21_08_15_606

Or here:

Invoked with directory_mode=None force=False remote_src=None path=/var/lib/config-data owner=None follow=True group=None unsafe_writes=None state=directory content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=svirt_sandbox_file_t selevel=s0 original_basename=None regexp=None validate=None src=None seuser=None recurse=False delimiter=None mode=None attributes=None backup=None

Source: http://logs.openstack.org/40/613640/1/gate/tripleo-ci-centos-7-containers-multinode/e212bb8/logs/undercloud/var/log/journal.txt.gz#_Oct_30_21_08_15

And here in the code: https://github.com/openstack/tripleo-heat-templates/blob/3b68405f5a94f18df989522526150bf0f53809e2/common/deploy-steps-tasks.yaml#L246-L251

I'm going to verify the remaining 1% of uncertainty today, but please note that it usually fails on one container while others get deployed with the same bind mounts (including /var/lib/config-data). Also, please note that it worked fine with podman 0.9 and seems broken for us in 0.10.

Thanks

EmilienM commented 6 years ago

@rhatdan also, to demonstrate that the error message isn't the same when the directory doesn't exist:

[root@undercloud ~]# podman run --rm -ti -v /foo:/bar busybox bash
Trying to pull docker.io/busybox:latest...Getting image source signatures
Copying blob sha256:90e01955edcd85dac7985b72a8374545eac617ccdddcc992b732e43cd42534af
 710.92 KB / 710.92 KB [====================================================] 0s
Copying config sha256:59788edf1f3e78cd0ebe6ce1446e9d10788225db3dedcfd1a59f764bad2b2690
 1.46 KB / 1.46 KB [========================================================] 0s
Writing manifest to image destination
Storing signatures
error checking path "/foo": stat /foo: no such file or directory
[root@undercloud ~]# podman run --rm -ti -v /foo:/bar:z,rw busybox bash
error checking path "/foo": stat /foo: no such file or directory

Compare error checking path "/foo": stat /foo: no such file or directory with relabel failed "/var/lib/config-data": no such file or directory. Again, let me confirm all of that today, but if you have any clue, let us know.

EmilienM commented 6 years ago

and in addition, I just found out that we make sure /var/lib/config-data really exists: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/tree/docker/docker-puppet.py#n63

So with that, it's pretty clear that the directory exists and that the error message reported when relabelling fails is probably misleading.

EmilienM commented 6 years ago

Also why is podman trying to relabel the directory while we run the container with --security-opt label=disable ?

mheon commented 6 years ago

That's a good question - @rhatdan Should we skip the :z/:Z relabel if we are running with SELinux disabled?

mheon commented 6 years ago

Meanwhile, been tracing down what's going on here. Fairly certain that the no such file or directory is an ENOENT coming out of an lsetxattr() to set the SELinux label, which seems to indicate this is coming out of the kernel?

mheon commented 6 years ago

Per the official docs on setxattr() and related calls, ENOENT is the standard "a component of the path does not exist", with no twists. So, per the kernel, the file in question does not exist, when it pretty clearly does?
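For reference, a small Python sketch of that ENOENT behaviour, assuming a Linux host where `os.setxattr` is available. It uses a `user.demo` attribute (an arbitrary name chosen for illustration) instead of `security.selinux` so that it needs no privileges, and the helper `lsetxattr_errno` is hypothetical. It shows that the call reports ENOENT for a missing entry even though the directory containing it exists.

```python
import errno
import os
import tempfile


def lsetxattr_errno(path, name="user.demo", value=b"x"):
    """Try to set an xattr without following symlinks; return the errno, or 0."""
    try:
        os.setxattr(path, name, value, follow_symlinks=False)
    except OSError as exc:
        return exc.errno
    return 0


with tempfile.TemporaryDirectory() as parent:
    # The parent directory exists, but this entry inside it does not.
    missing_child = os.path.join(parent, "gone")
    parent_exists = os.path.isdir(parent)
    child_errno = lsetxattr_errno(missing_child)

print(parent_exists, errno.errorcode.get(child_errno))
```

The path lookup fails before any attribute handling happens, so the result does not depend on whether the filesystem supports `user.*` attributes.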

EmilienM commented 6 years ago

To prove that /var/lib/config-data REALLY exists:

1) In our CI scripts, we collect logs and we copy /var/lib/config-data/puppet-generated into /var/log/config-data. The code is here: https://git.openstack.org/cgit/openstack-infra/tripleo-ci/tree/scripts/get_docker_logs.sh#n47

2) In a failing job, you can see that the directory was successfully collected: http://logs.openstack.org/40/613640/1/gate/tripleo-ci-centos-7-containers-multinode/e212bb8/logs/undercloud/var/log/config-data/

Which means /var/lib/config-data/puppet-generated does exist, and therefore /var/lib/config-data exists too.

rhatdan commented 6 years ago

We did make a change to always have a mount label even when SELinux Labeling is disabled, which is probably what the difference is here.

What kind of file system is mounted on /var/log? Just a normal ext4 or xfs?

EmilienM commented 6 years ago

partitions => {"vda1"=>{"uuid"=>"d56e4695-de15-46eb-8259-25a16ed8f6ce", "size"=>"335542239", "mount"=>"/", "label"=>"cloudimg-rootfs", "filesystem"=>"ext4"}}

so / is ext4

EmilienM commented 6 years ago

I proposed this workaround for now: https://review.openstack.org/614825 (remove -z from /var/lib/config-data mount)

rhatdan commented 6 years ago

If you are running a confined domain, this will not work unless you pre-label the content as container_file_t:s0.

giuseppe commented 5 years ago

This is a TOCTTOU issue. I think it comes from the Go binding for Chcon not being atomic: /var/lib/config-data is probably being read and written by other processes, and by the time we walk the directory some files have been deleted or moved, so the lsetxattr fails.

I think the solution is to modify Chcon to not give up on an ENOENT (or to avoid relabelling /var/lib/config-data).
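The idea of not giving up on ENOENT can be sketched roughly as follows. This is a Python illustration only, not the Go code in go-selinux and not the contents of the PR: `relabel_tree` and the `set_label` callback are hypothetical names, and the callback stands in for the real lsetxattr call so the sketch can run without SELinux. A missing root still raises; only entries that vanish during the walk are skipped.

```python
import errno
import os
import tempfile


def relabel_tree(root, set_label):
    """Call set_label(path) on root and on every entry below it.

    A missing root still raises. Entries that disappear between being
    listed and being labelled are skipped, and their paths are returned.
    """
    skipped = []
    set_label(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                set_label(path)
            except FileNotFoundError:
                # Entry vanished after listing: tolerate it and carry on.
                skipped.append(path)
    return skipped


def demo():
    """Simulate another process deleting 'b' while the tree is being walked."""
    with tempfile.TemporaryDirectory() as root:
        for name in ("a", "b", "c"):
            open(os.path.join(root, name), "w").close()
        labelled = []

        def fake_set_label(path):
            # Stand-in for the real label call; 'b' is removed mid-walk.
            if os.path.basename(path) == "b":
                os.unlink(path)
                raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), path)
            labelled.append(path)

        skipped = relabel_tree(root, fake_set_label)
        done = sorted(os.path.basename(p) for p in labelled if p != root)
        return done, [os.path.basename(p) for p in skipped]
```

Calling `demo()` labels the surviving entries and reports the vanished one separately, where an intolerant walk would have aborted the whole relabel. Whether silently skipping is acceptable depends on the caller; returning the skipped paths at least leaves room to log them.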

giuseppe commented 5 years ago

PR here: https://github.com/opencontainers/selinux/pull/37