coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/

CoreOS 1122.2.0 stable: SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) #1580

Closed alogoc closed 6 years ago

alogoc commented 8 years ago

Issue Report

After upgrading the cluster to version 1122.2.0 stable, I started seeing this error in the logs:

[181124.963513] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
dmesg | grep -i "SELinux: mount invalid" | wc -l
574

Bug

Unless forcefully rebooted, the node hangs on reboot while trying to unmount NFS Kubernetes persistent volumes. This has happened on every reboot since upgrading to version 1122.2.0 stable.
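Since the hang appears to involve NFS persistent volumes that cannot be unmounted, a hedged pre-reboot check might look like the following. The /var/lib/kubelet path is an assumption for a stock kubelet setup; adjust it if you run with a custom root directory.

```shell
# Hedged sketch: list NFS mounts still held under the kubelet directory,
# which are candidates for blocking shutdown. The path filter is an
# assumption; pass /proc/mounts on a live host.
list_stuck_nfs() {
  # $1: a file in /proc/mounts format (device mountpoint fstype opts dump pass)
  awk '$3 ~ /^nfs4?$/ && $2 ~ /^\/var\/lib\/kubelet\// { print $2 }' "$1"
}

# On a hung node, each printed path could be lazy-unmounted before reboot:
#   umount -l "$path"
[ -r /proc/mounts ] && list_stuck_nfs /proc/mounts
```

Lazy unmount (umount -l) detaches the mount point immediately and cleans up later, which is usually enough to let systemd finish the shutdown.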

CoreOS Version

$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1122.2.0
VERSION_ID=1122.2.0
BUILD_ID=2016-09-06-1449
PRETTY_NAME="CoreOS 1122.2.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

SELinux status

SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             mcs
Current mode:                   permissive
Mode from config file:          permissive
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Max kernel policy version:      30

Environment

What hardware/cloud provider/hypervisor is being used to run CoreOS?

VMware

gianrubio commented 7 years ago

Issue Report

Same issue here using AWS EBS volumes.

Bug

Some volumes are not mounted

[  793.976852] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  855.417000] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  864.572472] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  923.424917] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  933.778011] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  994.015922] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1003.962429] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)

CoreOS Version

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1235.5.0
VERSION_ID=1235.5.0
BUILD_ID=2017-01-08-0037
PRETTY_NAME="Container Linux by CoreOS 1235.5.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Environment

What hardware/cloud provider/hypervisor is being used to run CoreOS?

AWS

jonaz commented 7 years ago

Same issue after automatic upgrade to:

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1235.5.0
VERSION_ID=1235.5.0
BUILD_ID=2017-01-08-0037
PRETTY_NAME="Container Linux by CoreOS 1235.5.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
dangoncalves commented 7 years ago

I can confirm this bug on kvm/libvirt too:

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1235.5.0
VERSION_ID=1235.5.0
BUILD_ID=2017-01-08-0037
PRETTY_NAME="Container Linux by CoreOS 1235.5.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
GJKrupa commented 7 years ago

Seeing the same under ESXi running a clean install of version 1284.2.0. The log line shows up every time the master tries to deploy kubernetes-dashboard on one of the minions.

roffe commented 7 years ago

I am observing the same errors on CoreOS stable (1235.6.0)

weikinhuang commented 7 years ago

Same issue here as well:

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1298.3.0
VERSION_ID=1298.3.0
BUILD_ID=2017-02-02-0148
PRETTY_NAME="Container Linux by CoreOS 1298.3.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Running an ESXi VM with nfs4 mounts in a Kubernetes cluster. Happens as soon as an auto-update occurs; the initial installation is fine.

derriana17 commented 7 years ago

Hi, it seems like a few of us have this issue, but I don't see any possible solutions yet - are there any?

eskaaren commented 7 years ago

Same here: CoreOS stable 1235.9.0 on vSphere 6.

derriana17 commented 7 years ago

Has anybody got any fixes to this?


pgburt commented 7 years ago

Do we have a recommendation for what folks who are experiencing this should do?

Happy to see this is flagged as a p0. Until a patch is out, what's the best course of action to take in the interim?

qrpike commented 7 years ago

Still having this issue using latest stable version of CoreOS

styxlab commented 7 years ago

Experiencing the same, but only after a reboot, not on a freshly provisioned machine:

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1235.12.0
VERSION_ID=1235.12.0
BUILD_ID=2017-02-23-0222
PRETTY_NAME="Container Linux by CoreOS 1235.12.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Workaround: restart the NFS server; in my case, systemctl restart nfsd.service.

weikinhuang commented 7 years ago

I just noticed that it happens on every reboot, not just when it auto-updates.

savar commented 7 years ago

We have this issue as well.

TimJones commented 7 years ago

Still seeing this error on bare metal too (matchbox 0.5.0 w/ bootkube 0.3.9):

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1298.5.0
VERSION_ID=1298.5.0
BUILD_ID=2017-02-28-0013
PRETTY_NAME="Container Linux by CoreOS 1298.5.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             mcs
Current mode:                   permissive
Mode from config file:          permissive
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Max kernel policy version:      30
remoe commented 7 years ago

I've seen this on CoreOS Stable 1298.6.0 on vSphere 6.0.

reignblack commented 7 years ago

I'm also experiencing the same error when I reboot; afterwards it spams these errors for about a minute and then stops. I'm running stable 1298.6.0 and, similar to TimJones, deploying to bare metal via matchbox/tectonic.

deitch commented 7 years ago

Same here, running Stable 1298.7.0 on AWS.

Took me forever to figure out the source. It came from Kubernetes trying to run Weave; a long trail of errors led here.

euank commented 7 years ago

This error shows up in dmesg on my machines (both Container Linux and Fedora) each time I run a docker container.

However, I have yet to see any adverse effect from it.

The original issue mentions it causing a "... reboot to hang while trying to umount NFS kubernetes persistent volumes", and there are a few other issues attributed to it here, but I worry that each of the mentioned issues is caused by something else and this is a red herring.

For my machines, ignoring this dmesg output hasn't caused any issues yet. This includes a Kubernetes cluster with some nfs mount churn, as well as machines just running a few once-off containers.

If anyone is confident that this is causing real impact beyond a dmesg log line, details of the impact and how the two were linked would be helpful!

deitch commented 7 years ago

I removed --selinux-enabled from the docker engine command line, but still had many other issues. Now that they are all resolved, I will try restoring --selinux-enabled and see if there are any side effects.
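For anyone wanting to try the same, a minimal sketch of dropping the flag on Container Linux is a systemd drop-in. The drop-in path and the DOCKER_OPTS variable are assumptions based on the stock docker.service (which expands $DOCKER_OPTS in its ExecStart); verify against your unit before using.

```ini
# Hypothetical /etc/systemd/system/docker.service.d/10-selinux.conf;
# assumes the stock docker.service honors $DOCKER_OPTS.
[Service]
Environment=DOCKER_OPTS=--selinux-enabled=false
```

After writing the drop-in, run systemctl daemon-reload followed by systemctl restart docker for it to take effect.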

nyilmaz commented 7 years ago

@euank along with the dmesg log, services installed in our pods are not accessible either locally or from outside the pod; i.e. there are no incoming or outgoing TCP connections to or from the pods, although the interfaces are up and DNS is well configured.

Update was 4.9.16-coreos-r1 -> 4.9.24-coreos.

Our other machines also restarted, but some of them recovered without downtime.

The difference is that the problematic machine's last dmesg message (after which it goes silent) is:

SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)

It seems this somehow blocks other services (like docker networking in this case).

euank commented 7 years ago

@nyilmaz I suspect the networking issue you're seeing is #1936. We're pushing an additional update to stable to address that issue. Sorry!

Assuming that's the issue, a workaround is posted on that thread and it's unrelated to this dmesg output. If you can double check that workaround works, or the update (1353.7.0) works once it rolls out, that would help clarify.

bchanan03 commented 7 years ago

Apr 27 12:34:34 ip-10-0-2-40.us-west-2.compute.internal dockerd[2415]: time="2017-04-27T12:34:34.657727261Z" level=error msg="Create container failed with error: invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"process_linux.go:359: container init caused \\\\\\\"rootfs_linux.go:53: mounting \\\\\\\\\\\\\\\"/data/k8s/kubelet/pods/5932bf47-2b3f-11e7-ae8d-023161a008ef/etc-hosts\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\"/var/lib/docker/overlay/85e7976aff2ede0c039d033503b6dbb72154a2110a0c5678f0e569d8fc256c29/merged\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\"/var/lib/docker/overlay/85e7976aff2ede0c039d033503b6dbb72154a2110a0c5678f0e569d8fc256c29/merged/etc/hosts\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\"not a directory\\\\\\\\\\\\\\\"\\\\\\\"\\\"\\n\""

Just moved the kubelet from hyperkube to kube_wrapper. The issue appeared once again; all pods with persistent volumes fail now.

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1353.7.0
VERSION_ID=1353.7.0
BUILD_ID=2017-04-26-2154
PRETTY_NAME="Container Linux by CoreOS 1353.7.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"

mikesplain commented 7 years ago

We're seeing this issue as well. Very similar to @nyilmaz, we see blocking happening in docker and kubelet-wrapper. We're running the latest stable build from scratch:

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1353.7.0
VERSION_ID=1353.7.0
BUILD_ID=2017-04-26-2154
PRETTY_NAME="Container Linux by CoreOS 1353.7.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"

 # dmesg | grep -i "SELinux: mount invalid" | wc -l
11
 # uptime
 15:26:52 up 30 min,  1 user,  load average: 0.02, 0.11, 0.27

Compared to one of our older hosts that isn't having this issue:

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1298.7.0
VERSION_ID=1298.7.0
BUILD_ID=2017-03-31-0215
PRETTY_NAME="Container Linux by CoreOS 1298.7.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

 # dmesg | grep -i "SELinux: mount invalid" | wc -l
0
 # uptime
 15:28:26 up 2 days, 22:48,  1 user,  load average: 1.20, 0.59, 0.27
euank commented 7 years ago

@bchanan03 based on the directory in that error, it looks like you're using the --root-dir flag on the kubelet.

That flag isn't supported with the kubelet-wrapper and will break unless you make an additional effort to bindmount the required extra directories. That error message is basically saying that the kubelet made mounts under --root-dir inside the kubelet-wrapper chroot and that docker cannot find them because the kubelet-wrapper script didn't expose that directory.

I don't think that issue is related to this one, though if you continue to run into issues after either no longer using the --root-dir flag or after adjusting the kubelet-wrapper's mount args, please open a new issue.
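For anyone who does need --root-dir with the kubelet-wrapper, a hedged sketch of the extra bind mount described above is to pass rkt volume arguments through to the wrapper. The RKT_RUN_ARGS variable, the volume name, and the paths here are assumptions (the /data/k8s/kubelet path mirrors the error log above); check your wrapper script's actual interface.

```ini
# Hypothetical kubelet.service excerpt. "kubelet-root" is a made-up
# volume name; /usr/lib/coreos/kubelet-wrapper is the stock wrapper path.
[Service]
Environment="RKT_RUN_ARGS=--volume kubelet-root,kind=host,source=/data/k8s/kubelet --mount volume=kubelet-root,target=/data/k8s/kubelet"
ExecStart=/usr/lib/coreos/kubelet-wrapper --root-dir=/data/k8s/kubelet
```

This makes the custom root directory visible both inside the wrapper's chroot and to docker on the host, which is what the "not a directory" error indicates was missing.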

mars64 commented 7 years ago

@mikesplain I just wanted to demonstrate a host we have on the same old version, with the error:

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1298.7.0
VERSION_ID=1298.7.0
BUILD_ID=2017-03-31-0215
PRETTY_NAME="Container Linux by CoreOS 1298.7.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
# dmesg | grep -i "SELinux: mount invalid" | wc -l
12
# uptime
 17:40:36 up 2 days,  1:17,  1 user,  load average: 0.96, 0.67, 0.36

My guess is we're doing something incorrectly, but I figured I'd post what I have. We also see this on 1353.7.0, but there are other issues I don't yet understand that are preventing us from running the updated version.

mikesplain commented 7 years ago

@mars64 Ahh fair enough. Thanks!

gdmello commented 7 years ago

I'm seeing this issue on a later version as well, on an EC2 instance (m4.xlarge):

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1353.8.0
VERSION_ID=1353.8.0
BUILD_ID=2017-05-30-2322
PRETTY_NAME="Container Linux by CoreOS 1353.8.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"

Actual error from the logs (via the AWS console), since the instance is no longer reachable via SSH:

SSH host key: SHA256:<key>(DSA)
SSH host key: SHA256:xxxxxxxxx[   27.961692] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready<sha key> (ECDSA)
SSH host key: SHA256:<shakey> (ED25519)
SSH host key: SHA256:<sha key> (RSA)
eth0: 10.100.100.116 fe80::67:8aff:fee6:8e57

ip-10-100-100-116 login: [   31.061655] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[   32.412445] Netfilter messages via NETLINK v0.30.
[   32.422056] ip_set: protocol 6
[   32.530060] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   36.402077] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[   36.417821] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[   37.456837] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[   37.476006] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[63950.093111] dockerd: page allocation failure: order:4, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
[63950.099049] dockerd cpuset=/ mems_allowed=0
[63950.101490] CPU: 1 PID: 22782 Comm: dockerd Not tainted 4.11.6-coreos #1
[63950.105420] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
[63950.109066] Call Trace:
[63950.110550]  dump_stack+0x63/0x90
[63950.112543]  warn_alloc+0x11c/0x1b0
[63950.114635]  ? __alloc_pages_direct_compact+0x55/0x110
[63950.117660]  __alloc_pages_slowpath+0xd6c/0xe50
[63950.120486]  ? wakeup_kswapd+0xdd/0x150
[63950.122787]  __alloc_pages_nodemask+0x21b/0x230
[63950.125489]  alloc_pages_current+0x8c/0x110
[63950.128020]  kmalloc_order+0x18/0x40
[63950.130188]  kmalloc_order_trace+0x24/0xa0
[63950.132602]  __kmalloc+0x1a2/0x210
[63950.134651]  ? __list_lru_init+0x35/0x210
[63950.137001]  __list_lru_init+0x1a8/0x210
[63950.139319]  sget_userns+0x22d/0x4d0
[63950.141666]  ? get_anon_bdev+0x100/0x100
[63950.144130]  sget+0x7d/0xa0
[63950.145840]  ? get_anon_bdev+0x100/0x100
[63950.148209]  ? 0xffffffffc04e9d60
[63950.150700]  mount_nodev+0x30/0xa0
[63950.152790]  0xffffffffc04e90e8
[63950.154751]  mount_fs+0x38/0x170
[63950.156721]  vfs_kern_mount+0x67/0x110
[63950.159003]  do_mount+0x1e5/0xcb0
[63950.161094]  ? _copy_from_user+0x4e/0x80
[63950.163824]  SyS_mount+0x94/0xd0
[63950.165999]  do_syscall_64+0x5a/0x160
[63950.168460]  entry_SYSCALL64_slow_path+0x25/0x25
[63950.172058] RIP: 0033:0x654f7a
[63950.174147] RSP: 002b:000000c43b14ee30 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[63950.179414] RAX: ffffffffffffffda RBX: 000000c42001ca0c RCX: 0000000000654f7a
[63950.184083] RDX: 000000c426adbdd8 RSI: 000000c4293cddc0 RDI: 000000c426adbdd0
[63950.189498] RBP: 000000c43b14eee0 R08: 000000c4281ae1a0 R09: 0000000000000000
[63950.194196] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000070
[63950.199207] R13: 0000000000001dc0 R14: 0000000000000045 R15: 0000000005555555
[63950.204506] Mem-Info:
[63950.206083] active_anon:1236082 inactive_anon:3926 isolated_anon:0
[63950.206083]  active_file:27545 inactive_file:17530 isolated_file:0
[63950.206083]  unevictable:0 dirty:535 writeback:29 unstable:0
[63950.206083]  slab_reclaimable:331728 slab_unreclaimable:1737063
[63950.206083]  mapped:27337 shmem:15893 pagetables:56081 bounce:0
[63950.206083]  free:35374 free_pcp:0 free_cma:0

Seems like docker0 didn't come up. The only customization: mounting an ext4 EBS volume and configuring docker to log to it.

This issue also occurred a day after the k8s nodes were provisioned.

lucab commented 7 years ago

@gdmello as stated in the comments above, the "mount invalid" entry is just unrelated noise in the log. The issue you are experiencing is completely separate and seems to be due to the kernel not being able to allocate a fairly large range of contiguous pages. This can be due to a kernel bug, a hypervisor bug, or abnormal memory pressure.
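As a hedged sketch for this diagnosis: "order:4" in the trace means the kernel needed 2^4 = 16 contiguous pages (64 KiB with 4 KiB pages). In /proc/buddyinfo, the Nth count after the zone name is the number of free order-N blocks, so zeros in the higher-order columns indicate fragmentation.

```shell
# Report, per memory zone, the highest order that still has free
# contiguous blocks. An order below 4 would explain the allocation failure.
highest_free_order() {
  # $1: a file in /proc/buddyinfo format ("Node 0, zone Normal c0 c1 ...")
  awk '{ top = -1
         for (i = 5; i <= NF; i++) if ($i > 0) top = i - 5
         printf "zone %s: highest free order %d\n", $4, top }' "$1"
}

[ -r /proc/buddyinfo ] && highest_free_order /proc/buddyinfo
```

If the high orders stay at zero, writing 1 to /proc/sys/vm/compact_memory may help on kernels built with compaction, though sustained fragmentation usually points back at the workload's memory pressure.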

Please try to reproduce it on the latest stable/beta/alpha, and also check your telemetry for the memory consumption profile in the period leading up to the issue. If it still occurs, please open a dedicated bug report with all the information.

gdmello commented 7 years ago

Thanks @lucab!

You are right - I do see this error even on a healthy Kubernetes node. So it's a non-issue.

euank commented 7 years ago

The offending message isn't printed on the current alpha. It's fixed by shipping a newer version of docker (17.06.1), which doesn't have this issue. (I verified it was the docker version change, not kernel changes, that fixed this.)

We should be able to close this once we have 17.06+ on stable.
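To check whether a given node already has the fixed docker, a small hedged helper can compare versions with sort -V; on a live host you would feed it the output of docker version --format '{{.Server.Version}}'.

```shell
# True if the given docker version is at least 17.06, the release
# identified above as no longer emitting the message.
at_least_1706() {
  [ "$(printf '%s\n' 17.06 "$1" | sort -V | head -n 1)" = "17.06" ]
}

at_least_1706 "17.09.0-ce" && echo "17.09.0-ce: message should be gone"
```

sort -V orders dotted version strings numerically per component, so 1.12.6 sorts before 17.06 while 17.09.0-ce sorts after it.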

roffe commented 6 years ago

The noise is still spewing out on busy nodes; now I'm on 1520.7.0.

roffe commented 6 years ago

@euank will 17.06 be supported by Kubernetes? Correct me if I'm wrong, but I've only seen talk of 1.11.2, 1.12.6, 1.13.1, and 17.03.2 being validated so far.

euank commented 6 years ago

@roffe The answer to that is a little complicated. It's possible Kubernetes will move to recommending certain docker API versions, regardless of the release version (https://github.com/kubernetes/kubernetes/issues/53221). If they move to recommending it in that way, 17.06/17.09 both support the API version they use and would thus implicitly be considered valid I believe (with further validation and choice up to specific K8s distribution's discretion). I don't know any more than is in that issue; for more details or if you have other questions, you'd have to ask the Kubernetes project yourself.

As an idle anecdote, I personally run my K8s cluster against 17.09. I can't recommend that generally, since my personal requirements and testing are less authoritative than the Kubernetes project's, but I will point out that the upstream recommendations are generally based not on known problems with newer versions but rather on a lack of evidence altogether.

hookenz commented 6 years ago

Container Linux by CoreOS stable (1576.4.0)
Update Strategy: No Reboots

core@compute1 ~ $ docker --version
Docker version 17.09.0-ce, build afdb6d4

euank commented 6 years ago

I'm closing this based on my past few comments in this issue; on recent versions of docker (which are shipped by default in all channels now) it shouldn't appear, and when it did appear I think it was typically benign.

If you still encounter this on a recent version of docker and it appears to cause real impact, please do open a new issue!