CrashLoopBackOff on the database

MaxenceAdnot commented 8 years ago

I'm trying to deploy the deis-dev chart on AWS and it seems that the database refuses to start...

CoreOS version is 976.0.0 (alpha)

docker logs 6ece7bde9d :

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

initdb: could not change permissions of directory "/var/lib/postgresql/data": Operation not permitted
fixing permissions on existing directory /var/lib/postgresql/data ...

I also noticed this line in the kernel logs:

Mar 12 21:07:48 ip-10-0-0-144.eu-west-1.compute.internal dockerd[1128]: time="2016-03-12T21:07:48.916811751Z" level=warning msg="DEPRECATED: Setting host configuration options when the container starts is deprecated and will be removed in
Mar 12 21:07:48 ip-10-0-0-144.eu-west-1.compute.internal systemd[1]: Started docker container 6ece7bde9d462a3a1d66d480a9e9aa321e895f61c8dd6f7e30f0fa14a7e9fd72.
Mar 12 21:07:49 ip-10-0-0-144.eu-west-1.compute.internal kernel: SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
Mar 12 21:07:49 ip-10-0-0-144.eu-west-1.compute.internal systemd[1]: Stopped docker container 6ece7bde9d462a3a1d66d480a9e9aa321e895f61c8dd6f7e30f0fa14a7e9fd72.

I don't really like this SELinux message :/

Any ideas?

slack commented 8 years ago

We spoke a bit on the new Deis slack! (Shameless plug, join us at https://slack.deis.io)

I was able to reproduce on a CoreOS 983.0.0 cluster out of the box. May be related to userns + selinux + Docker 1.10.2 but I haven't pinned down a specific change just yet.

I was able to successfully boot the postgres component by specifying a volume, mounted at /var/lib/postgres.

@MaxenceAdnot if you could modify the deis-database-rc.yaml with the changes in https://github.com/deis/charts/pull/160/files that would be awesome.

That can be accomplished by:

helm uninstall deis-dev <-- will remove your deis install
helm edit deis-dev <-- hand edit tpl/deis-database-rc.yaml
helm generate deis-dev
helm install deis-dev

OR:

kubectl --namespace=deis edit rc deis-database <-- hand-edit the volume information from the PR
kubectl --namespace=deis delete pod deis-database-XYZ123 <-- delete the database pod

Edit: clarity
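
A rough sketch of the second option, for anyone who does not want to look up the pod name by hand. It assumes the replication controller runs a single pod whose name starts with deis-database- (inferred from the placeholder name above); pods_with_prefix is a hypothetical helper, not a kubectl feature.

```shell
#!/bin/sh
# Print the names of pods whose name starts with the given prefix.
# Expects `kubectl get pods --no-headers` output on stdin, where the
# first column is the pod name.
pods_with_prefix() {
    awk -v prefix="$1" 'index($1, prefix) == 1 { print $1 }'
}

# Example (not run here; needs access to the cluster):
#   pod=$(kubectl --namespace=deis get pods --no-headers \
#           | pods_with_prefix deis-database-)
#   kubectl --namespace=deis delete pod "$pod"
```

The replication controller should then recreate the pod with the edited volume settings.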

MaxenceAdnot commented 8 years ago

Thank you for the quick fix. I will try this in a few hours and let you know.

MaxenceAdnot commented 8 years ago

Your patch is working fine! Thank you @slack

slack commented 8 years ago

Excellent, thanks for testing! Will get this merged in the morning.

bacongobbler commented 8 years ago

The upstream issue for this is https://github.com/docker/docker/issues/7952. It seems that Docker + btrfs with SELinux enabled causes this issue. Can you try running docker on a non-btrfs backend and see if you see the same issue?

The volume change did not necessarily resolve the issue; it just circumvented it. Volume mounts are likely non-btrfs, so SELinux is playing nicely with them. This project is also not designed around being backed by a persistent disk, as demonstrated in the end-to-end tests at https://ci.deis.io/.

You'll likely see this issue again in another form in the future, so I don't feel this problem is resolved. I'm going to revert this change and further debug this issue.

slack commented 8 years ago

core@ip-10-0-0-175 ~ $ docker info
Containers: 43
 Running: 7
 Paused: 0
 Stopped: 36
Images: 10
Server Version: 1.10.2
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: null host bridge
Kernel Version: 4.4.4-coreos
Operating System: CoreOS 983.0.0 (Coeur Rouge)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.679 GiB
Name: ip-10-0-0-175.us-west-2.compute.internal
ID: OP6Q:UNHM:BTT4:FCYX:3QTL:HZUQ:ET7E:EOXC:RYJO:L5ZK:VLQO:YYCH

MaxenceAdnot commented 8 years ago

I don't think CoreOS is using btrfs at all (as far as I remember, Docker on CoreOS uses overlay as its storage driver).

bacongobbler commented 8 years ago

Some concrete evidence supporting this would be this comment: https://github.com/docker/docker/issues/7952#issuecomment-54989852

@slack I think this includes overlayfs as well, judging from that comment:

Kernel engineers are working on a fix for this and potentially Overlayfs if it gets merged into the container.

bacongobbler commented 8 years ago

Reading further down the page, the temporary fix seems to be removing --selinux-enabled from the docker daemon's option flag list. Overlay is indeed broken as well.

Source for fix: https://github.com/docker/docker/issues/7952#issuecomment-56435657
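
To check whether a given host is in the configuration discussed here, something like the following sketch may help. It only reads docker info output; storage_driver and is_affected_driver are hypothetical helper names, and the list of affected drivers (btrfs, overlay) is taken from this thread, not from an authoritative source.

```shell
#!/bin/sh
# Extract the storage driver name from `docker info` output on stdin.
storage_driver() {
    sed -n 's/^ *Storage Driver: *//p'
}

# Succeed if the driver is one of those reported in this thread as
# problematic when the daemon runs with --selinux-enabled.
is_affected_driver() {
    case "$1" in
        btrfs|overlay) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not run here; needs a Docker daemon):
#   driver=$(docker info 2>/dev/null | storage_driver)
#   if is_affected_driver "$driver"; then
#       echo "storage driver $driver may hit the SELinux issue"
#   fi
```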

bacongobbler commented 8 years ago

@MaxenceAdnot how did you deploy this cluster on AWS? Did you just follow the Kubernetes documentation and run KUBERNETES_PROVIDER=aws ./cluster/kube-up.sh?

MaxenceAdnot commented 8 years ago

I used the kube-aws tool provided by CoreOS to deploy Kubernetes on AWS.

bacongobbler commented 8 years ago

Also note that Docker v1.9.1 refuses to start with SELinux enabled on the overlay driver due to the above bug. This is what I get on Fedora 23 with SELinux enabled and with the overlay driver:

[vagrant@localhost ~]$ docker version -f '{{.Client.Version}}' 2>/dev/null
1.9.1
[vagrant@localhost ~]$ cat /etc/sysconfig/docker | grep OPTIONS=
OPTIONS='--selinux-enabled --log-driver=journald'
[vagrant@localhost ~]$ cat /etc/sysconfig/docker-storage | grep DOCKER_STORAGE_OPTIONS=
DOCKER_STORAGE_OPTIONS="-s overlay"
[vagrant@localhost ~]$ sudo journalctl -u docker --no-pager
-- Logs begin at Sat 2015-12-26 10:24:57 UTC, end at Tue 2016-03-15 20:02:41 UTC. --
Mar 15 20:01:08 localhost.localdomain systemd[1]: Starting Docker Application Container Engine...
Mar 15 20:01:08 localhost.localdomain docker[19914]: time="2016-03-15T20:01:08.973114867Z" level=fatal msg="Error starting daemon: SELinux is not supported with the overlay graph driver"
Mar 15 20:01:08 localhost.localdomain systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Mar 15 20:01:08 localhost.localdomain systemd[1]: Failed to start Docker Application Container Engine.
Mar 15 20:01:08 localhost.localdomain systemd[1]: docker.service: Unit entered failed state.
Mar 15 20:01:08 localhost.localdomain systemd[1]: docker.service: Failed with result 'exit-code'.

Setting --selinux-enabled=false in the options list allows me to start Docker and fixes the issue noted in the OP.

@MaxenceAdnot can you confirm that disabling SELinux support on your CoreOS cluster fixes the issue?
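
For the Fedora setup shown above, the option change could be scripted roughly as follows. This is a sketch that assumes the flag lives in the OPTIONS= line of /etc/sysconfig/docker, as in the output above; disable_selinux_flag is a hypothetical helper name.

```shell
#!/bin/sh
# Rewrite any --selinux-enabled or --selinux-enabled=<value> flag on
# stdin to --selinux-enabled=false, leaving all other text untouched.
disable_selinux_flag() {
    sed 's/--selinux-enabled\(=[a-z]*\)\{0,1\}/--selinux-enabled=false/'
}

# Example (not run here; edits system configuration):
#   sudo cp /etc/sysconfig/docker /etc/sysconfig/docker.bak
#   disable_selinux_flag < /etc/sysconfig/docker.bak \
#       | sudo tee /etc/sysconfig/docker > /dev/null
#   sudo systemctl restart docker
```

Running the filter twice gives the same result, so it is safe to reapply.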

bacongobbler commented 8 years ago

This is also something we should bring up with CoreOS so their Kubernetes clusters work OOTB.

carmstrong commented 8 years ago

I'm seeing this as well on kube-aws. Details:

core@ip-10-0-0-50 ~ $ uname -a
Linux ip-10-0-0-50.us-west-2.compute.internal 4.4.3-coreos #2 SMP Thu Mar 3 23:21:00 UTC 2016 x86_64 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz GenuineIntel GNU/Linux

core@ip-10-0-0-50 ~ $ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=976.0.0
VERSION_ID=976.0.0
BUILD_ID=2016-03-03-2324
PRETTY_NAME="CoreOS 976.0.0 (Coeur Rouge)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

core@ip-10-0-0-50 ~ $ docker --version
Docker version 1.10.2, build eb1bdb1

I'm using the latest kube-aws release as well as @slack's modified chart (using the latest from deis-dev) to no avail.

Investigating the Docker journal:

$ journalctl -u docker --no-pager

The following seems to be relevant:

Mar 16 05:50:34 ip-10-0-0-50.us-west-2.compute.internal dockerd[1241]: time="2016-03-16T05:50:34.686537706Z" level=warning msg="DEPRECATED: Setting host configuration options when the container starts is deprecated and will be removed in Docker 1.12"
Mar 16 05:50:34 ip-10-0-0-50.us-west-2.compute.internal dockerd[1241]: time="2016-03-16T05:50:34.800687111Z" level=warning msg="HostsPath set to \"/var/lib/docker/containers/4b28d46ca43b7b6752761225e3d1ec129653b4dce07769386a59a2a5d464eb21/hosts\", but can't stat this filename (err = stat /var/lib/docker/containers/4b28d46ca43b7b6752761225e3d1ec129653b4dce07769386a59a2a5d464eb21/hosts: no such file or directory); skipping"
Mar 16 05:50:34 ip-10-0-0-50.us-west-2.compute.internal dockerd[1241]: time="2016-03-16T05:50:34.956670266Z" level=warning msg="signal: killed"
Mar 16 05:50:35 ip-10-0-0-50.us-west-2.compute.internal dockerd[1241]: time="2016-03-16T05:50:35.012629347Z" level=warning msg="HostsPath set to \"/var/lib/docker/containers/4b28d46ca43b7b6752761225e3d1ec129653b4dce07769386a59a2a5d464eb21/hosts\", but can't stat this filename (err = stat /var/lib/docker/containers/4b28d46ca43b7b6752761225e3d1ec129653b4dce07769386a59a2a5d464eb21/hosts: no such file or directory); skipping"
Mar 16 05:50:35 ip-10-0-0-50.us-west-2.compute.internal dockerd[1241]: time="2016-03-16T05:50:35.019676616Z" level=error msg="error locating sandbox id 29c8df24230e8e7f732b3eeff8c02150198fb212b015d44857cf4ab8383f58c2: sandbox 29c8df24230e8e7f732b3eeff8c02150198fb212b015d44857cf4ab8383f58c2 not found"
Mar 16 05:50:35 ip-10-0-0-50.us-west-2.compute.internal dockerd[1241]: time="2016-03-16T05:50:35.019723965Z" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /var/lib/docker/containers/4b28d46ca43b7b6752761225e3d1ec129653b4dce07769386a59a2a5d464eb21/shm: invalid argument"
Mar 16 05:50:35 ip-10-0-0-50.us-west-2.compute.internal dockerd[1241]: time="2016-03-16T05:50:35.019745608Z" level=error msg="Error unmounting container 4b28d46ca43b7b6752761225e3d1ec129653b4dce07769386a59a2a5d464eb21: not mounted"
Mar 16 05:50:35 ip-10-0-0-50.us-west-2.compute.internal dockerd[1241]: time="2016-03-16T05:50:35.019837256Z" level=warning msg="HostsPath set to \"/var/lib/docker/containers/4b28d46ca43b7b6752761225e3d1ec129653b4dce07769386a59a2a5d464eb21/hosts\", but can't stat this filename (err = stat /var/lib/docker/containers/4b28d46ca43b7b6752761225e3d1ec129653b4dce07769386a59a2a5d464eb21/hosts: no such file or directory); skipping"
Mar 16 05:50:35 ip-10-0-0-50.us-west-2.compute.internal dockerd[1241]: time="2016-03-16T05:50:35.019898767Z" level=error msg="Handler for POST /containers/4b28d46ca43b7b6752761225e3d1ec129653b4dce07769386a59a2a5d464eb21/start returned error: Container command not found or does not exist."
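
To cut the journal down to just the error-level messages, a filter along these lines can be used. It is a sketch that assumes the time="..." level=... msg="..." line layout shown above; docker_error_messages is a hypothetical helper name.

```shell
#!/bin/sh
# Print only the msg="..." payload of level=error lines read on stdin.
docker_error_messages() {
    sed -n 's/.*level=error msg="\(.*\)"$/\1/p'
}

# Example (not run here; needs systemd and the docker unit):
#   journalctl -u docker --no-pager | docker_error_messages
```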

Note that selinux is enabled by the /usr/lib/coreos/dockerd wrapper on CoreOS. There is a configuration directive, ARG_SELINUX, which can be added to /run/flannel_docker_opts.env:

ARG_SELINUX="nowayjose"

Restarting dockerd with sudo systemctl restart docker resolves this for the main Docker daemon - however, early-docker.service is unaffected. That one is harder to patch, it seems.
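
A sketch of that workaround as a script, for the main Docker daemon only. The file path and the ARG_SELINUX value are copied from the comment above; set_env_directive is a hypothetical helper, and since /run is normally a tmpfs the change is assumed not to survive a reboot.

```shell
#!/bin/sh
# Set KEY="VALUE" in an env file, replacing any existing line for KEY
# and keeping every other line as it was.
set_env_directive() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    if [ -f "$file" ]; then
        # grep exits non-zero when nothing is left to print; ignore that.
        grep -v "^${key}=" "$file" > "$tmp" || true
    fi
    printf '%s="%s"\n' "$key" "$value" >> "$tmp"
    # Write back in place so the file keeps its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example (not run here; run as root on the CoreOS node):
#   set_env_directive /run/flannel_docker_opts.env ARG_SELINUX nowayjose
#   systemctl restart docker
```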

carmstrong commented 8 years ago

dmesg also shows:

[  153.677041] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  162.133212] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  163.069155] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  168.453115] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  212.856207] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  218.218618] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  219.433233] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  237.072790] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  237.242488] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  237.377854] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  237.770179] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
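
A quick way to see whether the message is still being logged after a configuration change is to count its occurrences before and after. A minimal sketch, with count_selinux_mount_errors as a hypothetical helper name:

```shell
#!/bin/sh
# Count lines on stdin that contain the SELinux mount error seen above.
# grep -c prints 0 but exits non-zero when there are no matches, so the
# exit status is forced to success.
count_selinux_mount_errors() {
    grep -c 'SELinux: mount invalid' || true
}

# Example (not run here; dmesg may need root):
#   dmesg | count_selinux_mount_errors
```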

MaxenceAdnot commented 8 years ago

On the beta channel, @carmstrong and I ran into issues with the kube-aws tool: docker ps shows no running containers and Kubernetes is never initialized.

bacongobbler commented 8 years ago

Just for posterity, I did notice in their documentation that the only Release Channel value supported at this time is alpha, so that seems to be the cause of the issues you're having on the beta channel.

slack commented 8 years ago

Yep, Beta worked without issue until the recent kubelet-wrapper merges (which don't exist in beta). That is why k8s fails to boot using v0.4.1 of kube-aws on Beta.

arkkanoid commented 8 years ago

Hi, it seems I have the same error, which I describe in https://github.com/deis/workflow/issues/210. I use Kubernetes 1.2, CoreOS alpha (1010.1.0), Deis beta2 and kube-aws 0.6.0. Should this error be solved by now?

bacongobbler commented 8 years ago

@arkkanoid the root issue is https://github.com/coreos/coreos-kubernetes/issues/317. There's nothing we can do on our end other than contribute upstream.

mattk42 commented 8 years ago

FYI, I ran into what I think was the same thing on GKE running Kubernetes 1.2.6 with Workflow 2.6.0. I have to run 1.2.6 for my own reasons; from what I can tell, this doesn't seem to be a problem in 1.3.X or 1.4.0.

deis / postgres

CrashLoopBackOff on the database #63