elastic / ansible-elastic-cloud-enterprise

Ansible playbooks for Elastic Cloud Enterprise (ECE)
https://www.elastic.co/products/ece

ECE 2.13+ and 3.x does not bootstrap on SLES #155

Open obierlaire opened 2 years ago

obierlaire commented 2 years ago

Starting with 2.13 (including 3.0 and above), ECE does not bootstrap on SLES 12 or 15 with Docker 19 or 20:

Details

bootstrap logs:

- Starting local runner {}
- Started local runner {}
- Waiting for runner container node {}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Errors have caused Elastic Cloud Enterprise installation to fail - Please check logs 
  Node type - initial
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

in docker logs of runner:

ok: run: docker-socket-proxy: (pid 30) 2s
Traceback (most recent call last):
  File "/elastic_cloud_apps/runner/write_config.py", line 10, in <module>
    with open('runner.conf', 'w') as dest:
PermissionError: [Errno 13] Permission denied: 'runner.conf'

What I noticed is that the ece user is present in passwd and group, and elastic does belong to the ece group, so this failure should not happen.

elastic:x:1000:1000::/home/elastic:/bin/false
ece:x:199:199::/home/ece:/bin/bash
ece:x:199:elastic
elastic:x:1000:

Indeed, here is the directory that holds runner.conf:

$ ls -lah /elastic_cloud_apps/runner
total 16K
drwxrwxr-x 1 199     199       65 Apr 28 14:36 .

On Ubuntu, the ece user is shown as the owner of /elastic_cloud_apps/runner, but on SLES only its uid 199 is shown. In the bootstrapper Docker container, the owner is displayed as ece, not as its uid.

Also, the following command does not work:

$ setuser ece whoami
setuser: user ece not found

This does not make sense, as the ece user is defined in /etc/passwd. Again, everything works on Ubuntu, and on SLES from inside the bootstrapper container.

My guess is that Docker has trouble mapping uid/gid between the host and the container. Indeed, the ece user and group do not exist on the host, so elastic does not belong to the ece group on the host.
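To compare a host against the container, a minimal sketch that checks a passwd/group pair for the entries shown above (ece with uid/gid 199, elastic as a member of group ece). The function name `check_ece_ids` is hypothetical, and it reads the flat files only, so it says nothing about what NSS or a caching daemon would answer.

```shell
#!/bin/sh
# Sketch: check passwd/group files for the ece account shown in this issue.
# Usage: check_ece_ids [passwd_file] [group_file]
# Prints what is missing and returns non-zero if anything is.
check_ece_ids() {
    passwd_file=${1:-/etc/passwd}
    group_file=${2:-/etc/group}
    status=0

    # passwd format: name:password:uid:gid:gecos:home:shell
    if ! awk -F: '$1 == "ece" && $3 == 199 && $4 == 199 { found = 1 }
                  END { exit !found }' "$passwd_file"; then
        echo "missing: user ece with uid 199 and gid 199"
        status=1
    fi

    # group format: name:password:gid:comma-separated members
    if ! awk -F: '$1 == "ece" && $3 == 199 { found = 1 }
                  END { exit !found }' "$group_file"; then
        echo "missing: group ece with gid 199"
        status=1
    fi

    # elastic has to be listed as a member of group ece
    if ! awk -F: '$1 == "ece" {
                      n = split($4, members, ",")
                      for (i = 1; i <= n; i++) if (members[i] == "elastic") found = 1
                  }
                  END { exit !found }' "$group_file"; then
        echo "missing: elastic in group ece"
        status=1
    fi

    return $status
}
```

Run `check_ece_ids` with no arguments on the host to inspect /etc/passwd and /etc/group, or pass copies of the container's files to confirm they match.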

Workaround

On the host, create a user and a group named ece, both with id 199, and add the elastic user to the ece group. Then run the ECE installer, and it should work.
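The workaround could be scripted roughly as follows. This is a sketch, not something taken from the installer: the helper names `run` and `create_ece_ids` and the `APPLY` variable are made up for illustration, and it assumes the standard shadow-utils tools (groupadd, useradd, usermod) and an existing elastic user. By default it only prints the commands.

```shell
#!/bin/sh
# Print each command by default; execute it only when APPLY=1 (needs root).
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Create group and user ece with gid/uid 199, then add elastic to group ece.
create_ece_ids() {
    run groupadd -g 199 ece
    run useradd -u 199 -g 199 ece
    run usermod -a -G ece elastic
}
```

Calling `create_ece_ids` shows the three commands; `APPLY=1 create_ece_ids` as root applies them before running the ECE installer.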

obierlaire commented 2 years ago

Since 2.13, with https://github.com/elastic/cloud/pull/82702, we mount /run into the runner container.

Also, even if we work around the uid/gid problem (cf. description), I noticed that the runner cannot talk to ZooKeeper and so is still not detected as running. If you log into the runner container, hosts no longer resolve:

$ ping containerhost
ping: containerhost: Name or service not known

Yet containerhost is listed in /etc/hosts:

[root@831b834c4693 /]# cat /etc/hosts
127.0.0.1   localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.31.3.6  containerhost
172.17.0.1  831b834c4693

I noticed that the bootstrapper has mounted /run into the container since 2.13, and if I start the runner container manually without mounting /run, hosts resolve correctly.
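To make the mismatch easier to spot from inside the container, here is a small sketch that compares what the hosts file lists for a name with what the system resolver returns. The function names are hypothetical, and it assumes `getent` is available in the container image.

```shell
#!/bin/sh
# Print the address a hosts file lists for a name; non-zero exit if not listed.
# Usage: hosts_file_lookup name [hosts_file]
hosts_file_lookup() {
    name=$1
    hosts_file=${2:-/etc/hosts}
    awk -v name="$name" '
        { sub(/#.*/, "") }   # drop comments
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; found = 1; exit } }
        END { exit !found }
    ' "$hosts_file"
}

# Compare the hosts file entry with the resolver's answer for the same name.
# Fails only when the file lists the name but the resolver returns nothing.
compare_resolution() {
    name=$1
    file_addr=$(hosts_file_lookup "$name" "${2:-/etc/hosts}") || file_addr=""
    nss_addr=$(getent hosts "$name" | awk '{ print $1; exit }') || nss_addr=""
    echo "hosts file: ${file_addr:-none}, resolver: ${nss_addr:-none}"
    [ -z "$file_addr" ] || [ -n "$nss_addr" ]
}
```

Running `compare_resolution containerhost` in the runner container, once with /run mounted and once without, should show whether the mount is what makes the two answers diverge.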

obierlaire commented 2 years ago

A workaround that seems to fix both the uid/gid problem and the /etc/hosts problem is disabling or uninstalling nscd: https://github.com/elastic/ansible-elastic-cloud-enterprise/pull/156
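Outside of Ansible, the same idea might look like the sketch below. It does not reproduce the contents of the linked PR: the helper names and the `APPLY` variable are invented for illustration, and it assumes a systemd host where the service and package are both called nscd, with zypper as the package manager (as on SLES). It prints the commands unless `APPLY=1` is set.

```shell
#!/bin/sh
# Print each command by default; execute it only when APPLY=1 (needs root).
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Stop and disable nscd; pass "remove" to also uninstall the package.
disable_nscd() {
    run systemctl disable --now nscd
    if [ "${1:-}" = "remove" ]; then
        run zypper --non-interactive remove nscd
    fi
}
```

`disable_nscd` shows the disable step only, while `APPLY=1 disable_nscd remove` as root would also uninstall the package.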