coreos / vault-operator

Run and manage Vault on Kubernetes simply and securely
https://coreos.com/blog/introducing-vault-operator-project
Apache License 2.0

Vault cluster doesn't start on Macbook->Openshift->myproject #315

Open rawipfel opened 6 years ago

rawipfel commented 6 years ago

Hi, I'm trying out the vault-operator and etcd-operator on a MacBook running Docker 18.03.1-ce-mac65 (24312) and OpenShift Origin v3.9.0. Starting from a clean installation and the master branch of the vault-operator and etcd-operator repos:

Roberts-MacBook-Pro:Desktop rwipfel$ ./runVault.sh
++ oc login -u system:admin
Logged into "https://127.0.0.1:8443" as "system:admin" using existing credentials.

You have access to the following projects and can switch between them with 'oc project <projectname>':

    default
    kube-public
    kube-system
  * myproject
    openshift
    openshift-infra
    openshift-node
    openshift-web-console

Using project "myproject".
++ oc patch scc restricted -p '{"fsGroup":{"type":"RunAsAny"}}'
securitycontextconstraints "restricted" patched
++ oc patch scc restricted -p '{"runAsUser":{"type":"RunAsAny"}}'
securitycontextconstraints "restricted" patched
++ cd /Users/rwipfel/git/etcd-operator/
++ example/rbac/create_role.sh --namespace=myproject
Creating role with ROLE_NAME=etcd-operator, NAMESPACE=myproject
clusterrole.rbac.authorization.k8s.io "etcd-operator" created
Creating role binding with ROLE_NAME=etcd-operator, ROLE_BINDING_NAME=etcd-operator, NAMESPACE=myproject
clusterrolebinding.rbac.authorization.k8s.io "etcd-operator" created
++ cd /Users/rwipfel/git/vault-operator/
++ sed -e 's/<namespace>/myproject/g' -e 's/<service-account>/default/g' example/rbac-template.yaml
++ kubectl create -f example/rbac.yaml
role.rbac.authorization.k8s.io "vault-operator-role" created
rolebinding.rbac.authorization.k8s.io "vault-operator-rolebinding" created
++ kubectl create -f example/etcd_crds.yaml
customresourcedefinition.apiextensions.k8s.io "etcdclusters.etcd.database.coreos.com" created
customresourcedefinition.apiextensions.k8s.io "etcdbackups.etcd.database.coreos.com" created
customresourcedefinition.apiextensions.k8s.io "etcdrestores.etcd.database.coreos.com" created
++ kubectl create -f example/etcd-operator-deploy.yaml
deployment.extensions "etcd-operator" created
++ kubectl create -f example/vault_crd.yaml
customresourcedefinition.apiextensions.k8s.io "vaultservices.vault.security.coreos.com" created
++ kubectl create -f example/deployment.yaml
deployment.extensions "vault-operator" created
++ sleep 5
++ kubectl get deploy
NAME             DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
etcd-operator    1         1         1            0           6s
vault-operator   1         1         1            0           6s
++ kubectl create -f example/example_vault.yaml
vaultservice.vault.security.coreos.com "example" created
++ sleep 5
++ kubectl get pods
NAME                              READY     STATUS              RESTARTS   AGE
etcd-operator-7bf6b58cdf-j5sk2    3/3       Running             0          12s
vault-operator-67d5846657-bcsd2   0/1       ContainerCreating   0          12s

Roberts-MacBook-Pro:Desktop rwipfel$ kubectl get pods
NAME                              READY     STATUS    RESTARTS   AGE
etcd-operator-7bf6b58cdf-j5sk2    3/3       Running   0          40s
example-etcd-mf52q4mwlr           1/1       Running   0          9s
example-etcd-tvglk9h5fk           1/1       Running   0          25s
vault-operator-67d5846657-bcsd2   1/1       Running   0          40s

There isn't anything obviously wrong in the logs. The etcd cluster is running properly.

Roberts-MacBook-Pro:Desktop rwipfel$ kubectl logs vault-operator-67d5846657-bcsd2
time="2018-05-03T14:23:25Z" level=info msg="Go Version: go1.9.2"
time="2018-05-03T14:23:25Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-05-03T14:23:25Z" level=info msg="vault-operator Version: 0.1.9"
time="2018-05-03T14:23:25Z" level=info msg="Git SHA: 43a1dd7"
ERROR: logging before flag.Parse: I0503 14:23:25.710514       1 leaderelection.go:174] attempting to acquire leader lease...
ERROR: logging before flag.Parse: I0503 14:23:25.724311       1 leaderelection.go:184] successfully acquired lease myproject/vault-operator
time="2018-05-03T14:23:25Z" level=info msg="Event(v1.ObjectReference{Kind:\"Endpoints\", Namespace:\"myproject\", Name:\"vault-operator\", UID:\"8abd113d-4edd-11e8-9c89-025000000001\", APIVersion:\"v1\", ResourceVersion:\"1477\", FieldPath:\"\"}): type: 'Normal' reason: 'LeaderElection' vault-operator-67d5846657-bcsd2 became leader"
time="2018-05-03T14:23:25Z" level=info msg="starting Vaults controller"
time="2018-05-03T14:23:25Z" level=info msg="Vault CR (myproject/example) is created"
Roberts-MacBook-Pro:Desktop rwipfel$ kubectl logs etcd-operator-7bf6b58cdf-j5sk2 etcd-operator
time="2018-05-03T14:23:21Z" level=info msg="etcd-operator Version: 0.8.3"
time="2018-05-03T14:23:21Z" level=info msg="Git SHA: 85c37511"
time="2018-05-03T14:23:21Z" level=info msg="Go Version: go1.9.2"
time="2018-05-03T14:23:21Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-05-03T14:23:21Z" level=info msg="Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"myproject", Name:"etcd-operator", UID:"887acaa2-4edd-11e8-9c89-025000000001", APIVersion:"v1", ResourceVersion:"1428", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' etcd-operator-7bf6b58cdf-j5sk2 became leader"
2018-05-03 14:23:27.078742 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
time="2018-05-03T14:23:27Z" level=info msg="creating cluster with Spec:" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="{" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="    "size": 3," cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="    "repository": "quay.io/coreos/etcd"," cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="    "version": "3.2.13"," cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="    "pod": {" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="        "resources": {}," cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="        "etcdEnv": [" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="            {" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="                "name": "ETCD_AUTO_COMPACTION_RETENTION"," cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="                "value": "1"" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="            }" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="        ]" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="    }," cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="    "TLS": {" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="        "static": {" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="            "member": {" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="                "peerSecret": "example-etcd-peer-tls"," cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="                "serverSecret": "example-etcd-server-tls"" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="            }," cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="            "operatorSecret": "example-etcd-client-tls"" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="        }" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="    }" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="}" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="cluster created with seed member (example-etcd-tvglk9h5fk)" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:27Z" level=info msg="start running..." cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:35Z" level=info msg="skip reconciliation: running ([]), pending ([example-etcd-tvglk9h5fk])" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:43Z" level=info msg="Start reconciling" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:43Z" level=info msg="running members: example-etcd-tvglk9h5fk" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:43Z" level=info msg="cluster membership: example-etcd-tvglk9h5fk" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:43Z" level=info msg="added member (example-etcd-mf52q4mwlr)" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:43Z" level=info msg="Finish reconciling" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:51Z" level=info msg="Start reconciling" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:51Z" level=info msg="running members: example-etcd-tvglk9h5fk,example-etcd-mf52q4mwlr" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:51Z" level=info msg="cluster membership: example-etcd-tvglk9h5fk,example-etcd-mf52q4mwlr" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:51Z" level=info msg="Finish reconciling" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:51Z" level=error msg="failed to reconcile: fail to add new member (example-etcd-gs9vq5sjs5): etcdserver: unhealthy cluster" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:59Z" level=info msg="Start reconciling" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:59Z" level=info msg="running members: example-etcd-mf52q4mwlr,example-etcd-tvglk9h5fk" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:59Z" level=info msg="cluster membership: example-etcd-mf52q4mwlr,example-etcd-tvglk9h5fk" cluster-name=example-etcd pkg=cluster
time="2018-05-03T14:23:59Z" level=info msg="added member (example-etcd-8xsqs4nc8j)" cluster-name=example-etcd pkg=cluster

I'm not sure where to look next.

(As a guess I tried creating custom TLS certificates per https://github.com/coreos/vault-operator/blob/master/doc/user/tls_setup.md but that made no difference)

I'd be grateful for any help, and willing to contribute once I learn more about how to operate these operators :)

hasbro17 commented 6 years ago

@rawipfel Can you check whether the Vault Deployment named "example" has been created by the vault-operator? If it has, then the issue is the restricted SCC rejecting the Vault Deployment's pods.

Currently the vault-operator configures Vault containers with the IPC_LOCK capability. https://github.com/coreos/vault-operator/blob/master/pkg/util/k8sutil/vault.go#L167-L173

The restricted SCC does not allow pods with this capability. If you check the Deployment status for the example Vault deployment you should be able to see the pods being rejected.
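
One way to check this from the command line is sketched below. It assumes the Deployment is named example in the myproject namespace, as in the output earlier in this thread; the helper name show_rejection is made up for illustration. A Deployment normally reports a ReplicaFailure condition, and its ReplicaSet emits FailedCreate events, when pods cannot be created, but the exact message text depends on the cluster version.

```shell
# Hypothetical helper: print why a Deployment's pods are not being created.
# Usage: show_rejection <deployment> [namespace]
show_rejection() {
  deploy="$1"
  ns="${2:-myproject}"   # assumption: the namespace used in this thread

  # The Deployment carries a ReplicaFailure condition when its ReplicaSet
  # cannot create pods (for example, when an SCC forbids them).
  kubectl -n "$ns" get deploy "$deploy" \
    -o jsonpath='{.status.conditions[?(@.type=="ReplicaFailure")].message}'
  echo

  # The ReplicaSet's FailedCreate events usually hold the full rejection text.
  kubectl -n "$ns" get events | grep FailedCreate || true
}
```

Calling show_rejection example should then print the condition message followed by any matching events.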

Can you try updating the restricted SCC to grant it the IPC_LOCK capability and then try again:

kind: SecurityContextConstraints
apiVersion: v1
metadata:
  name: restricted
  ...
allowedCapabilities:
- IPC_LOCK
...

However, this is just a workaround, since changing the restricted SCC is not a good idea.
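
For a throwaway dev cluster, the grant can be applied and later undone with oc patch, roughly as sketched here. The function names are hypothetical. The sketch assumes allowedCapabilities was empty before the change: the merge patch replaces the whole list, and the revert removes the field entirely, so any capabilities already listed there would be lost.

```shell
# Grant IPC_LOCK on the restricted SCC (one-line equivalent of the YAML edit).
# Note: this overwrites any existing allowedCapabilities list.
grant_ipc_lock() {
  oc patch scc restricted -p '{"allowedCapabilities":["IPC_LOCK"]}'
}

# Print the capabilities the restricted SCC currently allows.
show_allowed_caps() {
  oc get scc restricted -o jsonpath='{.allowedCapabilities}'
  echo
}

# Undo the grant by removing the field with a JSON patch.
# Assumption: the list was empty before grant_ipc_lock was run.
revoke_ipc_lock() {
  oc patch scc restricted --type=json \
    -p '[{"op":"remove","path":"/allowedCapabilities"}]'
}
```

Running show_allowed_caps before and after each step is a quick way to confirm what the SCC currently permits.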

A more proper solution to this issue is to either:
a) Remove the need for IPC_LOCK (https://github.com/coreos/vault-operator/issues/311), but that needs more thought.
b) Make the service account for the Vault pods configurable via the VaultService CR's spec.PodPolicy, so that they can use a dedicated service account and an SCC that allows the IPC_LOCK capability.

rawipfel commented 6 years ago

Thanks @hasbro17, that was the problem. The Vault Deployment example wasn't creating any pods:

Roberts-MacBook-Pro:Desktop rwipfel$ kubectl get pod
NAME                              READY     STATUS    RESTARTS   AGE
etcd-operator-7bf6b58cdf-rs9vf    3/3       Running   0          12m
example-etcd-2cvxzp5hzk           1/1       Running   0          11m
example-etcd-89smzphhnl           1/1       Running   0          11m
example-etcd-w8v4mdjcxh           1/1       Running   0          12m
vault-operator-67d5846657-82bwp   1/1       Running   0          12m
Roberts-MacBook-Pro:Desktop rwipfel$ kubectl get deploy
NAME             DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
etcd-operator    1         1         1            1           12m
example          2         0         0            0           11m
vault-operator   1         1         1            1           12m

This is my entire startup script, which now works after allowing IPC_LOCK:

Roberts-MacBook-Pro:Desktop rwipfel$ cat runVault.sh
set -x
oc login -u system:admin
oc patch scc restricted -p '{"fsGroup":{"type":"RunAsAny"}}'
oc patch scc restricted -p '{"runAsUser":{"type":"RunAsAny"}}'
oc patch scc restricted -p '{"allowedCapabilities":["IPC_LOCK"]}'
cd ~/git/etcd-operator/
example/rbac/create_role.sh --namespace=myproject
cd ~/git/vault-operator/
sed -e 's/<namespace>/myproject/g' \
    -e 's/<service-account>/default/g' \
    example/rbac-template.yaml > example/rbac.yaml
kubectl create -f example/rbac.yaml
kubectl create -f example/etcd_crds.yaml
kubectl create -f example/etcd-operator-deploy.yaml
kubectl create -f example/vault_crd.yaml
kubectl create -f example/deployment.yaml
sleep 5 && kubectl get deploy
kubectl create -f example/example_vault.yaml
sleep 5 && kubectl get pods
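
The fixed sleep 5 calls in the script above can race with slow image pulls. A possible alternative is to poll until the operator Deployment reports an available replica, as in this sketch. The helper name wait_for_available, the 2-second interval, and the default 120-second timeout are assumptions for illustration, not part of either operator.

```shell
# Hypothetical helper: wait until a Deployment reports at least one
# available replica, or give up after a timeout (in seconds).
# Usage: wait_for_available <deployment> [timeout]
wait_for_available() {
  deploy="$1"
  timeout="${2:-120}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # availableReplicas is absent (empty output) until a pod is available.
    avail=$(kubectl get deploy "$deploy" -o jsonpath='{.status.availableReplicas}')
    if [ "${avail:-0}" -ge 1 ]; then
      return 0
    fi
    sleep 2
    elapsed=$((elapsed + 2))
  done
  echo "timed out waiting for deployment $deploy" >&2
  return 1
}
```

For example, wait_for_available vault-operator && kubectl create -f example/example_vault.yaml would create the Vault CR only once the operator is up.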

Many thanks, it's working now :)

Roberts-MacBook-Pro:Desktop rwipfel$ kubectl get pods
NAME                              READY     STATUS    RESTARTS   AGE
etcd-operator-7bf6b58cdf-xf6xp    3/3       Running   0          2m
example-5f68dbcdf4-29jqf          1/2       Running   0          55s
example-5f68dbcdf4-l9glp          1/2       Running   0          55s
example-etcd-2vcphl4hkr           1/1       Running   0          1m
example-etcd-7wn782cn29           1/1       Running   0          1m
example-etcd-cb8kqnjrpz           1/1       Running   0          1m
vault-operator-67d5846657-mhq6q   1/1       Running   0          1m

I guess #311 is a question of demo/dev/eval vs. production deployment. It seems reasonable to document the workaround for demo/dev/eval, but require IPC_LOCK by default for secure production deployments. Agree that changing the restricted SCC isn't a good idea, and maybe there will be other reasons for configurable service accounts in future...

rawipfel commented 6 years ago

Hi @hasbro17, I will submit a PR to update the README with a description of the above workaround, if that's an acceptable way to resolve this. Please let me know.