eclipse-che / che

Kubernetes based Cloud Development Environments for Enterprise Teams
http://eclipse.org/che
Eclipse Public License 2.0

postgres data gone after minikube node reboot #15065

Closed gattytto closed 4 years ago

gattytto commented 5 years ago

Describe the bug

After rebooting the minikube node hosting a Che environment, the postgres pod's /var/lib/pgsql/data is gone, and the postgres and keycloak pods go into BackOff.

Che version

Steps to reproduce

Anything that causes the minikube node to reboot, whether gracefully or through a hard reset.

Expected behavior

I expect the Che context to be brought back up, with the postgres and keycloak pods loading the pre-existing database, until I decide to issue chectl:delete.

Runtime

Screenshots

Installation method

Environment

Additional context

The PersistentVolume that chectl creates for postgres should use a path beginning with /data, so that minikube does not erase its content on a node hard reset.

When the hostPath "path:" field is left empty in a PersistentVolume definition, minikube's default StorageClass implementation uses /tmp/hostpath-provisioner/ as the folder, which gets emptied on reboot according to https://minikube.sigs.k8s.io/docs/reference/persistent_volumes/
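A minimal sketch of a volume definition with an explicit path under /data, written as a Python script that prints the manifest as JSON (which kubectl also accepts). The volume name, size, reclaim policy, and the /data/che/postgres path are assumptions for illustration only; they are not what chectl or che-operator actually generate.

```python
import json

# Hypothetical values for illustration; not taken from chectl or che-operator.
PV_NAME = "postgres-data-pv"
HOST_PATH = "/data/che/postgres"  # under /data, which this issue expects to survive reboots


def build_postgres_pv(name: str, host_path: str, size: str = "1Gi") -> dict:
    """Return a hostPath PersistentVolume manifest with an explicit path."""
    if not host_path.startswith("/data/"):
        # Refuse paths outside /data so the volume cannot silently land in /tmp.
        raise ValueError(f"{host_path!r} is not under /data")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Retain",
            "hostPath": {"path": host_path, "type": "DirectoryOrCreate"},
        },
    }


manifest = build_postgres_pv(PV_NAME, HOST_PATH)

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be piped to kubectl apply -f - on a test cluster. The path check is only a guard for this sketch, not something Kubernetes enforces.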

If this gets sorted out, I could go on and run test scenarios for the workspace pods too.

$ kubectl get pv pvc-90a86e5a-a7d8-43b5-9bae-9e1064f9df0b -o yaml


apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    hostPathProvisionerIdentity: 47e548c5-fca5-11e9-9417-02427d267bb8
    pv.kubernetes.io/provisioned-by: k8s.io/minikube-hostpath
  creationTimestamp: "2019-11-01T15:56:33Z"
  finalizers:

gattytto commented 5 years ago

https://github.com/kubernetes/minikube/issues/3582#issuecomment-459964039

ibuziuk commented 5 years ago

This looks related to disaster recovery - https://github.com/eclipse/che/issues/14240 @gattytto thanks for reporting; it looks like you did a pretty good analysis. Would you be interested in contributing a fix?

gattytto commented 5 years ago

@ibuziuk yes, partially. I'm in the testing phase, but it can be done.

gattytto commented 5 years ago

I need some help, please. I will provide reproduction steps. First of all, this is specific to a minikube+chectl deployment of Che.

So far I have made code changes in https://github.com/gattytto/che-operator and started the deployment using:

chectl server:start -m -p minikube --che-operator-image=quay.io/gattytto/che-operator:latest -t /usr/local/lib/chectl/templates

One part of the change is to the controller code, which now adds the PersistentVolume. There is also a StorageClass in https://github.com/gattytto/che-operator/blob/master/deploy/storageclass.yaml, which I had to add to the cluster with kubectl because for some reason the dashboard doesn't accept it (command-line kubectl does). The storage class is hardcoded in both the PersistentVolumeClaim (PVC) and the PersistentVolume (PV), because a PVC created without a specific storage class gets the standard one and a PV gets none. I see the argument to use a specific storage class, but for the time being I just hardcoded it.
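To illustrate why the class has to be set on both objects, here is a small Python sketch of the matching rule described above. It is a simplified model, not Kubernetes code: it assumes the cluster fills in a default class ("standard" on minikube) for claims that omit storageClassName, and that a volume without the field has no class. The class name "che-local" is hypothetical.

```python
from typing import Optional

DEFAULT_CLASS = "standard"  # name of minikube's default StorageClass


def effective_pvc_class(pvc: dict, default: Optional[str] = DEFAULT_CLASS) -> str:
    """Class a claim ends up with; an omitted field is filled with the default."""
    spec = pvc.get("spec", {})
    if "storageClassName" in spec:
        return spec["storageClassName"]
    return default or ""


def pv_class(pv: dict) -> str:
    """A volume without storageClassName has no class (empty string)."""
    return pv.get("spec", {}).get("storageClassName", "")


def classes_match(pv: dict, pvc: dict) -> bool:
    """A claim can only bind to a volume of the same class."""
    return pv_class(pv) == effective_pvc_class(pvc)


# Neither object names a class: the claim gets "standard", the volume gets none.
unset_pv = {"spec": {}}
unset_pvc = {"spec": {}}

# Both objects hardcode the same (hypothetical) class name.
pinned_pv = {"spec": {"storageClassName": "che-local"}}
pinned_pvc = {"spec": {"storageClassName": "che-local"}}

if __name__ == "__main__":
    print("unset:", classes_match(unset_pv, unset_pvc))
    print("pinned:", classes_match(pinned_pv, pinned_pvc))
```

In this model the unset pair does not match while the pinned pair does, which is the mismatch the hardcoding works around.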

The chectl YAML files role.yaml and cluster-role.yaml needed the persistentvolumes resource added. I edited the ones in https://github.com/gattytto/che-operator/blob/master/deploy/role.yaml and /cluster-role.yaml respectively and copied them to /usr/local/lib/chectl/templates/che-operator/ so that chectl uses them when starting the deployment.
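As a rough sketch of that kind of change, the following Python snippet adds a persistentvolumes rule to a role definition held as a dictionary. The verb list and the role name are assumptions for illustration; the actual rules in the linked role.yaml and cluster-role.yaml may differ.

```python
import copy

# Assumed rule shape; the verbs are a guess at what an operator managing PVs needs.
PV_RULE = {
    "apiGroups": [""],  # "" is the core API group, where persistentvolumes live
    "resources": ["persistentvolumes"],
    "verbs": ["get", "list", "watch", "create", "delete"],
}


def with_pv_access(role: dict) -> dict:
    """Return a copy of the role with a persistentvolumes rule, added at most once."""
    updated = copy.deepcopy(role)
    rules = updated.setdefault("rules", [])
    already = any("persistentvolumes" in r.get("resources", []) for r in rules)
    if not already:
        rules.append(copy.deepcopy(PV_RULE))
    return updated


# Hypothetical starting point, not the real che-operator cluster role.
cluster_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "che-operator-example"},
    "rules": [
        {"apiGroups": [""], "resources": ["persistentvolumeclaims"], "verbs": ["get", "list"]},
    ],
}

patched = with_pv_access(cluster_role)
```

Because PersistentVolumes are cluster-scoped, it is the cluster role that grants access to them.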

I manually created the /data/minikube folder and set its permissions to 777. The operator startup process then creates the subfolder "userdata", which holds the postgres db files and has the expected ownership (UID=26, GID=26). THIS PART IS IMPORTANT: the PersistentVolume type is DirectoryOrCreate, and when minikube uses the vm-driver=none option (running inside an LXC container) it runs as root, so the minikube directory inside /data would be created with root:root ownership. That's why I pre-created it and set the permissions to 777. This will be fixable from code once the minikube team implements the "mountoptions" property for PersistentVolumes in minikube.
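The manual preparation step can be sketched in Python as follows. The helper name is hypothetical, and on a real node it would have to run as root against /data/minikube before the operator starts.

```python
import os
import stat


def prepare_base_dir(path: str, mode: int = 0o777) -> int:
    """Create the directory if needed, open up its permissions, return the resulting mode."""
    os.makedirs(path, exist_ok=True)
    # chmod explicitly: the mode passed to makedirs would be filtered by the umask.
    os.chmod(path, mode)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Path from the comment above; requires root on the minikube node.
    print(oct(prepare_base_dir("/data/minikube")))
```

With the base directory world-writable, a process running as UID 26 can create its own subfolder inside it even though root owns the parent.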

Part of the process completes, but it gets stuck before deploying the plugin registry. I don't know why, and I also don't know how to further debug or test why the operator is stopping the deployment process. As seen in the screenshot, what I CAN be sure of is that both the keycloak and postgres pods are started and healthy. I have also accessed the keycloak-che URL and successfully logged in as admin:admin.

image

image

image

gattytto commented 5 years ago

And it works: after a hard reset of the LXC container, at least what had been started comes back.

image

sleshchenko commented 5 years ago

@gattytto Could you share the che-operator logs? AFAIK che-operator does some exec in keycloak; maybe that failed.

gattytto commented 5 years ago

I have finished the code modifications to persist postgres data and it works.

After a hard reset of the LXC container, postgres, keycloak and che come back.

As for workspaces: they don't, because their storage got deleted by minikube.

image

gattytto commented 4 years ago

It seems that PersistentVolumeClaim provisioning is split in half for the Kubernetes use case: che-operator provisions the postgres-data volume, while che-server follows the config values set in volumeclaimStrategy and uses Java code to create the volumes for the workspaces. Could this be moved to the che-operator golang code instead?

simha369 commented 4 years ago

I am still facing the same issue: the Postgres data on the persistent volume is lost after minikube stop. Is there a solution for this problem? If so, please share it. If this works in an earlier minikube version, please share which version. I am seeing the issue with minikube version v1.5.2.

gattytto commented 4 years ago

@simha369 No, there's no fix, but I have filed a feature request: https://github.com/eclipse/che/issues/15157. You can patch the che-operator code to persist your postgres database and general info (like ssh keys?) from your dev env, but after a hard reset you would still need to recreate (delete and create again) the workspaces from your devfile registry or using factories. So, depending on what you need to persist, there either is or is not a workaround for the moment.

AndrienkoAleksandr commented 4 years ago

@gattytto Please join the review: https://github.com/eclipse/che-operator/pull/144

tolusha commented 4 years ago

@gattytto Do you think we can close the issue?

gattytto commented 4 years ago

I'm very happy to say yes