Openshift 3.10 pod has unbound PersistentVolumeClaims (GlusterFS)

jburleson commented 6 years ago

Eclipse Che: Nightly (currently 6.14.0 -- pulled this morning)
OKD/Openshift 3.10
Dynamic Storage Provisioning with GlusterFS
GlusterFS storage class is set as default.

When creating a workspace, the workspace always fails the first time with:

"Unrecoverable event occurred: 'FailedScheduling', 'pod has unbound PersistentVolumeClaims (repeated 3 times)'"

If I wait for about 5-10 seconds and then start the container again, it starts up fine.

I can watch the workspace as it is created in Openshift and I can see the PVCs. They are just taking about 5-10 seconds to be created and bound. The Openshift documentation (https://docs.okd.io/3.10/install_config/persistent_storage/persistent_storage_glusterfs.html#considerations-volume-ops) mentions that extra time might have to be allowed for the volume to be created and bound to the corresponding PVC.

Would it be possible to add or increase the delay in workspace creation? The PVCs are being created, just not as quickly as Che would like.

ghost commented 6 years ago

@jburleson timeout should be 3-5 mins. Or are you saying that workspace start is terminated after the first such event, but the PVC eventually gets bound after a few attempts?

jburleson commented 6 years ago

The workspace start terminates within the first 20 seconds. The error message indicates that it tried to bind the PVC three times. I can check the workspace in the Openshift console and the PVCs are created and bound, just not quickly enough for the workspace startup.

Here is what I am doing.

1) Create workspace
2) Start workspace
3) Workspace stops within 20 seconds with failure: "Unrecoverable event occurred: 'FailedScheduling', 'pod has unbound PersistentVolumeClaims (repeated 3 times)'"
4) Wait 5-10 seconds and click retry.
5) Workspace starts up correctly.

This only happens when the workspace is first created. After that it always starts up fine.

ghost commented 6 years ago

@amisevsk @ibuziuk do we terminate workspace start immediately after an unrecoverable event is caught?

amisevsk commented 6 years ago

@eivantsov Yes I believe any unrecoverable event causes the workspace start to be immediately cancelled.

I think by default failure to get PVC is viewed as completely unrecoverable, although this may not be the case. I'm not sure what the retry looks like on the OpenShift side. Part of the issue though is that I think multiple different OpenShift issues get lumped in to one event, and so some recoverable events are treated as unrecoverable because they look like a "failed mount" or something.

jburleson commented 6 years ago

Would it be possible to increase the wait time for the pvc to bind? Once the workspace creation starts the pvc does continue the bind process after the error message. That is why I can wait 5-10 seconds and then retry without any issues. By that time the claim has bound and the creation process completes.

amisevsk commented 6 years ago

@jburleson I think the issue is that OpenShift is creating the failed to mount event, and Che is set to treat that as an unrecoverable error. I'm not sure if there's an option on OpenShift to change the timeout for the failed scheduling event.

What you could try is changing which OpenShift events are viewed as unrecoverable: there's a che property that lists which OpenShift/Kubernetes events will cause workspace start to fail. You can override it by setting an environment variable on the server deployment. Could you try something like

CHE_INFRA_KUBERNETES_WORKSPACE__UNRECOVERABLE__EVENTS="FailedMount,MountVolume.SetUp failed,Failed to pull image"

(note the double _ chars -- see the docs for more information).

The downside with doing this is that legitimate blockers for workspace launch won't be picked up until the standard 5 minute timeout for workspace launch, which could potentially be a slightly worse user experience.
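For anyone puzzled by the double underscores in the variable name above, here is a rough sketch of the naming convention as I understand it: a `.` in the property name becomes `_`, and a literal `_` becomes `__`. The property name used below is only inferred from the environment variable, and the helper `che_property_to_env` is made up for illustration, so check the docs before relying on it.

```python
def che_property_to_env(prop: str) -> str:
    """Map a dotted property name to an environment variable name.

    Assumed convention (verify against the Che docs):
      "_" in the property becomes "__", "." becomes "_", result is upper-cased.
    """
    # Escape literal underscores first so they are not confused with
    # the underscores that replace dots in the next step.
    return prop.replace("_", "__").replace(".", "_").upper()


# Assumption: property name inferred from the env var quoted in this thread.
prop = "che.infra.kubernetes.workspace_unrecoverable_events"
print(che_property_to_env(prop))
```

Under that assumed convention, the output should be the same variable name as the one suggested above.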

jburleson commented 6 years ago

@akervern Is Openshift issuing the failed to mount, or is Che checking to see if the mount is ready and then failing since it has not bound yet? From the Openshift side, the bind does not fail. Once Che requests the PVC, it does get created in Openshift without any errors. From the Openshift web console I can see the project space being created and the PVC being created. When the Che workspace start terminates, the PVC bind process is still running.

amisevsk commented 6 years ago

@jburleson I'm not entirely certain on this, but in my experience OpenShift will issue events even if the process is ongoing. It is likely a non-critical event that OpenShift tried to bind the PVC, but it is not ready, so it will retry in a little while. I consistently see similar behaviour if e.g. I'm trying to use an image that hasn't been pushed to a registry yet.

This might just be an issue in how we handle unrecoverable events, but I don't know if there is a way to distinguish on the OpenShift API side which events are true fails and which are temporary failures.

jburleson commented 6 years ago

@amisevsk Gotcha. I was not sure either. This is the first OpenShift cluster I have deployed so I am still learning about it.

I will close this issue.

Thanks for your patience.

amisevsk commented 6 years ago

@jburleson no worries, I'm happy to help! Did the env var help solve your problem?

jburleson commented 6 years ago

@amisevsk Yes, removing FailedScheduling from the list does allow the workspace creation to complete.

Loading...
pod has unbound PersistentVolumeClaims (repeated 3 times)
pod has unbound PersistentVolumeClaims (repeated 3 times)
pod has unbound PersistentVolumeClaims (repeated 3 times)
pod has unbound PersistentVolumeClaims (repeated 3 times)
Successfully assigned workspacea6i4vdo58z8tec8a.dockerimage-77fdf64fc7-2bnsm to 
pulling image "jburleson/che_stacks:gcc_clang"
Successfully pulled image "jburleson/che_stacks:gcc_clang"
Created container
Started container

I am talking with some of the other faculty who will be using this and we are considering leaving FailedScheduling off the list. With it off, the students would not have to do anything, but if we add it back then they will always have to click on retry. While it does mean that other failures will not cause an immediate failure, we think this might be an acceptable trade-off.
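To make the trade-off concrete, here is a minimal sketch of how a comma-separated list like the one in the environment variable could decide whether an event aborts the workspace start. This is not Che's actual code (Che is written in Java). The `Event` type, the helper names, the rule of matching on the start of the reason or the message, and the list shown with FailedScheduling included are all assumptions for illustration.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    """Hypothetical, simplified view of a Kubernetes/OpenShift event."""
    reason: str
    message: str


def parse_unrecoverable(value: str) -> List[str]:
    # Split the comma-separated setting and drop empty entries.
    return [item.strip() for item in value.split(",") if item.strip()]


def is_unrecoverable(event: Event, patterns: List[str]) -> bool:
    # Assumed rule: an event is fatal if its reason or its message
    # starts with any configured entry.
    return any(
        event.reason.startswith(p) or event.message.startswith(p)
        for p in patterns
    )


# Assumed list with FailedScheduling included, versus the list from this thread.
with_scheduling = "FailedMount,FailedScheduling,MountVolume.SetUp failed,Failed to pull image"
without_scheduling = "FailedMount,MountVolume.SetUp failed,Failed to pull image"

event = Event(
    reason="FailedScheduling",
    message="pod has unbound PersistentVolumeClaims (repeated 3 times)",
)

print(is_unrecoverable(event, parse_unrecoverable(with_scheduling)))
print(is_unrecoverable(event, parse_unrecoverable(without_scheduling)))
```

With FailedScheduling in the list, the unbound-PVC event would be treated as fatal right away. Without it, the event is ignored and the start continues until the PVC binds or the overall workspace start timeout is reached.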

eclipse-che / che

Openshift 3.10 pod has unbound PersistentVolumeClaims (GlusterFS) #11781