eclipse-che / che

Kubernetes based Cloud Development Environments for Enterprise Teams
http://eclipse.org/che
Eclipse Public License 2.0

Unable to start a workspace on fresh minikube/che install [v7.8.0] #16003

Closed Jarthianur closed 4 years ago

Jarthianur commented 4 years ago

Describe the bug

I've created a local minikube cluster following the documentation (https://www.eclipse.org/che/docs/che-7/running-che-locally/) and deployed che to it using chectl. When creating and opening a workspace from any stack, the following log appears until the start fails with a timeout.

Successfully assigned che/workspacevljc0zfnpzuh4u76.che-plugin-broker to minikube
Pulling image "quay.io/eclipse/che-plugin-metadata-broker:v3.1.0"
Successfully pulled image "quay.io/eclipse/che-plugin-metadata-broker:v3.1.0"
Created container che-plugin-metadata-broker-v3-1-0
Started container che-plugin-metadata-broker-v3-1-0
Error: Failed to run the workspace: "Plugins installation process timed out"

I've put some further investigation into the "additional context" section.

Che version

  • 7.8.0

Steps to reproduce

  1. Spin up minikube: minikube start --vm-driver=kvm2 --memory=6144 --cpus=6 --disk-size=32gb --dns-domain='kube.lab'
  2. Start che: chectl server:start --platform minikube --domain "kube.lab"
  3. Put the che hostnames into /etc/hosts on the workstation (see the sketch after this list)
  4. Visit the che server and start a workspace
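
For step 3, a minimal /etc/hosts sketch (the address and the exact ingress hostnames are assumptions based on the configmap quoted further below; replace the IP with the output of minikube ip and adjust the names to your deployment):

# /etc/hosts (workstation)
192.168.39.100  che-che.kube.lab
192.168.39.100  plugin-registry-che.kube.lab
192.168.39.100  devfile-registry-che.kube.lab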

Expected behavior

The workspace should start and open correctly.

Runtime

  • minikube (minikube version: v1.7.2, kubectl version: v1.17.2)

Installation method

  • chectl (server:start --platform minikube --domain "kube.lab")

Environment

  • my computer: Linux (Debian Buster and Manjaro, amd64)

Additional context

I've tried this on freshly installed, up-to-date Debian Buster and Manjaro Linux (amd64). Helm, chectl, minikube and all system packages are up to date. The host firewall is turned off for testing. KVM is properly running and configured. Changing the domain, or leaving it at the default, has no effect on the behavior.

The kubernetes dashboard shows all pods/services as running and healthy, but when starting a workspace in che, the workspace pod fails with the following log.

2020/02/12 11:50:31 Broker configuration
2020/02/12 11:50:31   Push endpoint: ws://che-che.kube.lab/api/websocket
2020/02/12 11:50:31   Auth enabled: false
2020/02/12 11:50:31   Runtime ID:
2020/02/12 11:50:31     Workspace: workspacevljc0zfnpzuh4u76
2020/02/12 11:50:31     Environment: default
2020/02/12 11:50:31     OwnerId: che
2020/02/12 11:50:31 Couldn't connect to endpoint 'ws://che-che.kube.lab/api/websocket', due to error 'dial tcp: lookup che-che.kube.lab on 10.96.0.10:53: no such host'

The kube-dns service is running and mapped to two coredns pods. I've created a dnsutils pod in the che namespace to investigate possible DNS issues. From there, neither nslookup nor dig can resolve the che server hostname, no matter which domain is queried.
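
(For reference, such a pod can be created as shown in the Kubernetes DNS debugging guide; the -n che flag is only there to place it in the same namespace. The nslookup output below was taken from inside that pod.)

# kubectl apply -n che -f https://k8s.io/examples/admin/dns/dnsutils.yaml
# kubectl exec -n che -it dnsutils -- sh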

# nslookup che-che.kube.lab
Server:         10.96.0.10
Address:        10.96.0.10#53

** server can't find che-che.kube.lab: NXDOMAIN

I don't know whether this is just a configuration issue, or a fault in che or minikube. As I'm new to kubernetes/che, I don't know where to go from here. Searching the web turns up many issues related to older versions of minikube/che, and their fixes do not work at this point. The che and minikube documentation also provides no hint on this, nor exact configuration steps. So I'm taking the described steps as a correct how-to (at least for first steps), but they simply do not work (no rant, just a perception). It would be great if we could find a solution for this, and maybe extend the documentation with all necessary steps.

IveJ commented 4 years ago

Hi Jarthianur,

Please try to inspect the error by following this guide:

https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
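
For example, that guide suggests checking that the DNS pods are up, looking at their logs, and inspecting the resolver configuration from a test pod (commands taken from the guide; dnsutils is the guide's example pod):

# kubectl get pods -n kube-system -l k8s-app=kube-dns
# kubectl logs -n kube-system -l k8s-app=kube-dns
# kubectl exec -it dnsutils -- cat /etc/resolv.conf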


Jarthianur commented 4 years ago

Ok I think I've found what's causing this. There are multiple issues coming up.

First, even though you can specify a DNS domain to both minikube and chectl, the pods' resolv.conf still points to "cluster.local". This is listed as a known issue. I'm not forced to run a custom domain for now, so I'll stick to the default one. Maybe the che documentation should be updated with a note on this, and on how to actually deploy a custom domain. As far as I understand, one would have to specify a custom resolv.conf file location in the kubernetes deployments, as minikube does not provide this (see the multiple open issues there).
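
This can be checked from any pod in the cluster, e.g. the dnsutils pod from above (expected output abbreviated; note the hardcoded cluster.local search domains):

# kubectl exec -n che -it dnsutils -- cat /etc/resolv.conf
nameserver 10.96.0.10
search che.svc.cluster.local svc.cluster.local cluster.local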

The next problem is that chectl deploys the actual che server as a service named "che-host", but the che config expects it to be named "che-che". As described in the kubernetes DNS docs, a DNS record is added to kube-dns under the service name. Seems obvious, but I only found this out after bypassing kube-dns to my host DNS and hardcoding the name there. Of course this broke the cluster-internal DNS, but host resolution worked then. So I've reverted the DNS settings (minikube DNS indeed works out of the box) and renamed the che-host service to che-che. Now nslookup returns

# nslookup che-che
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   che-che.che.svc.cluster.local
Address: 10.97.119.180

which is correct.
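
A sketch of one way to duplicate the service under the expected name (namespace che assumed; the exported manifest needs its server-assigned fields removed before applying):

# kubectl -n che get service che-host -o yaml > che-che-svc.yaml
# (edit che-che-svc.yaml: set metadata.name to che-che, drop
#  metadata.resourceVersion, metadata.uid and spec.clusterIP)
# kubectl -n che apply -f che-che-svc.yaml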

Still, I cannot create a workspace, due to dial tcp: lookup che-che.cluster.local on 10.96.0.10:53: no such host. But this error seems absolutely legitimate: the host "che-che.cluster.local" does not exist; the service record is "che-che.che.svc.cluster.local". The actual DNS record is correct according to https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#services, while the one requested from the workspace container is not (it is missing "che.svc").
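
In other words, the record scheme from the linked docs is:

<service>.<namespace>.svc.<cluster domain>
che-che.che.svc.cluster.local   <- exists
che-che.cluster.local           <- requested by the workspace, does not exist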

To sum this up:

  • the DNS domain passed to minikube/chectl is not reflected in the pods' resolv.conf, which keeps pointing to cluster.local (known minikube issue)
  • chectl deploys the che server service as "che-host", while the che configuration references "che-che"
  • the workspace containers query "che-che.cluster.local" instead of the actual record "che-che.che.svc.cluster.local" (missing "che.svc")

Please let me know if there is a fault in my investigation. It would also be great if someone could tell me where I could apply my suggestions temporarily to test them.

I hope this helps you to resolve the issue; I think we are on a good way.

Jarthianur commented 4 years ago

The following is the configmap chectl deployed to my minikube cluster.

{
    "CHE_API": "http://che-che.cluster.local/api",
    "CHE_CORS_ALLOWED__ORIGINS": "*",
    "CHE_CORS_ALLOW__CREDENTIALS": "false",
    "CHE_CORS_ENABLED": "false",
    "CHE_DEBUG_SERVER": "true",
    "CHE_HOST": "che-che.cluster.local",
    "CHE_INFRASTRUCTURE_ACTIVE": "kubernetes",
    "CHE_INFRA_KUBERNETES_BOOTSTRAPPER_BINARY__URL": "http://che-che.cluster.local/agent-binaries/linux_amd64/bootstrapper/bootstrapper",
    "CHE_INFRA_KUBERNETES_INGRESS_ANNOTATIONS__JSON": "{\"kubernetes.io/ingress.class\": \"nginx\", \"nginx.ingress.kubernetes.io/rewrite-target\": \"/$1\",\"nginx.ingress.kubernetes.io/ssl-redirect\": \"false\",\"nginx.ingress.kubernetes.io/proxy-connect-timeout\": \"3600\",\"nginx.ingress.kubernetes.io/proxy-read-timeout\": \"3600\"}",
    "CHE_INFRA_KUBERNETES_INGRESS_DOMAIN": "cluster.local",
    "CHE_INFRA_KUBERNETES_INGRESS_PATH__TRANSFORM": "%s(.*)",
    "CHE_INFRA_KUBERNETES_MASTER__URL": "",
    "CHE_INFRA_KUBERNETES_NAMESPACE": "<username>-che",
    "CHE_INFRA_KUBERNETES_NAMESPACE_DEFAULT": "<username>-che",
    "CHE_INFRA_KUBERNETES_POD_SECURITY__CONTEXT_FS__GROUP": "1724",
    "CHE_INFRA_KUBERNETES_POD_SECURITY__CONTEXT_RUN__AS__USER": "1724",
    "CHE_INFRA_KUBERNETES_PVC_PRECREATE__SUBPATHS": "true",
    "CHE_INFRA_KUBERNETES_PVC_QUANTITY": "1Gi",
    "CHE_INFRA_KUBERNETES_PVC_STORAGE__CLASS__NAME": "",
    "CHE_INFRA_KUBERNETES_PVC_STRATEGY": "common",
    "CHE_INFRA_KUBERNETES_SERVER__STRATEGY": "multi-host",
    "CHE_INFRA_KUBERNETES_SERVICE__ACCOUNT__NAME": "che-workspace",
    "CHE_INFRA_KUBERNETES_TLS__ENABLED": "false",
    "CHE_INFRA_KUBERNETES_TLS__SECRET": "",
    "CHE_INFRA_KUBERNETES_TRUST__CERTS": "false",
    "CHE_INFRA_KUBERNETES_WORKSPACE__START__TIMEOUT__MIN": "15",
    "CHE_LIMITS_WORKSPACE_IDLE_TIMEOUT": "1800000",
    "CHE_LOCAL_CONF_DIR": "/etc/conf",
    "CHE_LOGGER_CONFIG": "",
    "CHE_LOGS_APPENDERS_IMPL": "plaintext",
    "CHE_LOGS_DIR": "/data/logs",
    "CHE_LOG_LEVEL": "INFO",
    "CHE_METRICS_ENABLED": "false",
    "CHE_MULTIUSER": "false",
    "CHE_OAUTH_GITHUB_CLIENTID": "",
    "CHE_OAUTH_GITHUB_CLIENTSECRET": "",
    "CHE_PORT": "8080",
    "CHE_TRACING_ENABLED": "false",
    "CHE_WEBSOCKET_ENDPOINT": "ws://che-che.cluster.local/api/websocket",
    "CHE_WEBSOCKET_ENDPOINT__MINOR": "ws://che-che.cluster.local/api/websocket-minor",
    "CHE_WORKSPACE_AUTO_START": "false",
    "CHE_WORKSPACE_DEVFILE__REGISTRY__URL": "http://devfile-registry-che.cluster.local",
    "CHE_WORKSPACE_HTTPS__PROXY": "",
    "CHE_WORKSPACE_HTTP__PROXY": "",
    "CHE_WORKSPACE_JAVA__OPTIONS": "-Xmx2000m",
    "CHE_WORKSPACE_MAVEN__OPTIONS": "-Xmx20000m",
    "CHE_WORKSPACE_NO__PROXY": "",
    "CHE_WORKSPACE_PLUGIN__REGISTRY__URL": "http://plugin-registry-che.cluster.local/v3",
    "JAEGER_ENDPOINT": "http://jaeger-collector:14268/api/traces",
    "JAEGER_REPORTER_MAX_QUEUE_SIZE": "10000",
    "JAEGER_SAMPLER_MANAGER_HOST_PORT": "jaeger:5778",
    "JAEGER_SAMPLER_PARAM": "1",
    "JAEGER_SAMPLER_TYPE": "const",
    "JAEGER_SERVICE_NAME": "che-server",
    "JAVA_OPTS": "-XX:MaxRAMFraction=2 -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -Dsun.zip.disableMemoryMapping=true -Xms20m "
}

As you can see, none of the che URLs there match the scheme mentioned in my previous comment. Also, the che server host is set to "che-che" instead of "che-host".

Is this by intent, or has the kubernetes DNS scheme changed recently? If it is intentional, it would be great to get an explanation.

It seems that I cannot add static entries to kube-dns, so I will try to change the configmap and see what happens...
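
A sketch of that change (the configmap name and pod label are assumptions; adjust to your deployment):

# kubectl -n che edit configmap che
# kubectl -n che delete pod -l app=che    # restart the server so it picks up the new values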

Jarthianur commented 4 years ago

Changing the configmap fixed the host resolution errors, but I'm still not able to spin up workspaces, due to other errors.

The workspace pod log is now

2020/02/14 12:56:20 Broker configuration
2020/02/14 12:56:20   Push endpoint: ws://che-host.che.svc.cluster.local/api/websocket
2020/02/14 12:56:20   Auth enabled: false
2020/02/14 12:56:20   Runtime ID:
2020/02/14 12:56:20     Workspace: workspace1g7ei944640dp8mh
2020/02/14 12:56:20     Environment: default
2020/02/14 12:56:20     OwnerId: che
2020/02/14 12:56:52 Couldn't connect to endpoint 'ws://che-host.che.svc.cluster.local/api/websocket', due to error 'dial tcp 10.105.43.13:80: connect: connection timed out'

while the che server pod log is full of

2020-02-14 12:56:17,761[//10.96.0.1/...]  [ERROR] [k.c.d.i.WatchConnectionManager 268]  - Invalid event type
java.lang.IllegalArgumentException: Pod event timestamp can not be blank
    at org.eclipse.che.workspace.infrastructure.kubernetes.util.PodEvents.convertEventTimestampToDate(PodEvents.java:35)
    at org.eclipse.che.workspace.infrastructure.kubernetes.namespace.KubernetesDeployments$4.happenedAfterWatcherInitialization(KubernetesDeployments.java:570)
    at org.eclipse.che.workspace.infrastructure.kubernetes.namespace.KubernetesDeployments$4.eventReceived(KubernetesDeployments.java:530)
    at org.eclipse.che.workspace.infrastructure.kubernetes.namespace.KubernetesDeployments$4.eventReceived(KubernetesDeployments.java:509)
    at io.fabric8.kubernetes.client.utils.WatcherToggle.eventReceived(WatcherToggle.java:49)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onMessage(WatchConnectionManager.java:232)
    at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:310)
    at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:222)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:265)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:204)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:153)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2020-02-14 12:59:17,838[aceSharedPool-1]  [WARN ] [.i.k.KubernetesInternalRuntime 252]  - Failed to start Kubernetes runtime of workspace workspace1g7ei944640dp8mh. Cause: Plugins installation process timed out
2020-02-14 12:59:17,860[aceSharedPool-1]  [INFO ] [o.e.c.a.w.s.WorkspaceRuntimes 907]   - Workspace 'che:wksp-w35j' with id 'workspace1g7ei944640dp8mh' start failed

This is kind of similar to https://github.com/eclipse/che/issues/15395, but now the workspace does not start at all.

Jarthianur commented 4 years ago

Resolved it on my own now. The solution to all of this: specify a domain other than "cluster.local" to chectl, and set up an external DNS server that always resolves your custom domain to the minikube VM IP address. This is necessary because che requires external DNS resolution even for cluster-internal communication.
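
For example, a local dnsmasq instance on the workstation can provide such a wildcard record (a sketch; the domain and address are placeholders for your own setup):

# /etc/dnsmasq.d/che.conf -- resolve *.kube.lab to the minikube VM
address=/kube.lab/192.168.39.100    # use the output of `minikube ip`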

Please extend the docs on this. One or two sentences wouldn't do any harm...

tolusha commented 4 years ago

@Jarthianur Thank you, we will do so.

tolusha commented 4 years ago

The following scenario works without setting up an external DNS server:

  1. minikube start --vm-driver=virtualbox --memory=6144 --cpus=6 --disk-size=32gb --dns-domain='kube.lab'
  2. kubectl -n kube-system edit configmap coredns and add cluster.local to the kubernetes block (a resolution check is sketched after this list):

     apiVersion: v1
     data:
       Corefile: |
         .:53 {
             errors
             health
             ready
             kubernetes kube.lab cluster.local in-addr.arpa ip6.arpa {
                pods insecure
                fallthrough in-addr.arpa ip6.arpa
                ttl 30
             }
             prometheus :9153
             forward . /etc/resolv.conf
             cache 30
             loop
             reload
             loadbalance
         }
     kind: ConfigMap
  3. chectl server:start --platform minikube --domain $(minikube ip).nip.io
  4. create a workspace
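
To verify that service names resolve under both domains after step 2, a quick check (busybox:1.28 ships a working nslookup):

# kubectl run -it --rm dnscheck --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default.svc.kube.lab
# kubectl run -it --rm dnscheck --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default.svc.cluster.local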

https://github.com/eclipse/che/issues/14404