Closed Jarthianur closed 4 years ago
Hi Jarthianur,
Please try to inspect the error following this guide:
https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
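The key steps from that guide, as a sketch (these commands need a running cluster; the dnsutils manifest URL is the one from the Kubernetes docs):

```shell
# Deploy the dnsutils test pod from the Kubernetes DNS debugging docs
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml

# Check that cluster-internal names resolve at all
kubectl exec -i -t dnsutils -- nslookup kubernetes.default

# Inspect the pod's resolver configuration (search domains, nameserver)
kubectl exec -i -t dnsutils -- cat /etc/resolv.conf

# Look at the CoreDNS logs if lookups fail
kubectl logs -n kube-system -l k8s-app=kube-dns
```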
On Wed, Feb 12, 2020, 19:30 Jarthianur notifications@github.com wrote:
Describe the bug
I've created a local minikube cluster following the documentation https://www.eclipse.org/che/docs/che-7/running-che-locally/, and deployed che to it using chectl. When creating and opening a workspace from any stack, the following log appears until it fails in a timeout.
Successfully assigned che/workspacevljc0zfnpzuh4u76.che-plugin-broker to minikube
Pulling image "quay.io/eclipse/che-plugin-metadata-broker:v3.1.0"
Successfully pulled image "quay.io/eclipse/che-plugin-metadata-broker:v3.1.0"
Created container che-plugin-metadata-broker-v3-1-0
Started container che-plugin-metadata-broker-v3-1-0
Error: Failed to run the workspace: "Plugins installation process timed out"
I've put the results of my further investigation in the "Additional context" section.
Che version
- latest
- nightly
- other: please specify
Steps to reproduce
- Spin up minikube minikube start --vm-driver=kvm2 --memory=6144 --cpus=6 --disk-size=32gb --dns-domain='kube.lab'
- Start che chectl server:start --platform minikube --domain "kube.lab"
- Put che hostnames in /etc/hosts on workstation
- Visit the che server and start a workspace
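Step 3 above could look like this (the VM address and the exact ingress hostnames are examples; the hostnames depend on the chosen domain):

```shell
# Find the minikube VM address
minikube ip    # e.g. 192.168.39.100 (illustrative)

# Then map the che ingress hostnames to it in /etc/hosts on the workstation:
#   192.168.39.100  che-che.kube.lab
#   192.168.39.100  plugin-registry-che.kube.lab
#   192.168.39.100  devfile-registry-che.kube.lab
```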
Expected behavior
The workspace should start and open correctly.
Runtime
- kubernetes (include output of kubectl version)
- Openshift (include output of oc version)
- minikube (minikube version: v1.7.2 , kubectl version: v1.17.2)
- minishift (include output of minishift version and oc version)
- docker-desktop + K8S (include output of docker version and kubectl version)
- other: (please specify)
Screenshots
Installation method
- chectl (server:start --platform minikube --domain "kube.lab")
- che-operator
- minishift-addon
- I don't know
Environment
- my computer
- Windows
- Linux
- macOS
- Cloud
- Amazon
- Azure
- GCE
- other (please specify)
- other: please specify
Additional context
I've tried this on freshly set up, fully updated Debian Buster and Manjaro Linux (amd64). Helm, chectl, minikube and system packages are all up to date. The host firewall is turned off for testing. KVM is properly running and configured. Changing the domain, or leaving it at the default, has no effect on the behavior.
The kubernetes dashboard shows all pods/services as running and healthy, but when starting a workspace in che, the workspace pod fails with the following log.
2020/02/12 11:50:31 Broker configuration
2020/02/12 11:50:31 Push endpoint: ws://che-che.kube.lab/api/websocket
2020/02/12 11:50:31 Auth enabled: false
2020/02/12 11:50:31 Runtime ID:
2020/02/12 11:50:31 Workspace: workspacevljc0zfnpzuh4u76
2020/02/12 11:50:31 Environment: default
2020/02/12 11:50:31 OwnerId: che
2020/02/12 11:50:31 Couldn't connect to endpoint 'ws://che-che.kube.lab/api/websocket', due to error 'dial tcp: lookup che-che.kube.lab on 10.96.0.10:53: no such host'
The kube-dns service is running and mapped to two coredns pods. I've created a dnsutils pod in the che namespace to investigate possible DNS issues. From there, neither nslookup nor dig can resolve the che server hostname, no matter which domain I query.
nslookup che-che.kube.lab
Server: 10.96.0.10 Address: 10.96.0.10#53
** server can't find che-che.kube.lab: NXDOMAIN
I don't know whether this is just a configuration issue, or a fault in che or minikube. As I'm new to kubernetes/che, I don't know where to go from here. Searching the web turns up many issues related to older versions of minikube/che, and their fixes no longer work. The che and minikube documentations also provide no hint on this, nor exact configuration steps. So I'm treating the steps described at https://www.eclipse.org/che/docs/che-7/running-che-locally/ as a correct how-to (at least for first steps), but they are simply not working (no rant, just an observation). It would be great if we could find a solution, and maybe extend the documentation with all necessary steps.
Ok, I think I've found what's causing this. There are multiple issues at play.
First, even though you can specify a DNS domain to both minikube and chectl, the pods' resolv.conf still points to "cluster.local". This is listed as a known issue. I'm not forced to run a custom domain for now, so I'll stick with the default one. Maybe the che documentation should be updated with a note on this, and on how to really deploy a custom domain. As far as I understand, one would have to specify a custom resolv.conf file location in the kubernetes deployments, as minikube does not provide this (see the multiple open issues there).
The next problem is that chectl deploys the actual che server as a service named "che-host", but the che config expects it to be named "che-che". As described in the kubernetes DNS docs, a DNS record matching the service name is added to kube-dns. This seems obvious, but I only found it out after bypassing kube-dns to my host DNS and hardcoding the name there. Of course this broke the cluster-internal DNS, but host resolution then worked. So I reverted the DNS settings (minikube DNS indeed works out of the box) and renamed the che-host service to che-che. Now nslookup returns
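For testing, instead of renaming the deployed service, one could add a second Service named "che-che" pointing at the same pods. This is a hedged sketch: the selector label below is an assumption and must be copied from the real che-host service.

```shell
# Inspect the existing service first to copy its selector:
#   kubectl -n che get svc che-host -o yaml
# Then create an alias service with the name the config expects:
kubectl -n che apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: che-che
  namespace: che
spec:
  selector:
    app: che          # assumption: replace with che-host's actual selector
  ports:
    - name: http
      port: 80
      targetPort: 8080  # CHE_PORT from the configmap
EOF
```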
# nslookup che-che
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: che-che.che.svc.cluster.local
Address: 10.97.119.180
which is correct.
Still, I cannot create a workspace, due to dial tcp: lookup che-che.cluster.local on 10.96.0.10:53: no such host
. But this error seems absolutely legitimate: the host "che-che.cluster.local" does not exist; it is named "che-che.che.svc.cluster.local". The actual DNS record is correct according to https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#services, while the name requested from the workspace container is not (it is missing "che.svc").
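The record scheme from the Kubernetes DNS docs can be illustrated with a small shell sketch (names taken from this issue; the scheme itself is `<service>.<namespace>.svc.<cluster-domain>`):

```shell
# Kubernetes publishes a DNS A record per service under the name
#   <service>.<namespace>.svc.<cluster-domain>
service="che-che"               # service name after the rename
namespace="che"                 # namespace che is deployed in
cluster_domain="cluster.local"  # minikube's default cluster domain

fqdn="${service}.${namespace}.svc.${cluster_domain}"
echo "${fqdn}"    # che-che.che.svc.cluster.local

# "che-che.cluster.local" (what the workspace container queries) is
# missing the "<namespace>.svc" labels, hence the NXDOMAIN answer.
```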
To sum this up: please let me know if there is a fault in my investigation. Also, it would be great if someone could tell me where I could apply my suggestions temporarily to test them.
I hope this helps to resolve the issue; I think we are on the right track.
Following is the configmap deployed by chectl to my minikube cluster.
{
"CHE_API": "http://che-che.cluster.local/api",
"CHE_CORS_ALLOWED__ORIGINS": "*",
"CHE_CORS_ALLOW__CREDENTIALS": "false",
"CHE_CORS_ENABLED": "false",
"CHE_DEBUG_SERVER": "true",
"CHE_HOST": "che-che.cluster.local",
"CHE_INFRASTRUCTURE_ACTIVE": "kubernetes",
"CHE_INFRA_KUBERNETES_BOOTSTRAPPER_BINARY__URL": "http://che-che.cluster.local/agent-binaries/linux_amd64/bootstrapper/bootstrapper",
"CHE_INFRA_KUBERNETES_INGRESS_ANNOTATIONS__JSON": "{\"kubernetes.io/ingress.class\": \"nginx\", \"nginx.ingress.kubernetes.io/rewrite-target\": \"/$1\",\"nginx.ingress.kubernetes.io/ssl-redirect\": \"false\",\"nginx.ingress.kubernetes.io/proxy-connect-timeout\": \"3600\",\"nginx.ingress.kubernetes.io/proxy-read-timeout\": \"3600\"}",
"CHE_INFRA_KUBERNETES_INGRESS_DOMAIN": "cluster.local",
"CHE_INFRA_KUBERNETES_INGRESS_PATH__TRANSFORM": "%s(.*)",
"CHE_INFRA_KUBERNETES_MASTER__URL": "",
"CHE_INFRA_KUBERNETES_NAMESPACE": "<username>-che",
"CHE_INFRA_KUBERNETES_NAMESPACE_DEFAULT": "<username>-che",
"CHE_INFRA_KUBERNETES_POD_SECURITY__CONTEXT_FS__GROUP": "1724",
"CHE_INFRA_KUBERNETES_POD_SECURITY__CONTEXT_RUN__AS__USER": "1724",
"CHE_INFRA_KUBERNETES_PVC_PRECREATE__SUBPATHS": "true",
"CHE_INFRA_KUBERNETES_PVC_QUANTITY": "1Gi",
"CHE_INFRA_KUBERNETES_PVC_STORAGE__CLASS__NAME": "",
"CHE_INFRA_KUBERNETES_PVC_STRATEGY": "common",
"CHE_INFRA_KUBERNETES_SERVER__STRATEGY": "multi-host",
"CHE_INFRA_KUBERNETES_SERVICE__ACCOUNT__NAME": "che-workspace",
"CHE_INFRA_KUBERNETES_TLS__ENABLED": "false",
"CHE_INFRA_KUBERNETES_TLS__SECRET": "",
"CHE_INFRA_KUBERNETES_TRUST__CERTS": "false",
"CHE_INFRA_KUBERNETES_WORKSPACE__START__TIMEOUT__MIN": "15",
"CHE_LIMITS_WORKSPACE_IDLE_TIMEOUT": "1800000",
"CHE_LOCAL_CONF_DIR": "/etc/conf",
"CHE_LOGGER_CONFIG": "",
"CHE_LOGS_APPENDERS_IMPL": "plaintext",
"CHE_LOGS_DIR": "/data/logs",
"CHE_LOG_LEVEL": "INFO",
"CHE_METRICS_ENABLED": "false",
"CHE_MULTIUSER": "false",
"CHE_OAUTH_GITHUB_CLIENTID": "",
"CHE_OAUTH_GITHUB_CLIENTSECRET": "",
"CHE_PORT": "8080",
"CHE_TRACING_ENABLED": "false",
"CHE_WEBSOCKET_ENDPOINT": "ws://che-che.cluster.local/api/websocket",
"CHE_WEBSOCKET_ENDPOINT__MINOR": "ws://che-che.cluster.local/api/websocket-minor",
"CHE_WORKSPACE_AUTO_START": "false",
"CHE_WORKSPACE_DEVFILE__REGISTRY__URL": "http://devfile-registry-che.cluster.local",
"CHE_WORKSPACE_HTTPS__PROXY": "",
"CHE_WORKSPACE_HTTP__PROXY": "",
"CHE_WORKSPACE_JAVA__OPTIONS": "-Xmx2000m",
"CHE_WORKSPACE_MAVEN__OPTIONS": "-Xmx20000m",
"CHE_WORKSPACE_NO__PROXY": "",
"CHE_WORKSPACE_PLUGIN__REGISTRY__URL": "http://plugin-registry-che.cluster.local/v3",
"JAEGER_ENDPOINT": "http://jaeger-collector:14268/api/traces",
"JAEGER_REPORTER_MAX_QUEUE_SIZE": "10000",
"JAEGER_SAMPLER_MANAGER_HOST_PORT": "jaeger:5778",
"JAEGER_SAMPLER_PARAM": "1",
"JAEGER_SAMPLER_TYPE": "const",
"JAEGER_SERVICE_NAME": "che-server",
"JAVA_OPTS": "-XX:MaxRAMFraction=2 -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -Dsun.zip.disableMemoryMapping=true -Xms20m "
}
As you can see, none of the che URLs match the scheme mentioned in my former comment. Also, the che server name is set to "che-che" instead of "che-host".
Is this by intent, or has the kubernetes DNS scheme changed recently? If it is by intent, it would be great to get an explanation on it.
It seems that I cannot add static entries to kube-dns, so I will try changing the configmap and see what happens...
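The configmap change could be applied roughly like this (the configmap and deployment names are assumptions; check them with `kubectl -n che get configmap,deploy`):

```shell
# Edit the che configmap and replace the "che-che.cluster.local" URLs
# with the in-cluster FQDN "che-host.che.svc.cluster.local"
kubectl -n che edit configmap che

# The server only reads the configmap at startup, so restart it
kubectl -n che rollout restart deployment che
```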
Changing the configmap fixed the host resolution errors. But I'm still not able to spin up workspaces due to other errors.
The workspace pod log is now
2020/02/14 12:56:20 Broker configuration
2020/02/14 12:56:20 Push endpoint: ws://che-host.che.svc.cluster.local/api/websocket
2020/02/14 12:56:20 Auth enabled: false
2020/02/14 12:56:20 Runtime ID:
2020/02/14 12:56:20 Workspace: workspace1g7ei944640dp8mh
2020/02/14 12:56:20 Environment: default
2020/02/14 12:56:20 OwnerId: che
2020/02/14 12:56:52 Couldn't connect to endpoint 'ws://che-host.che.svc.cluster.local/api/websocket', due to error 'dial tcp 10.105.43.13:80: connect: connection timed out'
while the che server pod log is full of
2020-02-14 12:56:17,761[//10.96.0.1/...] [ERROR] [k.c.d.i.WatchConnectionManager 268] - Invalid event type
java.lang.IllegalArgumentException: Pod event timestamp can not be blank
at org.eclipse.che.workspace.infrastructure.kubernetes.util.PodEvents.convertEventTimestampToDate(PodEvents.java:35)
at org.eclipse.che.workspace.infrastructure.kubernetes.namespace.KubernetesDeployments$4.happenedAfterWatcherInitialization(KubernetesDeployments.java:570)
at org.eclipse.che.workspace.infrastructure.kubernetes.namespace.KubernetesDeployments$4.eventReceived(KubernetesDeployments.java:530)
at org.eclipse.che.workspace.infrastructure.kubernetes.namespace.KubernetesDeployments$4.eventReceived(KubernetesDeployments.java:509)
at io.fabric8.kubernetes.client.utils.WatcherToggle.eventReceived(WatcherToggle.java:49)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onMessage(WatchConnectionManager.java:232)
at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:310)
at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:222)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:265)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:204)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:153)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2020-02-14 12:59:17,838[aceSharedPool-1] [WARN ] [.i.k.KubernetesInternalRuntime 252] - Failed to start Kubernetes runtime of workspace workspace1g7ei944640dp8mh. Cause: Plugins installation process timed out
2020-02-14 12:59:17,860[aceSharedPool-1] [INFO ] [o.e.c.a.w.s.WorkspaceRuntimes 907] - Workspace 'che:wksp-w35j' with id 'workspace1g7ei944640dp8mh' start failed
This is kind of similar to https://github.com/eclipse/che/issues/15395, but now the workspace does not start at all.
Resolved it on my own now. The solution to all of this is to specify a domain other than "cluster.local" to chectl, and to set up an external DNS server that always resolves your custom domain to the minikube VM's IP address. This is because che requires external DNS resolution even for cluster-internal communication.
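One way to set up such an external DNS server is dnsmasq with a wildcard entry for the custom domain (a sketch; the IP address and file path are illustrative):

```shell
# Find the minikube VM address
minikube ip    # e.g. 192.168.39.100 (illustrative)

# /etc/dnsmasq.d/minikube.conf -- wildcard: *.kube.lab -> minikube VM
#   address=/kube.lab/192.168.39.100

# Reload dnsmasq and point the workstation's resolver at it
sudo systemctl restart dnsmasq
```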
Please extend the docs on this. One or two sentences wouldn't do any harm...
@Jarthianur Thank you, we will do.
The following scenario works without setting up an external DNS server: add the custom domain (here kube.lab) next to cluster.local in the CoreDNS configuration:
apiVersion: v1
data:
Corefile: |
.:53 {
errors
health
ready
kubernetes kube.lab cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
prometheus :9153
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
}
kind: ConfigMap
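That Corefile change can be applied by editing the coredns configmap and letting CoreDNS reload it (a sketch):

```shell
# Add the custom domain next to cluster.local in the kubernetes plugin block
kubectl -n kube-system edit configmap coredns

# The Corefile above enables the `reload` plugin, so CoreDNS picks the
# change up on its own; to force it, restart the pods:
kubectl -n kube-system rollout restart deployment coredns

# Verify from a test pod (dnsutils from the DNS debugging guide):
kubectl exec -i -t dnsutils -- nslookup che-che.kube.lab
```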