apache / cloudstack

Apache CloudStack is an opensource Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

Kubernetes Service: Docs do not describe Kubernetes service network requirements #4146

Closed ravening closed 3 years ago

ravening commented 4 years ago
ISSUE TYPE
COMPONENT NAME
kubernetes service
CLOUDSTACK VERSION
4.14.0.0
CONFIGURATION

Advanced zone

OS / ENVIRONMENT

Ubuntu 18.04 OS on KVM hypervisor

SUMMARY

Unable to create new Kubernetes cluster

I was able to register CoreOS and add a new supported Kubernetes version, but I am unable to create a cluster because it throws an exception:

2020-06-15 13:14:45,149 WARN  [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-2:ctx-c79b5580 job-25 ctx-1ad489f7) (logid:6e496de8) API endpoint for Kubernetes cluster ID: fe55039f-756b-4a30-968d-aa6dae08aaeb not available
javax.net.ssl.SSLHandshakeException: Remote host terminated the handshake
        at java.base/sun.security.ssl.SSLSocketImpl.handleEOF(SSLSocketImpl.java:1313)
        at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1152)
        at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1055)
        at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:395)
        at java.base/sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:567)
        at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1515)
        at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:250)
        at java.base/java.net.URL.openStream(URL.java:1140)
        at org.apache.commons.io.IOUtils.toString(IOUtils.java:1198)
        at com.cloud.kubernetes.cluster.utils.KubernetesClusterUtil.isKubernetesClusterServerRunning(KubernetesClusterUtil.java:221)
        at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterStartWorker.startKubernetesClusterOnCreate(KubernetesClusterStartWorker.java:542)
        at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.startKubernetesCluster(KubernetesClusterManagerImpl.java:1057)
        at org.apache.cloudstack.api.command.user.kubernetes.cluster.CreateKubernetesClusterCmd.execute(CreateKubernetesClusterCmd.java:273)
        at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:156)
        at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:108)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:586)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:534)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.io.EOFException: SSL peer shut down incorrectly
        at java.base/sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:167)
        at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:108)
        at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1144)
        ... 27 more
STEPS TO REPRODUCE
1. Enable the kubernetes service
2. Create the iso using
/usr/share/cloudstack-common/scripts/util/create-kubernetes-binaries-iso.sh k8s/ 1.16.3 0.7.1 1.12.0 "https://cloud.weave.works/k8s/net?k8s-version=1.12.5" https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0-beta1/aio/deploy/recommended.yaml
3. Register new kubernetes version using
add kubernetessupportedversion name=1.16 semanticversion=1.16.3 url=http://<server name>/setup-v1.16.3.iso mincpunumber=2 minmemory=2048
4. Wait until it is ready
5. Create a Kubernetes cluster from the UI with 2 nodes
EXPECTED RESULTS
Expected kubernetes cluster to be created successfully
ACTUAL RESULTS
The cluster status is stuck in "Starting" forever and the exception above keeps being thrown.
The VMs are created successfully.
weizhouapache commented 4 years ago
$ curl https://10.11.118.126:6443/version
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
$ curl -k https://10.11.118.126:6443/version
{
  "major": "1",
  "minor": "16",
  "gitVersion": "v1.16.10",
  "gitCommit": "f3add640dbcd4f3c33a7749f38baaac0b3fe810d",
  "gitTreeState": "clean",
  "buildDate": "2020-05-20T13:51:56Z",
  "goVersion": "go1.13.9",
  "compiler": "gc",
  "platform": "linux/amd64"
}
shwstppr commented 4 years ago

@ravening can you check cluster deployment using a prebuilt community Kubernetes version ISO from https://download.cloudstack.org/cks/? While creating the ISO you used k8s version 1.16.3, but for weavenet you passed 1.12.5. I'm not sure if that is the issue. Also, check whether the MS is able to SSH to your cluster VMs. CKS needs SSH access to communicate with the k8s cluster master VM.

ravening commented 4 years ago

@shwstppr the MS can't SSH to the master VM in my setup. Does the MS need SSH access to every k8s cluster master VM? Which version of weavenet should I pass for 1.16.3?

shwstppr commented 4 years ago

@ravening The MS needs SSH access to the master node VMs while starting and scaling a cluster. When upgrading a cluster it needs SSH access to all VMs. This must be the problem. The deployment would have failed after the time set in the cloud.kubernetes.cluster.start.timeout global setting (default 3600 seconds, or 1 hour). It needs to connect to the k8s deployment using kubectl on the master VM.

Yes, ideally the same version that you used for k8s. I've not played with different versions myself.
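
The version mismatch mentioned above can be caught before building the ISO. The following is a minimal sketch, not part of CloudStack: the function name is hypothetical, and comparing only the major.minor part is an assumption about what "the same version" needs to mean here.

```python
from urllib.parse import urlparse, parse_qs


def weave_version_mismatch(k8s_version, network_url):
    """Return the k8s-version value from the URL if it disagrees with k8s_version.

    Only major.minor is compared (an assumption). Returns None when the URL
    has no k8s-version parameter or when the versions agree.
    """
    values = parse_qs(urlparse(network_url).query).get("k8s-version")
    if not values:
        return None  # nothing to compare against
    url_version = values[0].lstrip("v")
    wanted = k8s_version.lstrip("v")
    if url_version.split(".")[:2] != wanted.split(".")[:2]:
        return url_version
    return None


# Values taken from the reproduction steps in this issue.
print(weave_version_mismatch(
    "1.16.3", "https://cloud.weave.works/k8s/net?k8s-version=1.12.5"))
```

With the arguments from the reproduction steps, the function reports 1.12.5 as the mismatching value.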

weizhouapache commented 4 years ago

weavenet

@shwstppr @ravening this issue seems to be caused by the certificate. CloudStack generates a self-signed certificate for the k8s cluster and verifies the API server (port 6443) via HTTPS (see my comment above). I have tried the following steps, which work:
1. Stop the k8s cluster
2. Download the k8s config from the UI
3. Use base64 to decode the certificate-authority-data in the config
4. Save the decoded string in /etc/ssl/certs/k8s.cert
5. Start the k8s cluster; it then reaches Running
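
The decode step of this workaround can be sketched as follows. This is only an illustration: the function name is hypothetical, the sample kubeconfig and its certificate body are placeholders rather than real data, and the sketch assumes the kubeconfig holds a single certificate-authority-data field on one line.

```python
import base64
import re


def extract_ca_pem(kubeconfig_text):
    """Return the decoded certificate-authority-data from a kubeconfig string."""
    match = re.search(r"certificate-authority-data:\s*(\S+)", kubeconfig_text)
    if match is None:
        raise ValueError("no certificate-authority-data field found")
    return base64.b64decode(match.group(1)).decode("ascii")


# Placeholder certificate text, not a real certificate.
sample_pem = (
    "-----BEGIN CERTIFICATE-----\n"
    "PLACEHOLDERBODY\n"
    "-----END CERTIFICATE-----\n"
)
encoded = base64.b64encode(sample_pem.encode("ascii")).decode("ascii")
sample_kubeconfig = (
    "apiVersion: v1\n"
    "clusters:\n"
    "- cluster:\n"
    f"    certificate-authority-data: {encoded}\n"
    "    server: https://192.0.2.10:6443\n"
)

print(extract_ca_pem(sample_kubeconfig))
```

The returned text is what would be written to /etc/ssl/certs/k8s.cert in the workaround; the sketch deliberately prints it instead of writing to a system path.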

ravening commented 4 years ago

@weizhouapache but customers can't do all those steps, right? Every time we create a cluster we would need to do all these extra steps, which is cumbersome, and customers won't want to do all these things.

stack1313 commented 4 years ago

Hi, any update on a fix for this? I have the same problem. I tested all the prebuilt community Kubernetes ISO versions with no success. It is also impossible to download the k8s config while the cluster is not deployed; I get the error "Setup is in progress for Kubernetes cluster ID".

PaulAngus commented 4 years ago

@stack1313 where did your coreos template come from?

rohityadavcloud commented 4 years ago

@shwstppr can you triage this? Thanks.

stack1313 commented 4 years ago

Hi,

I think I found the problem in my installation. The issue comes from the network offering/existing isolated network.

In the cluster creation wizard, the first time I chose to deploy on an existing isolated network, and with this configuration I hit the issue. To fix this I created a new network offering with a new isolated network, and after that the deployment worked without issue. The problem does not happen when you leave the network empty in the wizard, because the wizard then deploys a new isolated network with the default Kubernetes network offering.

Maybe the problem comes from the network offering options (like default allow/deny traffic, DHCP, DNS...).

Thanks,

PaulAngus commented 4 years ago

This sounds like it is more of a documentation bug then.

shwstppr commented 4 years ago

@PaulAngus I think the original issue reported here is different from what @stack1313 is reporting. Maybe @ravening can confirm

ravening commented 4 years ago

I will test what stack1313 mentioned and update the result.

crystech commented 4 years ago

Same issue with the error during cluster creation. To complete the cluster creation:

  1. Wait until the nodes are up
  2. Run curl -k https://public_ip_of_the_node:6443/version
  3. This triggers a download of the cert and hence allows ACS to complete the cluster creation, since it is now able to access the API endpoint.

I'm new here, so is there any fix for the current setup at the moment?

The management logs return the following error. Tentatively, I could write a routine script that parses the logs for it and issues a curl -k to install the cert so that cluster creation can complete.

Server returned HTTP response code: 403 for URL: https://ip_of_the_node:6443/version
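
The log-parsing part of such a script could look roughly like this. It is a sketch under assumptions: the function name is hypothetical, the sample log line uses a placeholder address, and the script only prints the curl commands rather than running them.

```python
import re

# Matches the 403 error line quoted above and captures the endpoint URL.
PATTERN = re.compile(
    r"Server returned HTTP response code: 403 for URL: (https://\S+:6443/version)"
)


def endpoints_with_403(log_lines):
    """Return the distinct API endpoint URLs reported with a 403, in order seen."""
    seen = []
    for line in log_lines:
        match = PATTERN.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Placeholder log lines; 192.0.2.10 is a documentation address.
sample_log = [
    "java.io.IOException: Server returned HTTP response code: 403 for URL: https://192.0.2.10:6443/version",
    "some unrelated line",
    "java.io.IOException: Server returned HTTP response code: 403 for URL: https://192.0.2.10:6443/version",
]

for url in endpoints_with_403(sample_log):
    print(f"curl -k {url}")
```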

ravening commented 4 years ago

@shwstppr usually in our production environments the management server doesn't have SSH access to any of the VMs. So how do I fix the issue in that case?

rohityadavcloud commented 3 years ago

For the current model of implementation we expect the mgmt server to be able to SSH to/have access to the CKS/k8s nodes; the feature cannot be supported with an agent or other approaches that may not require the mgmt server to SSH to the nodes. I'm removing the 4.15.0.0 milestone, as such an approach may require a lot of rework. @ravening - any design docs on a new approach as well as PRs are welcome.
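
Whether the management server can reach a cluster VM at all can be probed with a plain TCP connect before creating a cluster. This is a minimal sketch with a hypothetical helper name. Port 6443 is the API port seen in this thread; the SSH port to probe depends on the deployment, so the 2222 in the usage comment is an assumption to verify, not a documented value.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


# Example usage from the management server (addresses and SSH port are assumptions):
#   for port in (2222, 6443):
#       print(port, tcp_reachable("192.0.2.10", port))
```

A successful connect only shows that the port is open from the management server; it does not prove that SSH authentication or the Kubernetes API itself works.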

DaanHoogland commented 3 years ago

@shwstppr I think this is now reduced to a documentation issue. Can you check that and the docs, please?

weizhouapache commented 3 years ago

@DaanHoogland where is the doc for this issue ?

rohityadavcloud commented 3 years ago

@weizhouapache I see some documentation here, but I'm not sure if it covers all you're looking for: http://docs.cloudstack.apache.org/en/latest/plugins/cloudstack-kubernetes-service.html#kubernetes-clusters (this certainly states the types of networks CKS is supported on)

weizhouapache commented 3 years ago

@rhtyd thanks. I tested CloudStack 4.15.0.0 RC3 and this issue still exists in my testing. I followed the steps on the page you posted.

I tested with the official ISO, the official CoreOS template, and the default network offering for Kubernetes clusters. There is an error caused by the self-signed certificate; see line https://github.com/apache/cloudstack/blob/master/plugins/integrations/kubernetes-service/src/main/java/com/cloud/kubernetes/cluster/utils/KubernetesClusterUtil.java#L223

2020-12-28 20:49:45,589 WARN  [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-11:ctx-81fccb14 job-107 ctx-43764632) (logid:9b4ccbea) API endpoint for Kubernetes cluster : test2 not available
java.io.IOException: Server returned HTTP response code: 403 for URL: https://10.135.122.164:6443/version

This seems to be fixed by running "curl -k https://10.135.122.164:6443/version" on the mgt server.

shwstppr commented 3 years ago

@weizhouapache We tried to reproduce this in our test env (with and without secured management servers) but couldn't. Can it be due to the way we provision certificates? I'm not sure how to incorporate the curl -k call in the code. We set certificates for the k8s cluster here: https://github.com/apache/cloudstack/blob/master/plugins/integrations/kubernetes-service/src/main/java/com/cloud/kubernetes/cluster/actionworkers/KubernetesClusterStartWorker.java#L146-L155 Do you see any changes there? cc @ravening @rhtyd @Pearl1594 Also, when you get this error, can you check whether the kube-apiserver-k8s-master pod shows as running at some point in the cluster, using kubectl get pods --all-namespaces?

rohityadavcloud commented 3 years ago

@weizhouapache do you suspect that the code at https://github.com/apache/cloudstack/blob/master/plugins/integrations/kubernetes-service/src/main/java/com/cloud/kubernetes/cluster/utils/KubernetesClusterUtil.java#L223 is not ignoring SSL errors/warnings, and that this may be the cause? However, a 403/forbidden error hints that this is not an SSL error. Can you investigate further?

rohityadavcloud commented 3 years ago

@weizhouapache I tested and found that the isolated network should allow egress to the public internet for clusters to work with ISOs from http://download.cloudstack.org/cks - check and see if you're hitting the same. @davidjumani has fixed the issue here https://github.com/apache/cloudstack/pull/4459 but we haven't pushed newer ISOs (which we'll try to update soon).

Commentary/notes: I followed the docs (http://docs.cloudstack.apache.org/en/latest/plugins/cloudstack-kubernetes-service.html), enabled the global settings and set up the CoreOS template, then created a CKS cluster with k8s v1.16.0 (1 worker node + 1 master node with 2 GB RAM and 2 vCPUs) on a KVM advanced zone env with shared storage on a pre-created isolated network. I saw the following when deploying the cluster:

2020-12-31 07:56:45,638 WARN  [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-2:ctx-7051112a job-3394 ctx-c601630a) (logid:581f7bce) API endpoint for Kubernetes cluster : cks1-ry not available
javax.net.ssl.SSLHandshakeException: Remote host terminated the handshake
    at java.base/sun.security.ssl.SSLSocketImpl.handleEOF(SSLSocketImpl.java:1588)
    at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1416)

After the nodes are up and kubeadm is able to initialise them, I see this in logs:

2020-12-31 07:57:15,701 INFO  [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-2:ctx-7051112a job-3394 ctx-c601630a) (logid:581f7bce) Kubernetes cluster : cks1-ry API has been successfully provisioned, {
  "major": "1",
  "minor": "16",
  "gitVersion": "v1.16.0",
  "gitCommit": "2bd9643cee5b3b3a5ecbd3af49d09018f0773c77",
  "gitTreeState": "clean",
  "buildDate": "2019-09-18T14:27:17Z",
  "goVersion": "go1.12.9",
  "compiler": "gc",
  "platform": "linux/amd64"
}

And after some time I see:

2020-12-31 07:59:50,170 DEBUG [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-2:ctx-7051112a job-3394 ctx-c601630a) (logid:581f7bce) Checking ready nodes for the Kubernetes cluster : cks1-ry with total 2 provisioned nodes
2020-12-31 07:59:50,543 DEBUG [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-2:ctx-7051112a job-3394 ctx-c601630a) (logid:581f7bce) Kubernetes cluster : cks1-ry has total 2 provisioned nodes while 0 ready now

This continued for a while. I then debugged and found that the nodes were unable to fetch container images:

$ sudo ./kubectl get nodes
NAME             STATUS     ROLES    AGE   VERSION
cks1-ry-master   NotReady   master   12m   v1.16.0
cks1-ry-node-1   NotReady   <none>   12m   v1.16.0
$ sudo ./kubectl get pods -n kube-system -o wide
NAME                                     READY   STATUS             RESTARTS   AGE   IP           NODE             NOMINATED NODE   READINESS GATES
coredns-5644d7b6d9-7xcxj                 0/1     Pending            0          13m   <none>       <none>           <none>           <none>
coredns-5644d7b6d9-b8twq                 0/1     Pending            0          13m   <none>       <none>           <none>           <none>
etcd-cks1-ry-master                      1/1     Running            0          12m   10.1.1.150   cks1-ry-master   <none>           <none>
kube-apiserver-cks1-ry-master            1/1     Running            0          12m   10.1.1.150   cks1-ry-master   <none>           <none>
kube-controller-manager-cks1-ry-master   1/1     Running            0          12m   10.1.1.150   cks1-ry-master   <none>           <none>
kube-proxy-bwmhj                         1/1     Running            0          13m   10.1.1.35    cks1-ry-node-1   <none>           <none>
kube-proxy-tkjbp                         1/1     Running            0          13m   10.1.1.150   cks1-ry-master   <none>           <none>
kube-scheduler-cks1-ry-master            1/1     Running            0          12m   10.1.1.150   cks1-ry-master   <none>           <none>
weave-net-g6d9l                          0/2     ImagePullBackOff   0          13m   10.1.1.150   cks1-ry-master   <none>           <none>
weave-net-q4ft5                          0/2     ErrImagePull       0          13m   10.1.1.35    cks1-ry-node-1   <none>           <none>

I checked and added egress allow rules and then manually pulled the images, which fixed that issue:

docker pull docker.io/weaveworks/weave-kube:2.7.0
docker pull docker.io/weaveworks/weave-npc:2.7.0

After this the cluster came up and I was able to do basic tests using kubectl and use k8s dashboard via proxy.
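
The image-pull failure seen above can be spotted mechanically in the kubectl output. The following is a minimal sketch with a hypothetical function name; it assumes the default column layout of kubectl get pods, where STATUS is the third whitespace-separated column, and uses a shortened copy of the output from this comment as sample input.

```python
IMAGE_PULL_STATES = ("ImagePullBackOff", "ErrImagePull")


def pods_with_image_pull_errors(kubectl_output):
    """Return (pod name, status) pairs for pods stuck on image pulls."""
    failing = []
    # Skip the header row, then read NAME and STATUS from each pod row.
    for line in kubectl_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 3 and fields[2] in IMAGE_PULL_STATES:
            failing.append((fields[0], fields[2]))
    return failing


sample_output = """
NAME                            READY   STATUS             RESTARTS   AGE   IP           NODE
coredns-5644d7b6d9-7xcxj        0/1     Pending            0          13m   <none>       <none>
kube-apiserver-cks1-ry-master   1/1     Running            0          12m   10.1.1.150   cks1-ry-master
weave-net-g6d9l                 0/2     ImagePullBackOff   0          13m   10.1.1.150   cks1-ry-master
weave-net-q4ft5                 0/2     ErrImagePull       0          13m   10.1.1.35    cks1-ry-node-1
"""

for name, status in pods_with_image_pull_errors(sample_output):
    print(name, status)
```

If this reports the network plugin pods, missing egress rules on the isolated network are one thing to check, as described above.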

rohityadavcloud commented 3 years ago

@ravening @weizhouapache cc @shwstppr I've added a line here - https://github.com/apache/cloudstack-documentation/pull/174/files (with the newer CKS ISOs that we'll build, we'll bundle additional dependencies (weavenet etc.) so they're not fetched during setup)

weizhouapache commented 3 years ago

@rhtyd thanks a lot Rohit. I will test it again.

rohityadavcloud commented 3 years ago

Cert check/SSL issue fixed in https://github.com/apache/cloudstack/pull/4639 Please re-open if something else was missed (docs?)

vaheed commented 2 years ago

Hello, I don't know whether my error is relevant to this issue or not,

but I'm using a fresh ACS 4.16.1.0 management server and KVM nodes on Ubuntu 20.04.

When trying to create a k8s cluster, the master node and compute node are running, but the cluster stays in Starting forever:

2022-06-05 04:59:55,946 WARN [o.a.c.f.j.i.AsyncJobMonitor] (Timer-0:ctx-8669d7f6) (logid:933baacc) Task (job-326) has been pending for 1083 seconds
2022-06-05 04:59:59,472 WARN [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-7:ctx-c7dbf788 job-326 ctx-2117db03) (logid:ee321178) API endpoint for Kubernetes cluster : k8s011 not available
javax.net.ssl.SSLHandshakeException: Remote host terminated the handshake
Caused by: java.io.EOFException: SSL peer shut down incorrectly

And this is the result on the ACS management server when trying to reach the k8s master node:

curl -k https://172.21.21.108:6443/version
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to 172.21.21.108:6443

abdelouahabb commented 1 year ago

Worked here: I needed to change the endpoint.url global setting from http://localhost:8080/client/api to http://MANAGEMENT_SERVER_IP:8080/client/api
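
A quick way to notice an endpoint.url value like the one replaced above is to test whether its host is a loopback name. This is a small sketch with a hypothetical function name; it assumes the setting value has already been read from the global settings by some other means, and 192.0.2.5 is a placeholder address.

```python
from urllib.parse import urlparse


def endpoint_url_is_loopback(endpoint_url):
    """Return True if the URL's host is localhost or a loopback address."""
    host = urlparse(endpoint_url).hostname or ""
    return host == "localhost" or host.startswith("127.") or host == "::1"


print(endpoint_url_is_loopback("http://localhost:8080/client/api"))
print(endpoint_url_is_loopback("http://192.0.2.5:8080/client/api"))
```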