clusterlink-net / clusterlink

A Gateway for connecting application services in different domains, networks, and cloud infrastructures
https://clusterlink.net

[BUG]: clusterlink dataplane and control plane deployments are not started #632

Closed · huang195 closed 4 weeks ago

huang195 commented 1 month ago

Describe the bug

Following the iperf tutorial, after running this command:

clusterlink deploy peer --name client --ingress=LoadBalancer --ingress-port=30443

I expect the clusterlink dataplane and controlplane pods to be started as a result of creating this CR:

$ oc -n clusterlink-operator get instances.clusterlink.net -o yaml
apiVersion: v1
items:
- apiVersion: clusterlink.net/v1alpha1
  kind: Instance
  metadata:
    creationTimestamp: "2024-06-03T20:16:53Z"
    generation: 1
    labels:
      app.kubernetes.io/created-by: clusterlink
      app.kubernetes.io/instance: cl-instance
      app.kubernetes.io/name: instance
      app.kubernetes.io/part-of: clusterlink
    name: cl-instance
    namespace: clusterlink-operator
    resourceVersion: "10162883"
    uid: 43884864-30a4-4504-8a12-4581fa27d2a1
  spec:
    containerRegistry: ghcr.io/clusterlink-net/
    dataplane:
      replicas: 1
      type: envoy
    ingress:
      port: 30443
      type: LoadBalancer
    logLevel: info
    namespace: clusterlink-system
    tag: latest
kind: List
metadata:
  resourceVersion: ""

But I don't see anything in this namespace:

$ oc -n clusterlink-system get deployments
No resources found in clusterlink-system namespace.
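For anyone else hitting this: before expecting the Instance CR to be reconciled, it may be worth checking whether the operator itself is running. A hedged sketch of the checks (the deployment name `cl-operator` is a guess, not confirmed from the repo; adjust to whatever the CLI actually deploys):

```shell
# Is the operator deployed and healthy at all? If the Instance CR above is
# never reconciled, an absent or crashing operator pod would explain it.
oc -n clusterlink-operator get deployments,pods

# If the operator pod exists, its logs should say why reconciliation fails.
# NOTE: "cl-operator" is an assumed deployment name.
oc -n clusterlink-operator logs deploy/cl-operator --tail=50

# The Instance status/conditions may also carry an error message.
oc -n clusterlink-operator get instances.clusterlink.net cl-instance -o yaml
```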
huang195 commented 1 month ago

To keep things in sync here: I had a call with @kfirtoledo offline. It looks like manually installing the operator from the GitHub repo works, but somehow using the clusterlink CLI doesn't start the operator pod. Once the operator pod was running, `clusterlink deploy peer` had problems starting the controlplane and dataplane pods, each for a different reason. The controlplane was listening on a default port lower than 1024, so on a more strictly configured platform like OCP it doesn't have enough privilege; after Kfir changed it to a higher port number, that problem was solved. The dataplane pod, on the other hand, had a problem accessing a cert via a Kubernetes secret. The root cause of that one wasn't entirely clear to me.
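The privileged-port part is easy to check in isolation: on Linux, binding below 1024 requires CAP_NET_BIND_SERVICE, which OCP's restricted SCC does not grant to pods. A minimal sketch (the port numbers are illustrative, not ClusterLink's actual defaults):

```shell
# A port below 1024 is "privileged": an unprivileged container (as under
# OCP's restricted SCC) cannot bind it, while a high port is fine.
check_port() {
  if [ "$1" -lt 1024 ]; then
    echo "port $1: privileged (bind fails without CAP_NET_BIND_SERVICE)"
  else
    echo "port $1: unprivileged (safe for restricted containers)"
  fi
}

check_port 443    # a low default like this would fail on OCP
check_port 30443  # the ingress port used above is fine
```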