clastix / kamaji

Kamaji is the Hosted Control Plane Manager for Kubernetes.
https://kamaji.clastix.io
Apache License 2.0

Nothing happens with servicetype nodeport #432

Closed florent1s closed 2 months ago

florent1s commented 3 months ago

Hello, I observed an unexpected behavior (for me at least ;) ) while playing around with Kamaji (v0.4.1): it seems that when deploying a TenantControlPlane with serviceType set to NodePort, no resources get created (no pods, no service, etc.), but as soon as I switch it to ClusterIP or LoadBalancer, things kick into gear and all resources are up within a few seconds.

I used the TenantControlPlane YAML file shown in the getting started guide. I'm not sure if this is a bug, bad config on my part, or if NodePort is no longer supported (the getting started guide mentions it as a viable option).

If there is some more information that could help here I'd gladly try to provide it.

Have a nice day!

prometherion commented 3 months ago

Hey @florent1s, thanks for opening your first issue on Kamaji.

We need to document better how to use Kamaji with NodePort services, mostly because we took too many things for granted. When using NodePort, any Service can be reached through any available node address by targeting the selected port. This is very convenient for web applications, but trickier with Kubernetes, since we always need an advertised address that will populate the EndpointSlice for the kubernetes.default.svc service.

Kamaji has very limited RBAC applied on the management cluster, and we don't know the node IP where the Tenant Control Plane instance is running, so a prerequisite for this kind of Service exposure is manually setting a VIP in the /spec/networkProfile/address field: it could be a VIP targeting all the nodes, or the address of a node which will never leave the cluster.
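
For instance, a minimal sketch with only the relevant fields shown (the address below is a placeholder VIP, and the TCP name is hypothetical; the remaining fields follow the getting started manifest):

apiVersion: kamaji.clastix.io/v1alpha1
kind: TenantControlPlane
metadata:
  name: tenant-00
spec:
  controlPlane:
    service:
      serviceType: NodePort
  networkProfile:
    # placeholder: a VIP targeting all the nodes, or the address
    # of a node which will never leave the cluster
    address: 10.0.0.100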

I need to test whether we can take advantage of the Downward API to inject the running node's IP as the advertised address, but I suspect there will be a chicken-and-egg problem with the Kamaji handlers, which expect a valid IP to generate certificates prior to Pod execution.
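
For context, the Downward API mechanism in question looks roughly like this (a generic Kubernetes sketch with placeholder names, not Kamaji code):

apiVersion: v1
kind: Pod
metadata:
  name: downward-api-sketch
spec:
  containers:
  - name: example
    image: registry.k8s.io/pause:3.9
    env:
    # status.hostIP is only resolved once the Pod is scheduled onto a node,
    # i.e. after Kamaji would already need the address to generate certificates
    - name: ADVERTISE_ADDRESS
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP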

florent1s commented 3 months ago

Hey, thanks for your reply. I should have spent more time looking through the API reference, as that spec.networkProfile.address field is actually what I needed for my setup.

I tried some things out, but no matter what IP I gave the TCP, as long as serviceType was set to NodePort it would not enter the provisioning state. I tried using a node's internal IP as well as the IP of a service targeting all the nodes, but maybe I missed something there.

I ultimately got my setup working by creating a NodePort Service manually and setting serviceType to LoadBalancer (which remains pending, as my cluster has no LB available, but that's not a problem since I only need it to allow the resource provisioning to take place, so ClusterIP would certainly work as well). I then gave the TCP (under spec.networkProfile) the address of one of the nodes and the port defined in the NodePort Service.

I don't think this is the type of configuration you envisioned, but it got things working as desired, and given that this setup is only a POC it's not an issue if it's a bit hacky for now ;). I will look into using an Ingress as well; it would probably do the trick too ...

prometherion commented 3 months ago

I guess the problem you stumbled upon is related to the Kubernetes port you can configure in tcp.spec.networkProfile.port, which has to fall within your node port range (typically 30000–32767): if you're using 6443, it will fail, since Kamaji does not perform port address translation, in order to keep the networking simple.
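
Concretely, a hedged sketch with only the relevant fields (30100 is just an arbitrary in-range port):

spec:
  networkProfile:
    address: x.x.x.43 # node or VIP address
    port: 30100       # must fall within the NodePort range (30000-32767);
                      # the default 6443 will not work, since Kamaji performs
                      # no port address translation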

I'm not sure if this is the case, since you didn't share any manifests or Kamaji logs that would help in debugging your situation, although I feel confident it is.

Ingress exposure is even trickier, especially if you plan to join worker nodes, since the kubernetes.default.svc endpoint requires an IPv4 value, not the FQDN offered by an Ingress.

My suggestion would be to set up a local load balancer, even a simple MetalLB that just works, and give Kamaji a try while reading the logs, which offer a minimal set of information.
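
For reference, a minimal MetalLB layer-2 setup would look something like this (a sketch assuming MetalLB v0.13+ with its CRD-based configuration; the pool name and address range are placeholders, use addresses you own):

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: kamaji-pool
  namespace: metallb-system
spec:
  addresses:
  - 10.0.0.200-10.0.0.210 # placeholder range
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: kamaji-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - kamaji-pool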

From our side, we can expand the documentation with some tutorials and stop taking Kubernetes internals and best practices for granted; it's something we have in the pipeline, although it takes time. I'd be happy if you could contribute to the documentation by sharing your learnings.

florent1s commented 3 months ago

This is what my deployment looks like; I made sure to use the ports defined in the NodePort Service. But it is still only when I change serviceType to ClusterIP or LoadBalancer that the pods start running (and I can then make use of the NodePort to interact with/connect to the control plane, as it is reachable at x.x.x.43:30100 as expected). Do you see any flaws in the TCP or somewhere else?

I should indeed have some interesting things to share in a few weeks' time, but there are still some hurdles to overcome until then ;)

apiVersion: v1
kind: Namespace
metadata:
  labels:
    kubernetes.io/metadata.name: test
  name: test
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

---

apiVersion: kamaji.clastix.io/v1alpha1
kind: TenantControlPlane
metadata:
  name: tenant-02
  namespace: test
  labels:
    tenant.clastix.io: tenant-02
spec:
  dataStore: default
  controlPlane:
    deployment:
      replicas: 1
      additionalMetadata:
        labels:
          tenant.clastix.io: tenant-02
      extraArgs:
        apiServer: []
        controllerManager: []
        scheduler: []
      resources:
        apiServer:
          requests:
            cpu: 250m
            memory: 512Mi
          limits: {}
        controllerManager:
          requests:
            cpu: 125m
            memory: 256Mi
          limits: {}
        scheduler:
          requests:
            cpu: 125m
            memory: 256Mi
          limits: {}
    service:
      additionalMetadata:
        labels:
          tenant.clastix.io: tenant-02
      serviceType: NodePort
  kubernetes:
    version: v1.29.1
    kubelet:
      cgroupfs: systemd
    admissionControllers:
      - ResourceQuota
      - LimitRanger
  networkProfile:
    address: x.x.x.43 # a worker node
    port: 30100
    certSANs:
    - tenant-02.clastix.labs
    - x.x.x.43
    serviceCidr: 10.96.0.0/16
    podCidr: 10.36.0.0/16
    dnsServiceIPs:
    - 10.96.0.10
  addons:
    coreDNS: {}
    kubeProxy: {}
    konnectivity:
      server:
        port: 30101
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits: {}

---

kind: Service
apiVersion: v1
metadata:
  name: tenant-02-nodeport
  namespace: test
spec:
  selector:
    kamaji.clastix.io/name: tenant-02 
  ports:
  - name: kube-apiserver
    nodePort: 30100
    port: 30100
    protocol: TCP
    targetPort: 30100
  - name: konnectivity-server
    nodePort: 30101
    port: 30101
    protocol: TCP
    targetPort: 30101
  type: NodePort

prometherion commented 2 months ago

I tried your manifest and everything worked as expected.

$: kubectl get tenantcontrolplanes.kamaji.clastix.io
NAME        VERSION   STATUS   CONTROL-PLANE ENDPOINT   KUBECONFIG                   DATASTORE   AGE
tenant-02   v1.29.1   Ready    172.18.0.2:30100         tenant-02-admin-kubeconfig   default     106s

As I remarked previously:

I'm not sure if this is the case, since you didn't share any manifests or Kamaji logs that would help in debugging your situation, although I feel confident it is.

My suggestion would be to set up a local load balancer, even a simple MetalLB that just works, and give Kamaji a try while reading the logs, which offer a minimal set of information.

If you need support, we have the Slack workspace channel; the Issue section is mostly used to track down confirmed bugs, not for giving support.

Please reopen this in case there's an actual bug.

florent1s commented 2 months ago

I did a bit of extra testing and found that the file I shared above actually works as long as the Service name is the same as the TCP's name (tenant-02 in this case), and it also works when not specifying the NodePort Service at all and letting Kamaji create it instead, as long as the ports specified for the API server and Konnectivity server are in the valid range for a NodePort.

And I'm also pretty sure I now understand how it came to this mess: if I remember correctly, I first tried to create a TCP with service type NodePort but left the default ports (6443 and 8132, I think), which are not valid for a NodePort, so no Service got created and the TCP couldn't start any pods. Seeing this, I then decided to create my own NodePort Service, which forced me to use valid ports, but because I chose the wrong name for said Service the TCP didn't recognize it and also couldn't create its own Service, as the ports were already in use by that wrongly named Service...
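
For anyone landing here later, the two working setups boil down to the following sketch, derived from the manifest and findings above (only the relevant fields shown):

# Option 1: let Kamaji create the Service itself, keeping both ports
# within the NodePort range (30000-32767)
spec:
  controlPlane:
    service:
      serviceType: NodePort
  networkProfile:
    port: 30100
  addons:
    konnectivity:
      server:
        port: 30101
---
# Option 2: create the Service manually, naming it exactly after the
# TenantControlPlane so Kamaji recognizes it
kind: Service
apiVersion: v1
metadata:
  name: tenant-02 # must match the TenantControlPlane name
  namespace: test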

Rookie mistakes, I guess, but I had just about 2 months of experience with k8s at that point, so that was bound to happen :)

Anyway, thanks for your help and advice, and have a great day!