apache / cloudstack

Apache CloudStack is an open source Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

Project User kubeadmin is not working with project k8s clusters #6987

Closed aron-ac closed 1 year ago

aron-ac commented 1 year ago
ISSUE TYPE
COMPONENT NAME
Kubernetes Kubeadmin
CLOUDSTACK VERSION
4.17.1.0
CONFIGURATION
OS / ENVIRONMENT

kubernetes 1.24

SUMMARY

It appears https://github.com/apache/cloudstack/issues/6344 started to address the issue of project-based users and Kubernetes, but judging by the fix, it seems to have been applied only to autoscaling.

Inside a project, as a domain admin, kubeadmin does not work from inside the k8s cluster when using kubectl with the cluster's kube.conf. I cannot acquire a new IP address for an ingress controller, and presumably cannot complete any task as kubeadmin, because the CloudStack events show that kubeadmin is never called.

STEPS TO REPRODUCE

I tested this by creating a k8s cluster as an admin account and, from inside the k8s cluster, creating an nginx ingress controller. There was no issue:

% kubectl --namespace default get services -o wide -w nginx-ingress-ingress-nginx-controller
NAME                                     TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE   SELECTOR
nginx-ingress-ingress-nginx-controller   LoadBalancer   10.100.126.176   XX.XX.XX.XX   80:32279/TCP,443:30353/TCP   25s   app.kubernetes.io/component=controller,app.kubernetes.io/instance=nginx-ingress,app.kubernetes.io/name=ingress-nginx

Then I created a project, created a domain admin user for that project, deployed another k8s cluster, and attempted to deploy an nginx ingress controller and a traefik ingress, but the external IP stayed in a pending state:

% kubectl --namespace default get services -o wide -w nginx-ingress-ingress-nginx-controller
NAME                                     TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE     SELECTOR
nginx-ingress-ingress-nginx-controller   LoadBalancer   10.98.170.156   <pending>     80:32737/TCP,443:32628/TCP   5m36s   app.kubernetes.io/component=controller,app.kubernetes.io/instance=nginx-ingress,app.kubernetes.io/name=ingress-nginx

Looking at the CloudStack events, I noticed that with the normal account kubeadmin successfully acquired the new public IP for a load balancer in CloudStack (the nginx ingress in k8s), but with the project's domain admin account kubeadmin was never recognized.

EXPECTED RESULTS

As a project user deploying a k8s cluster I should still be able to use kubectl and access cloudstack kubeadmin

ac-demo % helm install nginx-ingress ingress-nginx/ingress-nginx --set controller.publishService.enabled=true
NAME: nginx-ingress
LAST DEPLOYED: Tue Oct 25 20:40:16 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The ingress-nginx controller has been installed.
It may take a few minutes for the LoadBalancer IP to be available.

then

% kubectl --namespace default get services -o wide -w nginx-ingress-ingress-nginx-controller
NAME                                     TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE   SELECTOR
nginx-ingress-ingress-nginx-controller   LoadBalancer   10.100.126.176   XX.XX.XX.XX   80:32279/TCP,443:30353/TCP   25s   app.kubernetes.io/component=controller,app.kubernetes.io/instance=nginx-ingress,app.kubernetes.io/name=ingress-nginx

ACTUAL RESULTS

ac-demo % helm install nginx-ingress ingress-nginx/ingress-nginx --set controller.publishService.enabled=true
NAME: nginx-ingress
LAST DEPLOYED: Tue Oct 25 20:40:16 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The ingress-nginx controller has been installed.
It may take a few minutes for the LoadBalancer IP to be available.

then

% kubectl --namespace default get services -o wide -w nginx-ingress-ingress-nginx-controller
NAME                                     TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE     SELECTOR
nginx-ingress-ingress-nginx-controller   LoadBalancer   10.98.170.156   <pending>     80:32737/TCP,443:32628/TCP   5m36s   app.kubernetes.io/component=controller,app.kubernetes.io/instance=nginx-ingress,app.kubernetes.io/name=ingress-nginx
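
For anyone scripting this check: the difference between the expected and actual results is the EXTERNAL-IP column. A minimal Python sketch, assuming the plain-text `kubectl get services -o wide` output shown above; `external_ip_pending` is a hypothetical helper, not part of kubectl or CloudStack.

```python
def external_ip_pending(kubectl_output: str) -> bool:
    """Return True if any service row shows <pending> in the EXTERNAL-IP column."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return False
    header = lines[0].split()
    # Locate the EXTERNAL-IP column by name rather than by a fixed position.
    column = header.index("EXTERNAL-IP")
    return any(row.split()[column] == "<pending>" for row in lines[1:])


# Sample rows modelled on the output in this report (selector shortened).
pending_sample = """\
NAME                                     TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE     SELECTOR
nginx-ingress-ingress-nginx-controller   LoadBalancer   10.98.170.156   <pending>     80:32737/TCP,443:32628/TCP   5m36s   app.kubernetes.io/component=controller
"""
assigned_sample = pending_sample.replace("<pending>", "XX.XX.XX.XX")

print(external_ip_pending(pending_sample))
print(external_ip_pending(assigned_sample))
```

The column is looked up by header name so the check keeps working if the column order changes; it relies on the columns before EXTERNAL-IP containing no spaces, which holds for the output above.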
shwstppr commented 1 year ago

@aron-ac Hi, can you please simplify the reproduction steps, maybe in the form of a list?

aron-ac commented 1 year ago

@shwstppr sure

in cloudstack 4.17.1.0

aron-ac commented 1 year ago

@shwstppr have you been able to check back in on this issue?

shwstppr commented 1 year ago

@davidjumani can you please comment on this? For ACS to provision resources based on k8s deployments, won't we need the kubernetes-provider to be set up, or is that done by default now?

davidjumani commented 1 year ago

@aron-ac Can you please provide the ingress-nginx logs?

aron-ac commented 1 year ago

@davidjumani sure, see below. I don't think this is an nginx issue, though, as the same problem occurs with a traefik ingress controller, and autoscale doesn't work properly. I'm fairly confident this is an issue with kubeadmin for projects.

kubectl logs nginx-ingress-ingress-nginx-controller-5b8c45b6f6-5lx8g -n default      
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.5.1
  Build:         d003aae913cc25f375deb74f898c7f3c65c06f05
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.21.6

-------------------------------------------------------------------------------

W1223 15:49:48.515225       8 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1223 15:49:48.515472       8 main.go:209] "Creating API client" host="https://10.96.0.1:443"
I1223 15:49:48.563849       8 main.go:253] "Running in Kubernetes cluster" major="1" minor="24" git="v1.24.0" state="clean" commit="4ce5a8954017644c5420bae81d72b09b735c21f0" platform="linux/amd64"
I1223 15:49:48.875095       8 main.go:104] "SSL fake certificate created" file="/etc/ingress-controller/ssl/default-fake-certificate.pem"
I1223 15:49:48.930114       8 ssl.go:533] "loading tls certificate" path="/usr/local/certificates/cert" key="/usr/local/certificates/key"
I1223 15:49:49.016798       8 nginx.go:260] "Starting NGINX Ingress controller"
I1223 15:49:49.049958       8 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"default", Name:"nginx-ingress-ingress-nginx-controller", UID:"5fc64c71-e757-4636-a641-b3b5a6cd872e", APIVersion:"v1", ResourceVersion:"985", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap default/nginx-ingress-ingress-nginx-controller
I1223 15:49:50.219486       8 nginx.go:303] "Starting NGINX process"
I1223 15:49:50.220052       8 leaderelection.go:248] attempting to acquire leader lease default/nginx-ingress-ingress-nginx-leader...
I1223 15:49:50.221934       8 nginx.go:323] "Starting validation webhook" address=":8443" certPath="/usr/local/certificates/cert" keyPath="/usr/local/certificates/key"
I1223 15:49:50.222420       8 controller.go:168] "Configuration changes detected, backend reload required"
I1223 15:49:50.243757       8 leaderelection.go:258] successfully acquired lease default/nginx-ingress-ingress-nginx-leader
I1223 15:49:50.244193       8 status.go:84] "New leader elected" identity="nginx-ingress-ingress-nginx-controller-5b8c45b6f6-5lx8g"
I1223 15:49:50.340986       8 controller.go:185] "Backend successfully reloaded"
I1223 15:49:50.341116       8 controller.go:196] "Initial sync, sleeping for 1 second"
I1223 15:49:50.341630       8 event.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"nginx-ingress-ingress-nginx-controller-5b8c45b6f6-5lx8g", UID:"6d660be6-c951-4420-a534-9043058bcd5f", APIVersion:"v1", ResourceVersion:"1015", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
aron-ac commented 1 year ago

@davidjumani have you had a chance to look at this any further, or do you have any recommendations?

shwstppr commented 1 year ago

@aron-ac David is on vacation. I tried to reproduce the issue, but for some reason even my k8s cluster created using the admin account was showing the problem:

cloud@admin-k8s-1-control-1855dc8294c:~$ sudo /opt/bin/kubectl --namespace default get services -o wide traefik
NAME      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE   SELECTOR
traefik   LoadBalancer   10.103.202.214   <pending>     80:30209/TCP,443:32344/TCP   11m   <none>

The same happens with a cluster created in a project:

cloud@test-k8s-control-1855a453abb:~$ sudo /opt/bin/kubectl --namespace default get services -o wide traefik
NAME      TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE     SELECTOR
traefik   LoadBalancer   10.106.206.45   <pending>     80:30812/TCP,443:30551/TCP   8m38s   <none>

I'll try to investigate this and keep you posted.

aron-ac commented 1 year ago

@shwstppr I actually noticed the same issue yesterday as well: kubeadmin wasn't working in regular accounts, so things like autoscale, pods, and ingress controllers stay in a pending state. This is a pretty critical bug at this point because CKS is effectively no longer working.

aron-ac commented 1 year ago
kubectl --namespace zammad port-forward $POD_NAME 8080:8080
error: unable to forward port because pod is not running. Current status=Pending
aron-ac commented 1 year ago

@davidjumani hope you had a great vacation. This bug has seemingly gotten worse in my environment: I'm now seeing the same issue with both regular users and project users. From the k8s cluster I am not able to provision or change any resources in CloudStack, as you would expect to be able to do for an ingress controller, for example.

I tried disabling and re-enabling the Kubernetes service (no change), and I also tried creating a new domain so that a new kubeadmin account would be created (still no change).

davidjumani commented 1 year ago

@aron-ac I'll have a look and get back with a fix or any further questions

aron-ac commented 1 year ago

thanks @davidjumani !

aron-ac commented 1 year ago

@davidjumani I believe we have a decent root cause analysis...

For a project user spinning up a k8s cluster and attempting to create an ingress controller:

Events:
  Type     Reason                  Age                 From                Message
  ----     ------                  ----                ----                -------
  Normal   EnsuringLoadBalancer    28s (x6 over 3m4s)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  28s (x6 over 3m3s)  service-controller  Error syncing load balancer: failed to ensure load balancer: could not find network              
aron-ac commented 1 year ago

@davidjumani we tried accessing the API as that kubeadmin account and found that it is unable to see what it needs to (networks, or even the Kubernetes cluster itself); it can only see users in that project:

(K8Sadmin) 🐱 > sync
Discovered 335 APIs
(K8Sadmin) 🐱 > list networks 
(K8Sadmin) 🐱 >  
(K8Sadmin) 🐱 > list volumes
(K8Sadmin) 🐱 > list virtualmachines 
(K8Sadmin) 🐱 > list vpcs 
(K8Sadmin) 🐱 > list kubernetesclusters 
(K8Sadmin) 🐱 > list users filter=account
{
  "count": 3,
  "user": [
    {
      "account": "XXX-XXX-1007"
    },
    {
      "account": "XXX-XXX-1007"
    },
    {
      "account": "XXX-XXX-1007"
    }
  ]
}
(K8Sadmin) 🐱 > list users filter=account,username
{
  "count": 3,
  "user": [
    {
      "account": "XXX-XXX-1007",
      "username": "XXX-XXX-1007"
    },
    {
      "account": "XXX-XXX-1007",
      "username": "XXX-XXX-1007-kubeadmin"
    },
    {
      "account": "XXX-XXX-1007",
      "username": "YYY-YYY-3099328"
    }
  ]
}
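
To reproduce the empty listings outside of cmk, the same calls can be made as signed API requests. The sketch below is an illustration only: it follows the request-signing scheme described in the CloudStack API documentation (sort the parameters, lowercase the query string, HMAC-SHA1 with the secret key, base64-encode) and builds a `listNetworks` query with and without the optional `projectid` parameter. The function names, credentials, and project UUID are placeholders, and whether passing `projectid` changes the result for this kubeadmin user is an assumption to verify, not something established here.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def sign_request(params: dict, secret_key: str) -> str:
    """Return the base64-encoded HMAC-SHA1 signature for a CloudStack API call."""
    # Sort parameters by name, URL-encode the values, then lowercase the
    # whole query string before signing it with the secret key.
    query = "&".join(
        f"{key}={quote(str(value), safe='')}"
        for key, value in sorted(params.items(), key=lambda kv: kv[0].lower())
    )
    digest = hmac.new(secret_key.encode(), query.lower().encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


def build_query(command: str, api_key: str, secret_key: str, **extra) -> str:
    """Return the full query string, including the URL-encoded signature."""
    params = {"command": command, "apikey": api_key, "response": "json", **extra}
    signature = sign_request(params, secret_key)
    encoded = "&".join(f"{k}={quote(str(v), safe='')}" for k, v in params.items())
    return f"{encoded}&signature={quote(signature, safe='')}"


# Placeholder credentials and project UUID, not real values.
without_project = build_query("listNetworks", "API_KEY", "SECRET_KEY")
with_project = build_query(
    "listNetworks", "API_KEY", "SECRET_KEY",
    projectid="00000000-0000-0000-0000-000000000000",
)
print(without_project)
print(with_project)
```

Appending either string to the management server's API endpoint and comparing the two responses would show whether the project's networks only become visible once `projectid` is supplied.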
aron-ac commented 1 year ago

@davidjumani we're going to look at permissions now to see if we can suggest a patch, but we assume you have a better understanding and can accomplish what needs to get done, so we wanted to give you all the info we have.

We should ensure that project-based kubeadmin accounts have the correct access to the API in order to fully orchestrate jobs between project clusters and CloudStack.

aron-ac commented 1 year ago

Message logged by the management server when calling ‘list networks’ in CMK:

2023-01-04 19:44:23,310 DEBUG [o.a.c.a.BaseCmd] (qtp1418620248-2700:ctx-e6d05a5a ctx-471cd50a ctx-d7a7e92d) (logid:26f3b434) Ignoring parameter displaynetwork as the caller is not authorized to pass it in
aron-ac commented 1 year ago

cloud-controller-manager is deployed as a helper pod to reach back to the ACS API; it is also complaining:

E0104 20:38:19.735805       1 controller.go:244] error processing service default/nginx-ingress-ingress-nginx-controller (will retry): failed to ensure load balancer: could not find network 
I0104 20:38:19.736182       1 event.go:278] Event(v1.ObjectReference{Kind:"Service", Namespace:"default", Name:"nginx-ingress-ingress-nginx-controller", UID:"97492914-626f-4f0d-bc6b-33b643803fdd", APIVersion:"v1", ResourceVersion:"8671", FieldPath:""}): type: 'Warning' reason: 'SyncLoadBalancerFailed' Error syncing load balancer: failed to ensure load balancer: could not find network

The controller code that throws the same error:

// associatePublicIPAddress associates a new IP and sets the address and its ID.
func (lb *loadBalancer) associatePublicIPAddress() error {
    klog.V(4).Infof("Allocate new IP for load balancer: %v", lb.name)
    // If a network belongs to a VPC, the IP address needs to be associated with
    // the VPC instead of with the network.
    network, count, err := lb.Network.GetNetworkByID(lb.networkID, cloudstack.WithProject(lb.projectID))
    if err != nil {
        if count == 0 {
            return fmt.Errorf("could not find network %v", lb.networkID)
        }
        return fmt.Errorf("error retrieving network: %v", err)
    }
nate-ac commented 1 year ago

A little more RCA here:

When a project is created, a "PrjAcct-" account is also created and given ownership of the project. This account is created with the "RO Admin" role.

https://github.com/apache/cloudstack/blob/20306d612928712e5354bad57691b5fe4e1f59a9/server/src/main/java/com/cloud/projects/ProjectManagerImpl.java#L266

                //Create an account associated with the project
                StringBuilder acctNm = new StringBuilder("PrjAcct-");
                acctNm.append(name).append("-").append(ownerFinal.getDomainId());

                Account projectAccount = _accountMgr.createAccount(acctNm.toString(), Account.Type.PROJECT, null, domainId, null, null, UUID.randomUUID().toString());

                Project project = _projectDao.persist(new ProjectVO(name, displayText, ownerFinal.getDomainId(), projectAccount.getId()));

                //assign owner to the project
                assignAccountToProject(project, ownerFinal.getId(), ProjectAccount.Role.Admin,
                        Optional.ofNullable(finalUser).map(User::getId).orElse(null),  null);

        if (project != null) {
            CallContext.current().setEventDetails("Project id=" + project.getId());
            CallContext.current().putContextParameter(Project.class, project.getUuid());
        }

Then a second account is created, added to the project, and set as "Domain Admin". Within that account is the kubeadmin user. This user makes the API calls to set up the nginx ingress controller, as shown in the comments above. Some of the API calls return an empty response, namely listNetworks.

Changing the "PrjAcct-" role from "RO Admin" to "Domain Admin" somehow gives the kubeadmin user the access it needs to "see" the resources and interact with them.

So either there is something wrong with project accounts accessing the resources owned by the "PrjAcct-" RO Admin account, or the account is being created with insufficient access.

Things we noted while troubleshooting:

assignToLoadBalancerRule
associateIpAddress
deleteFirewallRule
deleteLoadBalancerRule
disassociateIpAddress
listFirewallRules
listLoadBalancerRules
listNetworks
listVirtualMachines
queryAsyncJobResult
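
If the list above is read as the API calls the kubeadmin user needs to be allowed to make (an assumption; the thread does not label the list), a role's rules can be checked against it mechanically. A minimal Python sketch; `missing_apis` and the example rule set are hypothetical and do not reflect the actual rules of any CloudStack role.

```python
# API names copied from the list above.
NOTED_APIS = {
    "assignToLoadBalancerRule",
    "associateIpAddress",
    "deleteFirewallRule",
    "deleteLoadBalancerRule",
    "disassociateIpAddress",
    "listFirewallRules",
    "listLoadBalancerRules",
    "listNetworks",
    "listVirtualMachines",
    "queryAsyncJobResult",
}


def missing_apis(allowed_apis):
    """Return the noted API names that are absent from allowed_apis, sorted."""
    return sorted(NOTED_APIS - set(allowed_apis))


# Hypothetical read-only rule set, for illustration only.
read_only_rules = [
    "listNetworks",
    "listVirtualMachines",
    "listFirewallRules",
    "listLoadBalancerRules",
    "queryAsyncJobResult",
]
print(missing_apis(read_only_rules))
```

An empty result would mean the role permits every call in the list, which would point away from role permissions as the cause.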
aron-ac commented 1 year ago

@davidjumani wondering if you've had any additional time to investigate

weizhouapache commented 1 year ago

@davidjumani are you working on this? If not, I will have a look.

davidjumani commented 1 year ago

Hi @aron-ac, sorry for the delay. I'll have a look at this. It appears as though it could be an issue with the cloud provider.

weizhouapache commented 1 year ago

Reference

@davidjumani It seems project-id is not set in cloud-config (see https://github.com/apache/cloudstack-kubernetes-provider#kubernetes)

The script plugins/integrations/kubernetes-service/src/main/resources/script/deploy-cloudstack-secret does not support projectid
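
For context, the provider reads its CloudStack credentials from a `cloud-config` file with a `[Global]` section, and `project-id` is an optional key there according to the provider README linked above. A minimal Python sketch of rendering such a file with and without the project; the key names are taken from that README as best understood here and should be checked against it, `render_cloud_config` and all values are hypothetical placeholders, and this is not the `deploy-cloudstack-secret` script itself.

```python
import configparser
import io
from typing import Optional


def render_cloud_config(api_url: str, api_key: str, secret_key: str,
                        project_id: Optional[str] = None) -> str:
    """Render an INI-style cloud-config; project-id is written only when given."""
    # Interpolation is disabled so '%' characters in values are kept literally.
    config = configparser.ConfigParser(interpolation=None)
    config["Global"] = {
        "api-url": api_url,
        "api-key": api_key,
        "secret-key": secret_key,
    }
    if project_id:
        config["Global"]["project-id"] = project_id
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Placeholder endpoint, credentials, and project UUID.
account_config = render_cloud_config(
    "https://cloudstack.example.com/client/api", "API_KEY", "SECRET_KEY")
project_config = render_cloud_config(
    "https://cloudstack.example.com/client/api", "API_KEY", "SECRET_KEY",
    project_id="00000000-0000-0000-0000-000000000000")
print(project_config)
```

The only difference between the two renderings is the `project-id` line, which is the setting the comment above identifies as missing for clusters deployed in a project.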

davidjumani commented 1 year ago

Hi @aron-ac, I've created a fix for the issue. Thanks @weizhouapache for identifying the problem.

aron-ac commented 1 year ago

Thanks all!