@aron-ac Hi, can you please simplify the reproduction steps, maybe in the form of a list?
@shwstppr sure
in CloudStack 4.17.1.0
@shwstppr have you been able to check back in on this issue?
@davidjumani can you please comment on this? For ACS to provision resources based on k8s deployments, won't we need the kubernetes-provider to be set up, or is that done by default now?
@aron-ac Can you please provide the ingress-nginx logs?
@davidjumani sure, see below. I don't think this is an nginx issue though, as the same problem occurs with a traefik ingress controller, and autoscale doesn't work properly either. I'm fairly confident this is an issue with kubeadmin for projects.
kubectl logs nginx-ingress-ingress-nginx-controller-5b8c45b6f6-5lx8g -n default
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: v1.5.1
Build: d003aae913cc25f375deb74f898c7f3c65c06f05
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.21.6
-------------------------------------------------------------------------------
W1223 15:49:48.515225 8 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I1223 15:49:48.515472 8 main.go:209] "Creating API client" host="https://10.96.0.1:443"
I1223 15:49:48.563849 8 main.go:253] "Running in Kubernetes cluster" major="1" minor="24" git="v1.24.0" state="clean" commit="4ce5a8954017644c5420bae81d72b09b735c21f0" platform="linux/amd64"
I1223 15:49:48.875095 8 main.go:104] "SSL fake certificate created" file="/etc/ingress-controller/ssl/default-fake-certificate.pem"
I1223 15:49:48.930114 8 ssl.go:533] "loading tls certificate" path="/usr/local/certificates/cert" key="/usr/local/certificates/key"
I1223 15:49:49.016798 8 nginx.go:260] "Starting NGINX Ingress controller"
I1223 15:49:49.049958 8 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"default", Name:"nginx-ingress-ingress-nginx-controller", UID:"5fc64c71-e757-4636-a641-b3b5a6cd872e", APIVersion:"v1", ResourceVersion:"985", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap default/nginx-ingress-ingress-nginx-controller
I1223 15:49:50.219486 8 nginx.go:303] "Starting NGINX process"
I1223 15:49:50.220052 8 leaderelection.go:248] attempting to acquire leader lease default/nginx-ingress-ingress-nginx-leader...
I1223 15:49:50.221934 8 nginx.go:323] "Starting validation webhook" address=":8443" certPath="/usr/local/certificates/cert" keyPath="/usr/local/certificates/key"
I1223 15:49:50.222420 8 controller.go:168] "Configuration changes detected, backend reload required"
I1223 15:49:50.243757 8 leaderelection.go:258] successfully acquired lease default/nginx-ingress-ingress-nginx-leader
I1223 15:49:50.244193 8 status.go:84] "New leader elected" identity="nginx-ingress-ingress-nginx-controller-5b8c45b6f6-5lx8g"
I1223 15:49:50.340986 8 controller.go:185] "Backend successfully reloaded"
I1223 15:49:50.341116 8 controller.go:196] "Initial sync, sleeping for 1 second"
I1223 15:49:50.341630 8 event.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"nginx-ingress-ingress-nginx-controller-5b8c45b6f6-5lx8g", UID:"6d660be6-c951-4420-a534-9043058bcd5f", APIVersion:"v1", ResourceVersion:"1015", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
@davidjumani have you had a chance to look at this any further? Do you have any recommendations?
@aron-ac David is on vacation. I tried to reproduce the issue, but for some reason even my k8s cluster created using the admin account was showing the problem:
cloud@admin-k8s-1-control-1855dc8294c:~$ sudo /opt/bin/kubectl --namespace default get services -o wide traefik
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
traefik LoadBalancer 10.103.202.214 <pending> 80:30209/TCP,443:32344/TCP 11m <none>
Same with a cluster created in a project:
cloud@test-k8s-control-1855a453abb:~$ sudo /opt/bin/kubectl --namespace default get services -o wide traefik
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
traefik LoadBalancer 10.106.206.45 <pending> 80:30812/TCP,443:30551/TCP 8m38s <none>
I'll try to investigate this and keep you posted.
@shwstppr I actually noticed the same issue yesterday as well; kubeadmin wasn't working in regular accounts either, so things like autoscaling, pods, and ingress controllers stay in a pending state. This is a pretty critical bug at this point because CKS is effectively no longer working:
kubectl --namespace zammad port-forward $POD_NAME 8080:8080
error: unable to forward port because pod is not running. Current status=Pending
@davidjumani hope you had a great vacation. This bug has seemingly gotten worse in my environment; I'm seeing the same issue with both regular users and project users. From the k8s cluster I am not able to provision or change any resources in CloudStack, as you would expect to be able to do for an ingress controller, for example.
I tried disabling and re-enabling the Kubernetes service - no change - and I also tried creating a new domain so that a new kubeadmin account would be created - still no change.
@aron-ac I'll have a look and get back with a fix or any further questions
thanks @davidjumani !
@davidjumani I believe we have a decent root cause analysis...
For a project user spinning up a k8s cluster and attempting to create an ingress controller, the service events show:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal EnsuringLoadBalancer 28s (x6 over 3m4s) service-controller Ensuring load balancer
Warning SyncLoadBalancerFailed 28s (x6 over 3m3s) service-controller Error syncing load balancer: failed to ensure load balancer: could not find network
@davidjumani so we tried accessing the API as that kubeadmin account and found it is unable to see what it needs to (networks, or even the kubernetes cluster itself); it can only see users in that project:
(K8Sadmin) 🐱 > sync
Discovered 335 APIs
(K8Sadmin) 🐱 > list networks
(K8Sadmin) 🐱 >
(K8Sadmin) 🐱 > list volumes
(K8Sadmin) 🐱 > list virtualmachines
(K8Sadmin) 🐱 > list vpcs
(K8Sadmin) 🐱 > list kubernetesclusters
(K8Sadmin) 🐱 > list users filter=account
{
"count": 3,
"user": [
{
"account": "XXX-XXX-1007"
},
{
"account": "XXX-XXX-1007"
},
{
"account": "XXX-XXX-1007"
}
]
}
(K8Sadmin) 🐱 > list users filter=account,username
{
"count": 3,
"user": [
{
"account": "XXX-XXX-1007",
"username": "XXX-XXX-1007"
},
{
"account": "XXX-XXX-1007",
"username": "XXX-XXX-1007-kubeadmin"
},
{
"account": "XXX-XXX-1007",
"username": "YYY-YYY-3099328"
}
]
}
@davidjumani we're going to look at perms now to see if we can suggest a patch, but assuming you probably have a better understanding and can accomplish what needs to get done, we wanted to give you all the info we have.
We should ensure that project-based kubeadmin accounts have the correct access to the API in order to fully orchestrate jobs between project clusters and CloudStack.
message from calling ‘list networks’ in CMK:
2023-01-04 19:44:23,310 DEBUG [o.a.c.a.BaseCmd] (qtp1418620248-2700:ctx-e6d05a5a ctx-471cd50a ctx-d7a7e92d) (logid:26f3b434) Ignoring parameter displaynetwork as the caller is not authorized to pass it in
cloud-controller-manager is deployed as a helper pod to reach back to the ACS API, and it's also complaining:
E0104 20:38:19.735805 1 controller.go:244] error processing service default/nginx-ingress-ingress-nginx-controller (will retry): failed to ensure load balancer: could not find network
I0104 20:38:19.736182 1 event.go:278] Event(v1.ObjectReference{Kind:"Service", Namespace:"default", Name:"nginx-ingress-ingress-nginx-controller", UID:"97492914-626f-4f0d-bc6b-33b643803fdd", APIVersion:"v1", ResourceVersion:"8671", FieldPath:""}): type: 'Warning' reason: 'SyncLoadBalancerFailed' Error syncing load balancer: failed to ensure load balancer: could not find network
The controller code throws the same error:
// associatePublicIPAddress associates a new IP and sets the address and its ID.
func (lb *loadBalancer) associatePublicIPAddress() error {
klog.V(4).Infof("Allocate new IP for load balancer: %v", lb.name)
// If a network belongs to a VPC, the IP address needs to be associated with
// the VPC instead of with the network.
network, count, err := lb.Network.GetNetworkByID(lb.networkID, cloudstack.WithProject(lb.projectID))
if err != nil {
if count == 0 {
return fmt.Errorf("could not find network %v", lb.networkID)
}
return fmt.Errorf("error retrieving network: %v", err)
}
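Note that the code does pass WithProject(lb.projectID); if I'm reading the cloudstack-go SDK right, that option is a no-op when the project ID is empty, so without a project id the lookup silently stays account-scoped. Here's a minimal sketch to check the visibility difference from outside the cluster, using the same SDK (the endpoint, keys, and project UUID are placeholders):

package main

import (
	"fmt"
	"log"

	"github.com/apache/cloudstack-go/v2/cloudstack"
)

func main() {
	// Placeholder endpoint and keys for the project's kubeadmin user.
	cs := cloudstack.NewAsyncClient("https://acs.example.com/client/api", "API_KEY", "SECRET_KEY", true)

	// Account-scoped call: for a project-only kubeadmin user this comes
	// back empty, matching the CMK session above.
	p := cs.Network.NewListNetworksParams()
	resp, err := cs.Network.ListNetworks(p)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("without projectid: %d networks\n", resp.Count)

	// Project-scoped call: the same user should now see the cluster's network.
	p.SetProjectid("PROJECT_UUID") // placeholder project uuid
	resp, err = cs.Network.ListNetworks(p)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("with projectid: %d networks\n", resp.Count)
}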
A little more RCA here:
When a project is created, a "PrjAcct-" account is also created and given ownership of the project. This account is created with the "RO Admin" role.
//Create an account associated with the project
StringBuilder acctNm = new StringBuilder("PrjAcct-");
acctNm.append(name).append("-").append(ownerFinal.getDomainId());
Account projectAccount = _accountMgr.createAccount(acctNm.toString(), Account.Type.PROJECT, null, domainId, null, null, UUID.randomUUID().toString());
Project project = _projectDao.persist(new ProjectVO(name, displayText, ownerFinal.getDomainId(), projectAccount.getId()));
//assign owner to the project
assignAccountToProject(project, ownerFinal.getId(), ProjectAccount.Role.Admin,
Optional.ofNullable(finalUser).map(User::getId).orElse(null), null);
if (project != null) {
CallContext.current().setEventDetails("Project id=" + project.getId());
CallContext.current().putContextParameter(Project.class, project.getUuid());
}
Then a subsequent account is created, added to the project, and set as "Domain Admin". Within that account is the kubeadmin user. This user makes the API calls to set up the nginx ingress controller as shown in the comments above. Some of those API calls return an empty response, namely listNetworks.
Changing the "PrjAcct-" account's role from "RO Admin" to "Domain Admin" somehow gives the kubeadmin user the access it needs to "see" the resources and interact with them.
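As a stopgap, that role change can be made through the API; updateAccount accepts a roleid parameter, e.g. from CMK (both uuids are placeholders):
(admin) 🐱 > update account id=<PrjAcct account uuid> roleid=<Domain Admin role uuid>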
So there seems to be something wrong with project accounts accessing the resources that are owned by the "PrjAcct-" RO Admin account, or the account is being created with insufficient access.
Things we noted while troubleshooting:
assignToLoadBalancerRule
associateIpAddress
deleteFirewallRule
deleteLoadBalancerRule
disassociateIpAddress
listFirewallRules
listLoadBalancerRules
listNetworks
listVirtualMachines
queryAsyncJobResult
@davidjumani wondering if you've had any additional time to investigate
@davidjumani are you working on this? If not, I will have a look
Hi @aron-ac Sorry for the delay. I'll have a look at this. It appears as though it could be an issue with the cloud provider.
Reference
@davidjumani
It seems project-id is not set in cloud-config (see https://github.com/apache/cloudstack-kubernetes-provider#kubernetes).
The script plugins/integrations/kubernetes-service/src/main/resources/script/deploy-cloudstack-secret does not support projectid.
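For reference, the README linked above documents a project-id key in the provider's cloud-config, so the secret for a project cluster would presumably need something along these lines (all values are placeholders):

[Global]
api-url    = https://acs.example.com/client/api
api-key    = <kubeadmin api key>
secret-key = <kubeadmin secret key>
project-id = <project uuid>
zone       = <zone name>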
Hi @aron-ac I've created a fix for the issue. Thanks @weizhouapache for identifying the problem!
Thanks all!
ISSUE TYPE
Bug Report
COMPONENT NAME
Kubernetes Service (CKS)
CLOUDSTACK VERSION
4.17.1.0
CONFIGURATION
OS / ENVIRONMENT
kubernetes 1.24
SUMMARY
It appears https://github.com/apache/cloudstack/issues/6344 started to address the issue of project-based users and kubernetes, but looking at the fix, it seems to have only been applied to autoscaling.
Inside a project, as a domain admin, kubeadmin does not work from inside the k8s cluster via the kube.conf using kubectl. I cannot acquire a new IP address for an ingress controller, and presumably cannot complete any tasks as kubeadmin, because looking at the CloudStack events, kubeadmin is never called.
STEPS TO REPRODUCE
1. I created a k8s cluster as an admin account and, from inside the k8s cluster, created an nginx ingress controller. There was no issue.
2. I then created a project, created a domain admin user for that project, deployed another k8s cluster, and attempted to deploy an nginx ingress controller and a traefik ingress, but the external IP stayed in a pending state.
3. Looking in the CloudStack events, I noticed that with the normal account kubeadmin successfully acquired the new public IP for a load balancer in CloudStack (the nginx ingress in k8s), but with the project account's domain admin, kubeadmin was never recognized.
EXPECTED RESULTS
As a project user deploying a k8s cluster, I should still be able to use kubectl, and the CloudStack kubeadmin integration should work as it does for a regular account.
ACTUAL RESULTS
The external IP for the ingress controller stays in a <pending> state, and kubeadmin is never called in the CloudStack events.