cowboysysop / charts

VPA Helm Chart: Updater error: fail to get pod controller, node is not a valid owner #656

Closed sarg3nt closed 1 month ago

sarg3nt commented 2 months ago

I'm deploying the latest chart with no configuration changes and am finding that the kube-system/vertical-pod-autoscaler-updater is throwing the following errors:

vertical-pod-autoscaler-updater-bdcd45465-qgdh4 E0516 17:03:33.841287       1 api.go:153] fail to get pod controller: pod=kube-proxy-lpul-vault-k8s-server-0.vault.ad.selinc.com err=Unhandled targetRef v1 / Node / lpul-vault-k8s-server-0.vault.ad.selinc.com, last error node is not a valid owner
vertical-pod-autoscaler-updater-bdcd45465-qgdh4 E0516 17:03:33.841304       1 api.go:153] fail to get pod controller: pod=cloud-controller-manager-lpul-vault-k8s-server-2.vault.ad.selinc.com err=Unhandled targetRef v1 / Node / lpul-vault-k8s-server-2.vault.ad.selinc.com, last error node is not a valid owner
vertical-pod-autoscaler-updater-bdcd45465-qgdh4 E0516 17:03:33.841316       1 api.go:153] fail to get pod controller: pod=kube-apiserver-lpul-vault-k8s-server-0.vault.ad.selinc.com err=Unhandled targetRef v1 / Node / lpul-vault-k8s-server-0.vault.ad.selinc.com, last error node is not a valid owner
vertical-pod-autoscaler-updater-bdcd45465-qgdh4 E0516 17:03:33.841325       1 api.go:153] fail to get pod controller: pod=kube-apiserver-lpul-vault-k8s-server-2.vault.ad.selinc.com err=Unhandled targetRef v1 / Node / lpul-vault-k8s-server-2.vault.ad.selinc.com, last error node is not a valid owner 
vertical-pod-autoscaler-updater-bdcd45465-qgdh4 E0516 17:03:33.841331       1 api.go:153] fail to get pod controller: pod=kube-proxy-lpul-vault-k8s-server-2.vault.ad.selinc.com err=Unhandled targetRef v1 / Node / lpul-vault-k8s-server-2.vault.ad.selinc.com, last error node is not a valid owner 
etc.

The only ones that seem to work are in the kube-system namespace:

vertical-pod-autoscaler-updater-bdcd45465-qgdh4 I0516 17:03:33.841558       1 pods_eviction_restriction.go:226] too few replicas for ReplicaSet kube-system/rke2-snapshot-controller-59cc9cd8f4. Found 1 live pods, needs 2 (global 2) 
vertical-pod-autoscaler-updater-bdcd45465-qgdh4 I0516 17:03:33.841585       1 pods_eviction_restriction.go:226] too few replicas for ReplicaSet kube-system/rke2-snapshot-validation-webhook-54c5989b65. Found 1 live pods, needs 2 (global 2) 
vertical-pod-autoscaler-updater-bdcd45465-qgdh4 I0516 17:03:33.841604       1 pods_eviction_restriction.go:226] too few replicas for ReplicaSet kube-system/rke2-metrics-server-655477f655. Found 1 live pods, needs 2 (global 2)        

I have Terraform that deploys the raw files and that works fine, but I'd like to switch to your Helm chart, which is not working. I tried comparing the ClusterRoles the raw files deploy with those from the Helm chart, but they are so different that the comparison is difficult.

In any case, this does not appear to work for us. Maybe it's the Kubernetes version?

Specs:

sebastien-prudhomme commented 2 months ago

Hi @sarg3nt, it seems this is related to a bug in the latest version of the app: https://github.com/kubernetes/autoscaler/issues/6808. Can you try the chart at version 9.7.0?
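For reference, pinning the chart version should look something like the commands below (the repo URL and chart name here are my assumption of the usual setup for this repository; adjust the release name and namespace to your environment):

helm repo add cowboysysop https://cowboysysop.github.io/charts/
helm repo update
helm upgrade --install vertical-pod-autoscaler cowboysysop/vertical-pod-autoscaler \
  --namespace kube-system --version 9.7.0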

sebastien-prudhomme commented 1 month ago

It should be fixed by #657

sarg3nt commented 1 month ago

I tried the new version and am getting the same error. I confirmed the updater is now at 1.1.2: autoscaling/vpa-updater:1.1.2

E0517 23:33:11.844076       1 api.go:153] fail to get pod controller: pod=cloud-controller-manager-lpul-vault-k8s-server-1.vault.ad.selinc.com err=Unhandled targetRef v1 / Node / lpul-vault-k8s-server-1.vault.ad.selinc.com, last error node is not a valid owner
sarg3nt commented 1 month ago

Update: I noticed my custom deployment gives me those errors for the kube-system static pods as well, so I think that is normal; it kind of makes sense, since static pods are owned by the Node object rather than a workload controller. However, the Helm chart deployment is only trying to update things in the kube-system namespace, whereas my custom deployment updates everything.
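You can check this directly on one of the static pods from the log above; its ownerReference should be the Node, which matches the "targetRef v1 / Node" in the error:

kubectl -n kube-system get pod kube-proxy-lpul-vault-k8s-server-0.vault.ad.selinc.com \
  -o jsonpath='{.metadata.ownerReferences[*].kind}'
# expected to print Node for kubelet-managed static (mirror) pods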

I'm not seeing a config option that would limit it to the kube-system namespace. Am I missing something?
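For comparison, I'm looking at the arguments the chart passes to the updater (deployment name inferred from the pod name in the logs above); if something restricts the watched namespace it should show up there, e.g. a flag like --vpa-object-namespace:

kubectl -n kube-system get deployment vertical-pod-autoscaler-updater \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
# compare against the args of the working custom deployment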

Also, when I deploy the Helm chart with Terraform, the VPA resources fail to deploy. It's as if the Helm chart finishes the install but the CRDs are not quite ready yet. When I install my custom version this doesn't happen. I have the same Terraform depends_on logic in place, so I'm not sure why it's doing this.
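As a workaround I'm considering gating the VPA resource creation on the CRDs being established first; a rough, untested sketch using the standard VPA CRD names:

kubectl wait --for condition=established --timeout=120s \
  crd/verticalpodautoscalers.autoscaling.k8s.io \
  crd/verticalpodautoscalercheckpoints.autoscaling.k8s.io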

One more question: how does the chart handle certificate renewal? Does it renew the certs automatically on chart upgrade, or will they eventually expire?