Closed: EdwardCooke closed this issue 1 week ago
Thanks for logging this issue @EdwardCooke. Can you tell us more about how you're doing the OIDC integration? What tool are you using?
I’m using the declarative way of doing authentication with a kubeadm cluster. The provider is Azure. What do you mean by what tool I'm using? It's a fresh kubeadm cluster with a single control plane node. Add Cilium. Add additional control plane nodes. The additional ones fail to auth using OIDC with the error logged above. If you restart the Kubernetes API server on the first control plane node, it then starts failing as well.
I’ll post the auth file when I get a chance.
Presumably kubeadm is configuring the apiserver flags for you per https://kubernetes.io/docs/reference/access-authn-authz/authentication/ ? (I wasn't aware this was even a thing until just now).
I suspect this scenario isn't well tested (or at all) by Cilium's CI, so it makes sense it might not be working.
Is there any chance you can generate a sysdump using https://docs.cilium.io/en/stable/operations/troubleshooting/#automatic-log-state-collection ? Ideally, you could use an admin credential to generate that sysdump and add it to this issue, then someone will have the information they need to take a look.
Otherwise, I'd encourage you to check the packet flow from the apiserver to the OIDC provider. Are there any network policies that may be capturing that traffic? Does the apiserver have access to the greater internet (since it will need it to connect directly to the OIDC provider)?
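For reference, since the kubeadm apiserver runs in the host network, the kind of Cilium policy that could affect that traffic is a clusterwide policy with a nodeSelector (a host policy), and only if the host firewall feature is enabled at all. A rough, hypothetical shape to grep for (the name and selector below are illustrative, not something I expect you to have):

# Hypothetical example only: a host-scoped policy of roughly this shape puts the
# selected node into default-deny for egress once any egress rule matches, which
# could block the apiserver's calls out to the OIDC provider.
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: example-host-policy   # illustrative name
spec:
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/control-plane: ""
  egress:
    - toEntities:
        - cluster   # anything not matched by an egress rule is then denied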
There isn't anything in the way of the traffic. Up until installing Cilium it works fine. It's only after installing Cilium that OIDC stops working. I'll generate that sysdump shortly.
🔍 Collecting sysdump with cilium-cli version: v0.16.18, args: [sysdump]
🔮 Detected Cilium installation in namespace: "kube-system"
🔮 Detected Cilium operator in namespace: "kube-system"
ℹ️ Using default Cilium Helm release name: "cilium"
ℹ️ Failed to detect Cilium SPIRE installation - using Cilium namespace as Cilium SPIRE namespace: "kube-system"
🔍 Collecting Kubernetes nodes
🔮 Detected Cilium features: map[bpf-lb-external-clusterip:Disabled cidr-match-nodes:Disabled clustermesh-enable-endpoint-sync:Disabled cni-chaining:Disabled:none enable-bgp-control-plane:Disabled enable-envoy-config:Disabled enable-gateway-api:Disabled enable-ipsec:Disabled enable-ipv4-egress-gateway:Disabled enable-local-redirect-policy:Disabled endpoint-routes:Disabled ingress-controller:Disabled ipam:Disabled:cluster-pool ipv4:Enabled ipv6:Disabled mutual-auth-spiffe:Disabled wireguard-encapsulate:Disabled]
🔍 Collecting tracing data from Cilium pods
🔍 Collect Kubernetes nodes
🔍 Collecting Kubernetes events
🔍 Collect Kubernetes version
🔍 Collecting Kubernetes pods
🔍 Collecting Kubernetes namespaces
🔍 Collecting Kubernetes services
🔍 Collecting Kubernetes pods summary
🔍 Collecting Kubernetes endpoints
🔍 Collecting Kubernetes network policies
🔍 Collecting Kubernetes metrics
🔍 Collecting Kubernetes leases
🔍 Collecting Cilium cluster-wide network policies
🔍 Collecting Cilium network policies
🔍 Collecting Cilium Egress Gateway policies
🔍 Collecting Cilium egress NAT policies
🔍 Collecting Cilium local redirect policies
🔍 Collecting Cilium CIDR Groups
🔍 Collecting Cilium endpoint slices
🔍 Collecting Cilium endpoints
🔍 Collecting Cilium nodes
🔍 Collecting Cilium identities
🔍 Collecting Ingresses
🔍 Collecting Cilium Node Configs
🔍 Collecting Cilium BGP Peering Policies
🔍 Collecting IngressClasses
🔍 Collecting Cilium Pod IP Pools
🔍 Collecting Cilium LoadBalancer IP Pools
🔍 Checking if cilium-etcd-secrets exists in kube-system namespace
🔍 Collecting the Cilium configuration
🔍 Collecting the Hubble Relay configuration
🔍 Collecting the Cilium daemonset(s)
🔍 Collecting the Hubble daemonset
🔍 Collecting the Hubble Relay deployment
🔍 Collecting the Hubble UI deployment
🔍 Collecting the Cilium Envoy configuration
🔍 Collecting the Cilium Node Init daemonset
🔍 Collecting the Cilium Envoy daemonset
🔍 Collecting the Hubble generate certs cronjob
W0930 10:42:23.850699 13700 warnings.go:70] cilium.io/v2alpha1 CiliumNodeConfig will be deprecated in cilium v1.16; use cilium.io/v2 CiliumNodeConfig
🔍 Collecting the Hubble cert-manager certificates
🔍 Collecting the Hubble generate certs pod logs
🔍 Collecting the Cilium operator metrics
🔍 Collecting the Cilium operator deployment
🔍 Collecting the clustermesh debug information, metrics and gops stats
⚠️ cronjob "hubble-generate-certs" not found in namespace "kube-system" - this is expected if auto TLS is not enabled or if not using hubble.auto.tls.method=cronjob
⚠️ Deployment "hubble-ui" not found in namespace "kube-system" - this is expected if Hubble UI is not enabled
⚠️ Deployment "hubble-relay" not found in namespace "kube-system" - this is expected if Hubble is not enabled
🔍 Collecting gops stats from Hubble Relay pods
🔍 Collecting profiling data from Cilium pods
🔍 Collecting logs from Cilium pods
⚠️ Daemonset "cilium-node-init" not found in namespace "kube-system" - this is expected if Node Init DaemonSet is not enabled
🔍 Collecting the 'clustermesh-apiserver' deployment
🔍 Collecting the CNI configuration files from Cilium pods
🔍 Collecting the CNI configmap
🔍 Collecting gops stats from Cilium pods
🔍 Collecting gops stats from Cilium-operator pods
🔍 Collecting gops stats from Hubble pods
🔍 Collecting bugtool output from Cilium pods
🔍 Collecting logs from Cilium Envoy pods
🔍 Collecting logs from Cilium Node Init pods
🔍 Collecting logs from Cilium operator pods
🔍 Collecting logs from 'clustermesh-apiserver' pods
⚠️ Deployment "clustermesh-apiserver" not found in namespace "kube-system" - this is expected if 'clustermesh-apiserver' isn't enabled
🔍 Collecting logs from Hubble pods
🔍 Collecting logs from Hubble Relay pods
Secret "cilium-etcd-secrets" not found in namespace "kube-system" - this is expected when using the CRD KVStore
🔍 Collecting logs from Hubble UI pods
I0930 10:42:25.001594 13700 request.go:697] Waited for 1.16111413s due to client-side throttling, not priority and fairness, request: GET:https://kube1-cp.cookes.io:6443/api/v1/namespaces/kube-system/configmaps/hubble-relay-config
🔍 Collecting platform-specific data
🔍 Collecting kvstore data
🔍 Collecting Cilium external workloads
🔍 Collecting Hubble flows from Cilium pods
🔍 Collecting logs from Tetragon pods
🔍 Collecting logs from Tetragon operator pods
🔍 Collecting bugtool output from Tetragon pods
🔍 Collecting Tetragon configmap
🔍 Collecting Tetragon PodInfo custom resources
🔍 Collecting Tetragon tracing policies
🔍 Collecting Tetragon namespaced tracing policies
🔍 Collecting Helm metadata from the release
🔍 Collecting Helm values from the release
I0930 10:43:00.672027 13700 request.go:697] Waited for 1.002747065s due to client-side throttling, not priority and fairness, request: GET:https://kube1-cp.cookes.io:6443/api/v1/namespaces/kube-system/pods/cilium-tlslg/log?container=config&limitBytes=1073741824&sinceTime=2023-10-01T10%3A42%3A23Z&timestamps=true
⚠️ The following tasks failed, the sysdump may be incomplete:
⚠️ [13] Collecting Cilium egress NAT policies: failed to collect Cilium egress NAT policies: the server could not find the requested resource
⚠️ [14] Collecting Cilium Egress Gateway policies: failed to collect Cilium Egress Gateway policies: the server could not find the requested resource (get ciliumegressgatewaypolicies.cilium.io)
⚠️ [16] Collecting Cilium local redirect policies: failed to collect Cilium local redirect policies: the server could not find the requested resource (get ciliumlocalredirectpolicies.cilium.io)
⚠️ [18] Collecting Cilium endpoint slices: failed to collect Cilium endpoint slices: the server could not find the requested resource (get ciliumendpointslices.cilium.io)
⚠️ [24] Collecting Cilium BGP Peering Policies: failed to collect Cilium BGP Peering policies: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)
⚠️ [34] Collecting the Hubble Relay configuration: failed to collect the Hubble Relay configuration: configmaps "hubble-relay-config" not found
⚠️ [39] Collecting the Hubble cert-manager certificates: failed to collect certificates (v1): the server could not find the requested resource
⚠️ [68] Collecting Tetragon PodInfo custom resources: failed to collect podinfo (v1alpha1): the server could not find the requested resource
⚠️ [69] Collecting Tetragon tracing policies: failed to collect tracingpolicies (v1alpha1): the server could not find the requested resource
⚠️ [70] Collecting Tetragon namespaced tracing policies: failed to collect tracingpoliciesnamespaced (v1alpha1): the server could not find the requested resource
⚠️ Please note that depending on your Cilium version and installation options, this may be expected
🗳 Compiling sysdump
✅ The sysdump has been saved to cilium-sysdump-20240930-104223.zip
To do it with kubeadm, you specify the additional API server arguments. I'm using the declarative method, where you put the authentication settings in a config file and reference that. Here's what I use.
Kubeadm cluster config:

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "kube1-cp.cookes.io:6443"
controllerManager:
  extraArgs:
    node-cidr-mask-size: "24"
    # CIS 1.3.1
    terminated-pod-gc-threshold: "10"
    # TODO: Remove this so it goes back to the 1 year default, this is to test cert rotation/expiration
    cluster-signing-duration: "0h10m0s"
    # CIS 1.3.2
    profiling: "FALSE"
    # STIG V-242378
    tls-min-version: VersionTLS13
networking:
  serviceSubnet: "10.96.0.0/16"
  podSubnet: "10.244.0.0/16"
  dnsDomain: "cluster.local"
apiServer:
  extraArgs:
    # CIS 1.2.16
    # STIG V-242465
    # STIG V-242402
    audit-log-path: /var/log/apiserver/audit.log
    # CIS 1.2.17
    # STIG V-242464
    audit-log-maxage: "30"
    # CIS 1.2.18
    # STIG V-242463
    audit-log-maxbackup: "10"
    # CIS 1.2.19
    # STIG V-242462
    audit-log-maxsize: "100"
    # CIS 1.2.29
    tls-cipher-suites: "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA"
    # CIS 3.2.1
    # STIG V-242461
    audit-policy-file: "/etc/kubernetes/config/audit-policy.yaml"
    # CIS 1.2.6, 1.2.7, 1.2.8
    # STIG V-242382
    authorization-mode: Node,RBAC
    # CIS 3.1.1, 3.1.2, 3.1.3
    authentication-config: "/etc/kubernetes/config/kube-api-authn.yaml"
    # CIS 1.2.11
    enable-admission-plugins: AlwaysPullImages,NodeRestriction,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,DefaultIngressClass,DefaultStorageClass,DefaultTolerationSeconds,LimitRanger,MutatingAdmissionWebhook,NamespaceLifecycle,PersistentVolumeClaimResize,PodSecurity,Priority,ResourceQuota,RuntimeClass,ServiceAccount,StorageObjectInUseProtection,TaintNodesByCondition,ValidatingAdmissionPolicy,ValidatingAdmissionWebhook
    # CIS 1.2.27
    encryption-provider-config: "/etc/kubernetes/config/encryption.yaml"
    encryption-provider-config-automatic-reload: "true"
    # CIS 1.2.5
    kubelet-certificate-authority: "/etc/kubernetes/pki/ca.crt"
    # CIS 1.2.15
    profiling: "FALSE"
    # STIG V-254800
    admission-control-config-file: "/etc/kubernetes/config/admission-configuration.yaml"
    # STIG V-242378
    tls-min-version: VersionTLS13
    service-account-issuer: "https://kube1.cookes.io"
  certSANs:
    - "kube1-cp.cookes.io"
  extraVolumes:
    - name: auth
      hostPath: "/etc/kubernetes/config/kube-api-authn.yaml"
      mountPath: "/etc/kubernetes/config/kube-api-authn.yaml"
      readOnly: true
      pathType: File
    - name: encryption-config
      hostPath: "/etc/kubernetes/config/encryption.yaml"
      mountPath: "/etc/kubernetes/config/encryption.yaml"
      readOnly: true
      pathType: File
    - name: audit-policy
      hostPath: "/etc/kubernetes/config/audit-policy.yaml"
      mountPath: "/etc/kubernetes/config/audit-policy.yaml"
      readOnly: true
      pathType: File
    - name: audit-log
      hostPath: /var/log/apiserver
      mountPath: /var/log/apiserver
      readOnly: false
      pathType: DirectoryOrCreate
    - name: admission-configuration
      hostPath: "/etc/kubernetes/config/admission-configuration.yaml"
      mountPath: "/etc/kubernetes/config/admission-configuration.yaml"
      readOnly: true
      pathType: File
  timeoutForControlPlane: 4m0s
scheduler:
  extraArgs:
    authentication-tolerate-lookup-failure: "false"
    # CIS 1.4.1
    profiling: "FALSE"
    # STIG V-242377
    tls-min-version: VersionTLS13
clusterName: "kube1"
etcd:
  local:
    extraArgs:
      # STIG V-242380
      peer-auto-tls: "false"
      # STIG V-242379
      auto-tls: "false"
Init config:
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: "external"
    node-ip: "10.2.0.20"
patches:
  directory: /etc/kubernetes/config/patches
And authn config:
apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
  - issuer:
      url: https://login.microsoftonline.com/426bbc16-09cf-4a19-80c2-d5b19c6c4b72/v2.0
      audiences:
        - f6a6e027-18d6-431f-a310-5a1b9b09942d
    claimMappings:
      # username represents an option for the username attribute.
      # This is the only required attribute.
      username:
        claim: "upn"
        prefix: "oidc:"
      groups:
        claim: "roles"
        prefix: "oidc:"
        # Mutually exclusive with groups.claim and groups.prefix.
        # expression is a CEL expression that evaluates to a string or a list of strings.
        # expression: 'claims.roles.split(",")'
      # uid represents an option for the uid attribute.
      uid:
        claim: 'oid'
Just finally figured out what happened, and it was a mistake on my part. My network is in the 10.0.0.0/8 range, which is Cilium's default cluster-pool pod CIDR. I was under the impression that Cilium would use the pod CIDR assigned when I did the kubeadm init, which it did not. As soon as I changed my Cilium values to the ones below and rebuilt the cluster (it's fresh and empty), everything worked fine:
ipam:
  mode: cluster-pool
  operator:
    clusterPoolIPv4PodCIDRList:
      - 10.96.0.0/16
    clusterPoolIPv4MaskSize: 24
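As a sketch of an alternative I didn't end up using (so treat it as untested here): Cilium's Kubernetes IPAM mode would make it follow the per-node PodCIDRs that kubeadm/kube-controller-manager allocate from the 10.244.0.0/16 podSubnet, instead of managing its own cluster pool.

# Alternative sketch (not the values I used): read each node's PodCIDR from the
# Kubernetes Node object rather than using Cilium's cluster-pool default.
ipam:
  mode: kubernetes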
Is there an existing issue for this?
Version
equal or higher than v1.16.0 and lower than v1.17.0
What happened?
When creating a new cluster and installing Cilium, OIDC login to the cluster fails.
This error is logged in the kube-apiserver pod:
How can we reproduce the issue?
Create a cluster with kubeadm init and OIDC authentication. Install Cilium through Helm with the following values. kubeProxyReplacement: false is more reliable than when it is enabled, but OIDC still fails.
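The full values file isn't reproduced here; as a rough illustrative sketch, only kubeProxyReplacement comes from the description above, and the ipam block just spells out Cilium's cluster-pool default, which is what collides with the 10.0.0.0/8 network mentioned earlier in this thread:

# Illustrative sketch only, not the exact values used for this report.
kubeProxyReplacement: false
ipam:
  mode: cluster-pool
  operator:
    clusterPoolIPv4PodCIDRList:
      - 10.0.0.0/8   # Cilium's default pool, which overlaps my 10.0.0.0/8 network
    clusterPoolIPv4MaskSize: 24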
Cilium Version
cilium-cli: v0.16.15 compiled with go1.22.5 on linux/amd64
cilium image (default): v1.16.0
cilium image (stable): v1.16.1
cilium image (running): 1.17.0-dev
Kernel Version
Linux kube1-cp-01 6.8.0-44-generic #44-Ubuntu SMP PREEMPT_DYNAMIC Tue Aug 13 13:35:26 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
1.31
Regression
No response
Sysdump
cilium-sysdump-20240916-225601.zip
Relevant log output
Anything else?
When using Calico, everything works as expected; I tried it to make sure something wasn't wrong with my network or cluster configuration.