zavertiaev opened 4 weeks ago
Can you provide information about the environment difference between the tolerated nodes (node-role/test) and the other nodes in the cluster?
There are two different node pools in Karpenter.

Nodes without taints:

NodePool

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default-node-pool
spec:
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - t3
            - t3a
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values:
            - xlarge
            - 2xlarge
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 168h
```
EC2NodeClass

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  instanceProfile: profile
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: cluster_name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: cluster_name
  tags:
    karpenter.sh/discovery: cluster_name
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
```
Tainted nodes:

NodePool

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: node-pool-public
spec:
  template:
    metadata:
      labels:
        test: "true"
    spec:
      taints:
        - key: node-role/test
          value: "true"
          effect: NoSchedule
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: public
      requirements:
        - key: test
          operator: Exists
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - c6i
            - c6a
            - c7i
            - c7a
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values:
            - 2xlarge
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 168h
```
EC2NodeClass

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: public
spec:
  amiFamily: AL2
  instanceProfile: profile
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: cluster_name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: cluster_name
  tags:
    karpenter.sh/discovery: cluster_name
  associatePublicIPAddress: true
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
```
Hmm, I'm not seeing anything in there that suggests the CSI driver would fail to work on the tainted nodes.
Are you able to reproduce this on two completely identical node pools (with the only diff being the taint)?
Did the CSI Driver and the AWS provider deploy to the tainted nodes successfully?
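For instance, a copy of default-node-pool that differs only in the taint would isolate the variable (the pool name below is illustrative):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default-node-pool-tainted   # illustrative name for the repro pool
spec:
  template:
    spec:
      # Identical to default-node-pool except for this taint
      taints:
        - key: node-role/test
          value: "true"
          effect: NoSchedule
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default               # reuse the same EC2NodeClass
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - t3
            - t3a
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values:
            - xlarge
            - 2xlarge
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 168h
```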
I completely forgot that, besides tolerations, I also have affinity (I have corrected the first post and the title). The problem is specifically due to the affinity, not the tolerations.
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: test
              operator: In
              values:
                - "true"
```
The CSI Driver and the AWS provider are deployed on the tainted nodes. The logs in the first post are from the AWS provider. Here are the full logs; I don't see any errors:
CSI Driver / node-driver-registrar container:
```
I0725 04:36:33.015395 1 main.go:135] Version: v2.10.0
I0725 04:36:33.015453 1 main.go:136] Running node-driver-registrar in mode=
I0725 04:36:33.015459 1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0725 04:36:33.015474 1 connection.go:215] Connecting to unix:///csi/csi.sock
I0725 04:36:36.011223 1 main.go:164] Calling CSI driver to discover driver name
I0725 04:36:36.011249 1 connection.go:244] GRPC call: /csi.v1.Identity/GetPluginInfo
I0725 04:36:36.011254 1 connection.go:245] GRPC request: {}
I0725 04:36:36.013031 1 connection.go:251] GRPC response: {"name":"secrets-store.csi.k8s.io","vendor_version":"v1.4.3"}
I0725 04:36:36.013046 1 connection.go:252] GRPC error: <nil>
I0725 04:36:36.013055 1 main.go:173] CSI driver name: "secrets-store.csi.k8s.io"
I0725 04:36:36.013074 1 node_register.go:55] Starting Registration Server at: /registration/secrets-store.csi.k8s.io-reg.sock
I0725 04:36:36.013221 1 node_register.go:64] Registration Server started at: /registration/secrets-store.csi.k8s.io-reg.sock
I0725 04:36:36.013275 1 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
I0725 04:36:36.027626 1 main.go:90] Received GetInfo call: &InfoRequest{}
I0725 04:36:36.050881 1 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
```
CSI Driver / secrets-store container:
```
I0725 04:36:34.866715 1 exporter.go:35] "initializing metrics backend" backend="prometheus"
I0725 04:36:34.963880 1 main.go:195] "starting manager\n"
I0725 04:36:35.064146 1 secrets-store.go:46] "Initializing Secrets Store CSI Driver" driver="secrets-store.csi.k8s.io" version="v1.4.3" buildTime="2024-04-17-17:59"
I0725 04:36:35.066446 1 server.go:126] "Listening for connections" address="//csi/csi.sock"
I0725 04:36:36.029150 1 nodeserver.go:359] "node: getting default node info\n"
```
CSI Driver / liveness-probe container:
```
I0725 04:36:35.666342 1 main.go:133] "Calling CSI driver to discover driver name"
I0725 04:36:35.760593 1 main.go:141] "CSI driver name" driver="secrets-store.csi.k8s.io"
I0725 04:36:35.760629 1 main.go:170] "ServeMux listening" address="0.0.0.0:9808"
```
AWS provider:
```
I0725 04:36:35.062903 1 main.go:34] Starting secrets-store-csi-driver-provider-aws version 1.0.r2-72-gfb78a36-2024.05.29.23.03
I0725 04:36:35.159133 1 main.go:82] Listening for connections on address: /etc/kubernetes/secrets-store-csi-providers/aws.sock
```
I have a simple deployment, and I need to mount secrets from AWS SM. The deployment has tolerations and affinity. When attempting to retrieve the secrets, I receive the following error:
```
MountVolume.SetUp failed for volume "secrets" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
```
secrets-store-csi-driver-provider-aws pod logs:

secrets-store-csi-driver and secrets-store-csi-driver-provider-aws are installed via Helm with default values and the necessary tolerations. The pods of both daemonsets are running on the required node. When I run the deployment without affinity (the pods are then scheduled on a different node), it mounts correctly, which gives me reason to believe that the service account, ARN, and other settings are configured correctly.
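For reference, the mount is wired the standard way for this driver. A minimal sketch of the SecretProviderClass, with placeholder names (aws-secrets and my-app-secret are illustrative, not my actual values):

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: aws-secrets                     # placeholder name
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "my-app-secret"     # placeholder Secrets Manager secret name
        objectType: "secretsmanager"
```

And the corresponding volume in the deployment's pod template (the volume name matches the one in the error message):

```yaml
volumes:
  - name: secrets
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: aws-secrets   # placeholder name from above
```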
It seems that this bug has already been noticed in #299. Any ideas, please?