aws / secrets-store-csi-driver-provider-aws

The AWS provider for the Secrets Store CSI Driver allows you to fetch secrets from AWS Secrets Manager and AWS Systems Manager Parameter Store, and mount them into Kubernetes pods.
Apache License 2.0

Error with affinity #373

zavertiaev commented 4 weeks ago

I have a simple Deployment that needs to mount secrets from AWS Secrets Manager. The Deployment has tolerations and affinity. When the pod attempts to retrieve the secrets, I receive the following error: MountVolume.SetUp failed for volume "secrets" : rpc error: code = DeadlineExceeded desc = context deadline exceeded.

secrets-store-csi-driver-provider-aws pod logs:

I0628 09:56:14.620420       1 server.go:124] Servicing mount request for pod dev-test-77b9f4876d-qv26c in namespace dev using service account dev-secretmanager-sa with region(s) eu-west-1
I0628 09:56:14.624478       1 auth.go:123] Role ARN for dev:dev-secretmanager-sa is arn:aws:iam:::role/dev-secretmanager-role
W0628 09:58:14.595198       1 secrets_manager_provider.go:84] eu-west-1: Failed fetching secret dev/secret: RequestCanceled: request context canceled
caused by: context canceled
E0628 09:58:14.595275       1 server.go:151] Failure getting secret values from provider type secretsmanager: Failed to fetch secret from all regions: dev/secret

secrets-store-csi-driver and secrets-store-csi-driver-provider-aws are installed via Helm with default values plus the necessary tolerations, and the pods of both DaemonSets are running on the required node. When I run the Deployment without the affinity (so its pods land on a different node), the volume mounts correctly, which suggests the service account, role ARN, and other settings are configured correctly.
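
For reference, the tolerations can be supplied to both charts via Helm values roughly like this (a sketch; the linux.tolerations and top-level tolerations paths are assumptions based on the upstream charts' defaults and may differ between chart versions):

# secrets-store-csi-driver values (sketch): tolerations for the Linux DaemonSet
linux:
  tolerations:
    - key: node-role/test
      operator: Equal
      value: "true"
      effect: NoSchedule

# secrets-store-csi-driver-provider-aws values (sketch): top-level tolerations
tolerations:
  - key: node-role/test
    operator: Equal
    value: "true"
    effect: NoSchedule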

Steps to reproduce the behavior:

---
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: test
  namespace: dev
spec:
  provider: aws
  parameters:
    objects: |
        - objectName: "dev/secret"
          objectType: "secretsmanager"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dev-test
  namespace: dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: dev-test
  template:
    metadata:
      labels:
        app.kubernetes.io/name: dev-test
    spec:
      serviceAccountName: dev-secretmanager-sa
      containers:
        - name: test
          image: "alpine"
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: secrets
              mountPath: "/etc/secrets"
              readOnly: true
      volumes:
        - name: secrets
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: "test"
      tolerations:
        - key: node-role/test
          operator: Equal
          value: 'true'
          effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: test
                    operator: In
                    values:
                      - "true"

It seems this bug has already been reported in #299. Any ideas, please?

simonmarty commented 4 days ago

Can you provide information about the environment differences between the tolerated nodes (node-role/test) and the other nodes in the cluster?

zavertiaev commented 4 days ago

There are two different node pools in Karpenter.

Nodes without taints: NodePool

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default-node-pool
spec:
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - t3
            - t3a
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values:
            - xlarge
            - 2xlarge
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 168h

EC2NodeClass

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  instanceProfile: profile
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: cluster_name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: cluster_name
  tags:
    karpenter.sh/discovery: cluster_name
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3

Tainted nodes: NodePool

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: node-pool-public
spec:
  template:
    metadata:
      labels:
        test: "true"
    spec:
      taints:
        - key: node-role/test
          value: "true"
          effect: NoSchedule
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: public
      requirements:
        - key: test
          operator: Exists
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - c6i
            - c6a
            - c7i
            - c7a
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values:
            - 2xlarge
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 168h

EC2NodeClass

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: public
spec:
  amiFamily: AL2
  instanceProfile: profile
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: cluster_name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: cluster_name
  tags:
    karpenter.sh/discovery: cluster_name
  associatePublicIPAddress: true
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3

simonmarty commented 3 days ago

Hmm, I'm not seeing anything in there that suggests the CSI driver would fail to work on the tainted nodes.

Are you able to reproduce this on two completely identical node pools (with the only diff being the taint)?

Did the CSI Driver and the AWS provider deploy to the tainted nodes successfully?

zavertiaev commented 1 day ago

I completely forgot that, besides the tolerations, I also have affinity (I have corrected the first post and the title). The problem is specifically due to the affinity, not the tolerations.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: test
              operator: In
              values:
                - "true"

The CSI Driver and the AWS provider are deployed on the tainted nodes. The logs in the first post are from the AWS provider. Here are the full logs; I don't see any errors:

CSI Driver / node-driver-registrar container:

I0725 04:36:33.015395       1 main.go:135] Version: v2.10.0
I0725 04:36:33.015453       1 main.go:136] Running node-driver-registrar in mode=
I0725 04:36:33.015459       1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0725 04:36:33.015474       1 connection.go:215] Connecting to unix:///csi/csi.sock
I0725 04:36:36.011223       1 main.go:164] Calling CSI driver to discover driver name
I0725 04:36:36.011249       1 connection.go:244] GRPC call: /csi.v1.Identity/GetPluginInfo
I0725 04:36:36.011254       1 connection.go:245] GRPC request: {}
I0725 04:36:36.013031       1 connection.go:251] GRPC response: {"name":"secrets-store.csi.k8s.io","vendor_version":"v1.4.3"}
I0725 04:36:36.013046       1 connection.go:252] GRPC error: <nil>
I0725 04:36:36.013055       1 main.go:173] CSI driver name: "secrets-store.csi.k8s.io"
I0725 04:36:36.013074       1 node_register.go:55] Starting Registration Server at: /registration/secrets-store.csi.k8s.io-reg.sock
I0725 04:36:36.013221       1 node_register.go:64] Registration Server started at: /registration/secrets-store.csi.k8s.io-reg.sock
I0725 04:36:36.013275       1 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
I0725 04:36:36.027626       1 main.go:90] Received GetInfo call: &InfoRequest{}
I0725 04:36:36.050881       1 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

CSI Driver / secrets-store container:

I0725 04:36:34.866715       1 exporter.go:35] "initializing metrics backend" backend="prometheus"
I0725 04:36:34.963880       1 main.go:195] "starting manager\n"
I0725 04:36:35.064146       1 secrets-store.go:46] "Initializing Secrets Store CSI Driver" driver="secrets-store.csi.k8s.io" version="v1.4.3" buildTime="2024-04-17-17:59"
I0725 04:36:35.066446       1 server.go:126] "Listening for connections" address="//csi/csi.sock"
I0725 04:36:36.029150       1 nodeserver.go:359] "node: getting default node info\n"

CSI Driver / liveness-probe container:

I0725 04:36:35.666342       1 main.go:133] "Calling CSI driver to discover driver name"
I0725 04:36:35.760593       1 main.go:141] "CSI driver name" driver="secrets-store.csi.k8s.io"
I0725 04:36:35.760629       1 main.go:170] "ServeMux listening" address="0.0.0.0:9808"

AWS provider:

I0725 04:36:35.062903       1 main.go:34] Starting secrets-store-csi-driver-provider-aws version 1.0.r2-72-gfb78a36-2024.05.29.23.03
I0725 04:36:35.159133       1 main.go:82] Listening for connections on address: /etc/kubernetes/secrets-store-csi-providers/aws.sock