awslabs / mountpoint-s3-csi-driver

Built on Mountpoint for Amazon S3, the Mountpoint CSI driver presents an Amazon S3 bucket as a storage volume accessible by containers in your Kubernetes cluster.

Bottlerocket AMI mounting fail event in pod #168

Open hanselblack opened 3 months ago

hanselblack commented 3 months ago

/kind bug

What happened? When using the Bottlerocket AMI with a Karpenter NodeClass, describing the pod shows the following event:

 Warning  FailedMount       3m54s (x7 over 4m26s)  kubelet            MountVolume.MountDevice failed for volume "3416296-pv" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name s3.csi.aws.com not found in the list of registered CSI drivers
kubectl describe csidrivers.storage.k8s.io/s3.csi.aws.com

Name:         s3.csi.aws.com
Namespace:    
Labels:       app.kubernetes.io/component=csi-driver
              app.kubernetes.io/instance=aws-mountpoint-s3-csi-driver
              app.kubernetes.io/managed-by=EKS
              app.kubernetes.io/name=aws-mountpoint-s3-csi-driver
Annotations:  <none>
API Version:  storage.k8s.io/v1
Kind:         CSIDriver
Metadata:
  Creation Timestamp:  2024-02-07T02:01:48Z
  Resource Version:    5363335
  UID:                 c7037a7c-edc6-473b-bcab-4c9443cdef7f
Spec:
  Attach Required:     false
  Fs Group Policy:     ReadWriteOnceWithFSType
  Pod Info On Mount:   false
  Requires Republish:  false
  Se Linux Mount:      false
  Storage Capacity:    false
  Volume Lifecycle Modes:
    Persistent
Events:  <none>
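
Note that the CSIDriver object above is cluster-scoped, while the kubelet error is about per-node registration. Checking the registration on the affected node would look something like the sketch below (the node name is a placeholder):

kubectl get csinode <bottlerocket-node-name> -o yaml
# If the node plugin has registered on that node, spec.drivers should contain an entry like:
#   drivers:
#   - name: s3.csi.aws.com
#     nodeID: <instance-id>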

This error does not appear when using the AL2 AMI. However, even with the warning, I am still able to read data from the S3 mountpoint.
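
As a sanity check that the mount is actually serving data, something like the following works (the pod name is a placeholder; /tmp/mount is the mountPath from the manifest further down):

kubectl exec -it <job-pod-name> -- ls /tmp/mount     # bucket contents are listed
kubectl exec -it <job-pod-name> -- df -h /tmp/mount  # the filesystem shown should be the FUSE mount, not the node's root volume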

What you expected to happen? No warning messages.

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?:

Environment

jjkr commented 3 months ago

I am not able to reproduce this with basic mounting on Bottlerocket. Any more logs or information about your configuration would be helpful. I'm interested in how you are actually deploying this and the timing between events. Given that the mount does succeed and is functional, it seems like this could just be a timing issue where the PV tries to mount while the driver is still coming up, but that is speculation.
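
To narrow down the timing, comparing the registrar sidecar logs with the pod's event timestamps could look roughly like this (the DaemonSet label and container name assume a default EKS add-on install and may differ):

# Registration timeline from the node plugin's registrar sidecar
kubectl logs -n kube-system -l app=s3-csi-node -c node-driver-registrar --timestamps
# Pod events for the same window, sorted by time
kubectl get events -n default --field-selector involvedObject.name=<job-pod-name> --sort-by=.lastTimestamp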

hanselblack commented 3 months ago
apiVersion: v1
kind: PersistentVolume
metadata:
  name: xxx-pv
spec:
  capacity:
    storage: 1200Gi
  accessModes:
    - ReadWriteMany
  mountOptions:
    - allow-overwrite
    - region ap-southeast-1
    - max-threads 16
  csi:
    driver: s3.csi.aws.com
    volumeHandle: s3-csi-driver-volume-output
    volumeAttributes:
      bucketName: xxx
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xxx-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 1200Gi
  volumeName: xxx-pv
---
apiVersion: batch/v1
kind: Job
metadata:
  name: xxx-job
  namespace: default
spec:
  template:
    metadata:
      labels:
        app: xxx-job
    spec:
      nodeSelector:
        type: gpu
      containers:
        - name: xxx
          image: # AWS ECR image URI
          imagePullPolicy: Always
          command: ["/bin/sh", "-c"]
          args:
            - cp -r /tmp/mount/xxx /usr/src/app/;
          resources:
            limits:
              memory: 10000Mi
              nvidia.com/gpu: 1
            requests:
              memory: 10000Mi
              cpu: 4000m
              nvidia.com/gpu: 1
          volumeMounts:
            - name: persistent-storage-data
              mountPath: /tmp/mount
      volumes:
        - name: persistent-storage-data
          persistentVolumeClaim:
            claimName: xxx-pvc
      restartPolicy: Never  # required for Jobs (Never or OnFailure)

The above is the manifest for the deployment. The nodes are scaled up through Karpenter, using a NodeClass with spec.amiFamily: Bottlerocket on GPU instances. The driver is installed via the EKS add-on, and the kube-system namespace runs on a Fargate profile.
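
For reference, the node class side of this setup looks roughly like the sketch below (Karpenter v1beta1 API assumed; the name, role, and discovery tags are placeholders):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: bottlerocket-gpu
spec:
  amiFamily: Bottlerocket            # AMI family where the warning appears
  role: <karpenter-node-role>        # placeholder
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: <cluster-name>
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: <cluster-name>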

Yeah, it could be a timing issue. Oddly, I didn't have this issue on AL2.