Open hanselblack opened 8 months ago
I am not able to reproduce this with basic mounting on Bottlerocket. Any more logs or information about your configuration would be helpful. I'm interested in how you are actually deploying this and the timing between events. Given that the mount does succeed and is functional, it seems like this could just be a timing issue if the PV is trying to mount while the driver is still coming up, but that is speculation.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: xxx-pv
  namespace: default
spec:
  capacity:
    storage: 1200Gi
  accessModes:
    - ReadWriteMany
  mountOptions:
    - allow-overwrite
    - region ap-southeast-1
    - max-threads 16
  csi:
    driver: s3.csi.aws.com
    volumeHandle: s3-csi-driver-volume-output
    volumeAttributes:
      bucketName: xxx
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xxx-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 1200Gi
  volumeName: xxx-pv
---
apiVersion: batch/v1
kind: Job
metadata:
  name: xxx-job
  namespace: default
spec:
  template:
    metadata:
      labels:
        app: xxx-job
    spec:
      nodeSelector:
        type: gpu
      containers:
        - name: xxx
          image: # AWS ECR image URI
          imagePullPolicy: Always
          command: ["/bin/sh", "-c"]
          args:
            - cp -r /tmp/mount/xxx /usr/src/app/;
          resources:
            limits:
              memory: 10000Mi
              nvidia.com/gpu: 1
            requests:
              memory: 10000Mi
              cpu: 4000m
              nvidia.com/gpu: 1
          volumeMounts:
            - name: persistent-storage-data
              mountPath: /tmp/mount
      volumes:
        - name: persistent-storage-data
          persistentVolumeClaim:
            claimName: xxx-pvc
The above is the manifest for the deployment. The nodes are scaled up through Karpenter, using spec.amiFamily Bottlerocket, and run with GPUs. The driver is installed via the EKS addon, and the kube-system namespace is on a Fargate profile.
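For context, a minimal sketch of the kind of Karpenter node class described above, assuming the karpenter.k8s.aws/v1beta1 EC2NodeClass API; the resource name, IAM role, and discovery tags below are placeholders:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: gpu-bottlerocket              # placeholder name
spec:
  amiFamily: Bottlerocket             # the AMI family used in this report
  role: KarpenterNodeRole-xxx         # placeholder node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: xxx   # placeholder cluster discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: xxx   # placeholder cluster discovery tag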
Yeah, it could be a timing issue. Oddly, I didn't have this issue on AL2.
Hi, sorry for the delay in processing this issue. Are you still facing this problem in v1.7.0?
Yes, this issue is still happening in v1.7.0; however, for me it's happening without Bottlerocket.
Seems like this is the same issue as https://github.com/awslabs/mountpoint-s3-csi-driver/issues/107. Setting node.tolerateAllTaints to true, or node.tolerations to an array of tolerations, should fix the problem. For example:
$ aws eks create-addon --cluster-name ... \
--addon-name aws-mountpoint-s3-csi-driver \
--service-account-role-arn ... \
--configuration-values '{"node":{"tolerateAllTaints":true}}'
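If the driver is installed through its Helm chart rather than as the EKS addon, the same options can be set as chart values. A minimal sketch (the toleration shown is only an example and should match the taints actually present on your nodes):

node:
  tolerateAllTaints: true
  # Alternatively, tolerate only specific taints instead of all of them:
  # tolerations:
  #   - key: nvidia.com/gpu
  #     operator: Exists
  #     effect: NoSchedule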
Could you please try upgrading to v1.8.0 with a toleration config to see if that solves the problem?
Hey @unexge
I also encountered a similar problem on EKS spot instances. I updated the plugin to v1.8.0 and installed it with the node.tolerateAllTaints = true parameter, but on some nodes I still get an error in ArgoCD:
MountVolume.MountDevice failed for volume "playing-albatross-notebooks-storage" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name s3.csi.aws.com not found in the list of registered CSI drivers
/kind bug
What happened? When using the Bottlerocket AMI with a Karpenter NodeClass, describing the pod shows a MountVolume.MountDevice warning in the events: driver name s3.csi.aws.com not found in the list of registered CSI drivers.
This warning does not appear when using the AL2 AMI. However, even with the warning, I am still able to read data from the S3 mountpoint.
What you expected to happen? No warning messages.
How to reproduce it (as minimally and precisely as possible)?
Anything else we need to know?:
Environment
Kubernetes version (use kubectl version): v1.28