aws-quickstart / cdk-eks-blueprints


EfsCsiDriverAddOn: mount.nfs4: access denied by server while mounting 127.0.0.1:/ #1052

Open · JonVDB opened this issue 1 month ago

JonVDB commented 1 month ago

### Describe the bug

When deploying a StorageClass, PersistentVolumeClaim, and Pod while using the EfsCsiDriverAddOn to dynamically provision an EFS Access Point and mount it into the Pod, mounting fails with the error `mount.nfs4: access denied by server while mounting 127.0.0.1:/`.

### Expected Behavior

Mounting the EFS Access Point to the Pod succeeds.

### Current Behavior

Running kubectl describe pod/efs-app shows the following Event logs for the Pod:

Name:             efs-app
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-XXX-XXX-XXX-XXX.eu-west-1.compute.internal/XXX.XXX.XXX.XXX
Start Time:       Thu, 01 Aug 2024 13:08:05 +0200
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Containers:
  app:
    Container ID:
    Image:         centos
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      while true; do echo $(date -u) >> /data/out; sleep 5; done
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data from persistent-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zg68d (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  persistent-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  efs-claim
    ReadOnly:   false
  kube-api-access-zg68d:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age               From               Message
  ----     ------       ----              ----               -------
  Normal   Scheduled    18s               default-scheduler  Successfully assigned default/efs-app to ip-XXX-XXX-XXX-XXX.eu-west-1.compute.internal
  Warning  FailedMount  8s (x5 over 17s)  kubelet            MountVolume.SetUp failed for volume "pvc-XXXXXXX" : rpc error: code = Internal desc = Could not mount "fs-XXXXXXX:/" at "/var/lib/kubelet/pods/XXXXXXX/volumes/kubernetes.io~csi/pvc-XXXXXXX/mount": mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-XXXXXXX,tls fs-XXXXXXX:/ /var/lib/kubelet/pods/XXXXXXX/volumes/kubernetes.io~csi/pvc-XXXXXXX/mount
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
b'mount.nfs4: access denied by server while mounting 127.0.0.1:/'
Warning: config file does not have fips_mode_enabled item in section mount.. You should be able to find a new config file in the same folder as current config file /etc/amazon/efs/efs-utils.conf. Consider update the new config file to latest config file. Use the default value [fips_mode_enabled = False].Warning: config file does not have retry_nfs_mount_command item in section mount.. You should be able to find a new config file in the same folder as current config file /etc/amazon/efs/efs-utils.conf. Consider update the new config file to latest config file. Use the default value [retry_nfs_mount_command = True].

However, the creation of the EFS Access Point does succeed, as seen in the AWS Console and via the command kubectl describe pvc/efs-claim:

Name:          efs-claim
Namespace:     default
StorageClass:  efs-sc
Status:        Bound
Volume:        pvc-XXXXXXX
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: efs.csi.aws.com
               volume.kubernetes.io/storage-provisioner: efs.csi.aws.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      5Gi
Access Modes:  RWX
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type    Reason                 Age   From                                                                                      Message
  ----    ------                 ----  ----                                                                                      -------
  Normal  ExternalProvisioning   17s   persistentvolume-controller                                                               Waiting for a volume to be created either by the external provisioner 'efs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal  Provisioning           17s   efs.csi.aws.com_efs-csi-controller-XXXXXXX  External provisioner is provisioning volume for claim "default/efs-claim"
  Normal  ProvisioningSucceeded  17s   efs.csi.aws.com_efs-csi-controller-XXXXXXX  Successfully provisioned volume pvc-XXXXXXX

Next are the details of the StorageClass, from running kubectl describe sc/efs-sc:

Name:            efs-sc
IsDefaultClass:  No
Annotations:     kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{},"name":"efs-sc"},"parameters":{"basePath":"/dynamic_provisioning","directoryPerms":"700","ensureUniqueDirectory":"true","fileSystemId":"fs-XXXXXXX","gidRangeEnd":"2000","gidRangeStart":"1000","provisioningMode":"efs-ap","reuseAccessPoint":"false","subPathPattern":"${.PVC.namespace}/${.PVC.name}"},"provisioner":"efs.csi.aws.com"}

Provisioner:           efs.csi.aws.com
Parameters:            basePath=/dynamic_provisioning,directoryPerms=700,ensureUniqueDirectory=true,fileSystemId=fs-XXXXXXX,gidRangeEnd=2000,gidRangeStart=1000,provisioningMode=efs-ap,reuseAccessPoint=false,subPathPattern=${.PVC.namespace}/${.PVC.name}
AllowVolumeExpansion:  <unset>
ReclaimPolicy:      Delete
VolumeBindingMode:  Immediate
Events:             <none>

Lastly, I have also checked the efs-csi-controller logs using the command kubectl logs deployment/efs-csi-controller -n kube-system -c efs-plugin:

I0801 10:38:43.866497       1 config_dir.go:63] Mounted directories do not exist, creating directory at '/etc/amazon/efs'
I0801 10:38:43.867231       1 metadata.go:65] getting MetadataService...
I0801 10:38:43.868837       1 metadata.go:70] retrieving metadata from EC2 metadata service
I0801 10:38:43.871827       1 driver.go:150] Did not find any input tags.
I0801 10:38:43.872040       1 driver.go:116] Registering Node Server
I0801 10:38:43.872062       1 driver.go:118] Registering Controller Server
I0801 10:38:43.872074       1 driver.go:121] Starting efs-utils watchdog
I0801 10:38:43.872155       1 efs_watch_dog.go:216] Copying /etc/amazon/efs/efs-utils.conf since it doesn't exist
I0801 10:38:43.872242       1 efs_watch_dog.go:216] Copying /etc/amazon/efs/efs-utils.crt since it doesn't exist
I0801 10:38:43.873827       1 driver.go:127] Starting reaper
I0801 10:38:43.883901       1 driver.go:137] Listening for connections on address: &net.UnixAddr{Name:"/var/lib/csi/sockets/pluginproxy/csi.sock", Net:"unix"}
I0801 11:07:24.454475       1 controller.go:286] Using user-specified structure for access point directory.
I0801 11:07:24.454501       1 controller.go:292] Appending PVC UID to path.
I0801 11:07:24.454523       1 controller.go:310] Using /dynamic_provisioning/default/efs-claim-XXXXXXX as the access point directory. 

### Reproduction Steps

1. Deploy an EKS Blueprints stack containing only the EfsCsiDriverAddOn, a VPC resource provider, and an EFS resource provider:

```typescript
// lib/stack.ts
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import {
  EksBlueprint,
  ClusterAddOn,
  EfsCsiDriverAddOn,
  GlobalResources,
  VpcProvider,
  CreateEfsFileSystemProvider,
} from '@aws-quickstart/eks-blueprints';

export default class ClusterConstruct extends Construct {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id);

    const account = props?.env?.account!;
    const region = props?.env?.region!;

    const addOns: Array<ClusterAddOn> = [
      new EfsCsiDriverAddOn({
        replicaCount: 1
      }),
    ];

    const blueprint = EksBlueprint.builder()
      .version('auto')
      .account(account)
      .region(region)
      .resourceProvider(GlobalResources.Vpc, new VpcProvider())
      .resourceProvider("efs-file-system", new CreateEfsFileSystemProvider({ name: "efs-file-system" }))
      .addOns(...addOns)
      .build(scope, id + '-eks-efs-poc');
  }
}
```

2. Deploy the StorageClass (with the updated file system ID), PersistentVolumeClaim and Pod from the [official aws-efs-csi-driver dynamic_provisioning example](https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/examples/kubernetes/dynamic_provisioning/specs). I do this one by one, in the order mentioned. (The PVC and Pod manifests are sketched after this list for reference.)
3. Check the mount status of the Pod with `kubectl describe pod/efs-app`. Optionally also check the PVC and SC with `kubectl describe pvc/efs-claim` and `kubectl describe sc/efs-sc`.
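
For reference, the PVC and Pod manifests from that example look roughly like this. This is a sketch reconstructed from the kubectl describe output above rather than a verbatim copy of the linked example; the StorageClass parameters are visible further down in the sc/efs-sc output.

```YAML
# PersistentVolumeClaim bound by the efs-sc StorageClass
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
---
# Pod that mounts the claim at /data and appends a timestamp every 5 seconds
apiVersion: v1
kind: Pod
metadata:
  name: efs-app
spec:
  containers:
    - name: app
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo $(date -u) >> /data/out; sleep 5; done"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: efs-claim
```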

### Possible Solution

Not sure.

### Additional Information/Context

I have tried the following troubleshooting steps, all of which result in the same error:
- Manually added `mountOptions` to the provided StorageClass with `iam` and `tls` included. This was suggested in [this AWS re:Post](https://repost.aws/knowledge-center/eks-troubleshoot-efs-volume-mount-issues).
```YAML
kind: StorageClass
...
mountOptions:
  - tls
  - iam
...
```
- Attached the `AmazonElasticFileSystemClientReadWriteAccess` managed policy to the worker nodes, by creating a custom node role and passing it to a managed node group cluster provider:
```typescript
// Additional imports used here (also from '@aws-quickstart/eks-blueprints'):
// CreateRoleProvider, MngClusterProvider, MngClusterProviderProps, getNamedResource

const nodeRole = new CreateRoleProvider(
  "blueprint-node-role",
  new cdk.aws_iam.ServicePrincipal("ec2.amazonaws.com"),
  [
    cdk.aws_iam.ManagedPolicy.fromAwsManagedPolicyName("AmazonEKSWorkerNodePolicy"),
    cdk.aws_iam.ManagedPolicy.fromAwsManagedPolicyName("AmazonEC2ContainerRegistryReadOnly"),
    cdk.aws_iam.ManagedPolicy.fromAwsManagedPolicyName("AmazonSSMManagedInstanceCore"),
    cdk.aws_iam.ManagedPolicy.fromAwsManagedPolicyName("AmazonEKS_CNI_Policy"),
    cdk.aws_iam.ManagedPolicy.fromAwsManagedPolicyName("CloudWatchAgentServerPolicy"),
    cdk.aws_iam.ManagedPolicy.fromAwsManagedPolicyName("AmazonElasticFileSystemClientReadWriteAccess"), // <-
  ]
);

const mngProps: MngClusterProviderProps = {
  version: cdk.aws_eks.KubernetesVersion.of('auto'),
  instanceTypes: [new cdk.aws_ec2.InstanceType("m5.xlarge")],
  amiType: cdk.aws_eks.NodegroupAmiType.AL2_X86_64,
  nodeRole: getNamedResource("node-role") as cdk.aws_iam.Role,
  desiredSize: 2,
  maxSize: 3,
};

// ...

const blueprint = EksBlueprint.builder()
  // ...
  .clusterProvider(new MngClusterProvider(mngProps)) // <-
  .resourceProvider("node-role", nodeRole) // <-
  // ...
```


- Checked if EFS CSI Driver provisions the EFS Access Point correctly, which it does.
- Checked the EFS File System Policy, which looks alright.
- Checked if EFS is in the same VPC as the EKS Cluster, which it is.
- Checked if the EFS Security Groups allow inbound NFS (port 2049) traffic, which they do.

### CDK CLI Version

2.133.0 (build dcc1e75)

### EKS Blueprints Version

1.15.1

### Node.js Version

v20.11.0

### Environment details (OS name and version, etc.)

Win11Pro22H2

### Other information

While I'm uncertain of the exact cause, I assume it is IAM-related.
I found a similar issue on the EKS Blueprints for Terraform repository (https://github.com/aws-ia/terraform-aws-eks-blueprints/issues/1171), which has been solved (https://github.com/aws-ia/terraform-aws-eks-blueprints/pull/1191). Perhaps this has a similar cause? I believe it might be related because the mount option mentioned in that fix does not appear in the mount command in the EKS Blueprints for CDK logs above (specifically the Pod event logs).
shapirov103 commented 1 month ago

@JonVDB please check the content on EFS filesystem and EFS addon in our workshop for security patterns in EKS here: https://catalog.us-east-1.prod.workshops.aws/workshops/90c9d1eb-71a1-4e0e-b850-dba04ae92887/en-US/security/065-data-encryption/1-stack-setup

You will see steps and policies to configure your EFS filesystem with end-to-end encryption. Please let me know if that solves the issue; we can then update the docs with that reference.

JonVDB commented 1 month ago

@shapirov103 Hey, I wasn't aware that there was a Workshop for the EFS CSI Driver. I've only used the QuickStart docs. The instructions in the Workshop work perfectly! Issue solved. Thank you!