awslabs / kubeflow-manifests

KubeFlow on AWS
https://awslabs.github.io/kubeflow-manifests/
Apache License 2.0

EFS does not work for terraform deployment when using built-in EFS driver install #717

Closed AlexandreBrown closed 1 year ago

AlexandreBrown commented 1 year ago

Describe the bug: If we follow the docs and use the manual steps (or, in my case, the auto EFS script modified to only create the file system and the StorageClass), EFS creation succeeds, but when I create a test notebook its volume stays in the Pending state forever.

Events:
  Type    Reason                Age               From                         Message
  ----    ------                ----              ----                         -------
  Normal  WaitForFirstConsumer  72s               persistentvolume-controller  waiting for first consumer to be created before binding
  Normal  ExternalProvisioning  2s (x7 over 70s)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "efs.csi.aws.com" or manually created by system administrator
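
To find out which claims are stuck and which StorageClass they request, the output of `kubectl get pvc -A -o json` can be filtered on `status.phase`. The following is only a minimal sketch: the function name and the sample namespace and claim name are hypothetical, and it parses JSON that is passed in rather than calling kubectl itself.

```python
import json


def pending_claims(pvc_list_json: str) -> list:
    """Return (namespace, name, storage_class) for every PVC still in Pending."""
    items = json.loads(pvc_list_json).get("items", [])
    pending = []
    for pvc in items:
        if pvc.get("status", {}).get("phase") == "Pending":
            meta = pvc.get("metadata", {})
            pending.append((
                meta.get("namespace", ""),
                meta.get("name", ""),
                pvc.get("spec", {}).get("storageClassName", ""),
            ))
    return pending


# Hypothetical sample shaped like `kubectl get pvc -A -o json` output.
sample = json.dumps({
    "items": [
        {
            "metadata": {"namespace": "kubeflow-user-example-com", "name": "test-notebook-volume"},
            "spec": {"storageClassName": "efs-sc"},
            "status": {"phase": "Pending"},
        },
        {
            "metadata": {"namespace": "kubeflow", "name": "already-bound"},
            "spec": {"storageClassName": "gp2"},
            "status": {"phase": "Bound"},
        },
    ]
})

print(pending_claims(sample))
```

A claim that stays in this list while its events keep repeating ExternalProvisioning suggests that nothing is answering for the provisioner named in the StorageClass.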

Steps To Reproduce: Deploy EFS using the auto setup script, trimmed to the equivalent of the manual steps for the Terraform deployment:

def main():
    header()

    verify_prerequisites()

    setup_efs_file_system()
    setup_efs_provisioning()

    footer()

Environment

Screenshots image

ananth102 commented 1 year ago

Hi AlexandreBrown, does your OIDC provider have the alpha.eksctl.io/cluster-name tag? Also, do you see anything interesting in the EFS CSI driver logs or in CloudTrail (events related to EFS)? We also recommend following the manual steps for terraform.
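
For reference, the checks asked about above could be collected roughly as follows. This is a sketch under assumptions, not the project's own tooling: the deployment name `efs-csi-controller`, the container name `efs-plugin`, and the `kube-system` namespace follow the upstream EFS CSI driver manifests and may differ in a Terraform-managed install. The helper name is hypothetical, and it only builds the commands so they can be reviewed before running.

```python
import shlex


def efs_diagnostic_commands(provider_arn: str, namespace: str = "kube-system") -> list:
    """Build (but do not run) commands that inspect the OIDC tags and the driver logs."""
    return [
        # Lists the tags on the IAM OIDC provider, including
        # alpha.eksctl.io/cluster-name if it is present.
        ["aws", "iam", "list-open-id-connect-provider-tags",
         "--open-id-connect-provider-arn", provider_arn],
        # Assumed deployment name; confirms the controller exists at all.
        ["kubectl", "get", "deployment", "efs-csi-controller", "-n", namespace],
        # Assumed container name; shows recent provisioning errors, if any.
        ["kubectl", "logs", "deployment/efs-csi-controller",
         "-n", namespace, "-c", "efs-plugin", "--tail=100"],
    ]


# Placeholder ARN for illustration only.
example_arn = "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE"
for command in efs_diagnostic_commands(example_arn):
    print(shlex.join(command))
```

Each printed line can be pasted into a shell, or passed to `subprocess.run` as a list once the names have been checked against the cluster.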

AlexandreBrown commented 1 year ago

@ryansteakley My OIDC provider (created by the Terraform deployment, I suppose) has the following tag:
image
I added the tag:
image

But it did not change anything:

  Events:
  Type    Reason                Age                From                         Message
  ----    ------                ----               ----                         -------
  Normal  WaitForFirstConsumer  41s                persistentvolume-controller  waiting for first consumer to be created before binding
  Normal  ExternalProvisioning  13s (x3 over 39s)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "efs.csi.aws.com" or manually created by system administrator

"We also recommend following the manual steps for terraform."

I modified the auto script to keep only the parts that create the file system (the steps that match the manual steps).
I'm not sure why that would not work.

import argparse
import boto3
import subprocess
import string
import random
import yaml
from shutil import which
from time import sleep

def main():
    header()

    verify_prerequisites()

    setup_efs_file_system()
    setup_efs_provisioning()

    footer()

...
AlexandreBrown commented 1 year ago

@ryansteakley As I understand it, the doc says to skip all of step 1 (both 1.1 and 1.2), since the note appears directly under step 1.
Is this correct, or did it mean to skip only step 1.1?
image

AlexandreBrown commented 1 year ago

@ryansteakley After further testing, the only way I could get EFS to work was to use the auto script (no manual steps and without skipping the CSI driver install).
Maybe the driver installed by Terraform is not being used or detected? It works with the auto script (untouched from the repo), but it does not work when I run every step except the driver install.
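
One way to test the "not detected" hypothesis is to compare each StorageClass provisioner with the CSIDriver objects registered in the cluster. The sketch below assumes the JSON comes from `kubectl get storageclass -o json` and `kubectl get csidriver -o json`; the function name and sample data are hypothetical. In-tree provisioners (prefixed `kubernetes.io/`) are skipped because they have no CSIDriver object.

```python
import json


def unregistered_csi_provisioners(storageclasses_json: str, csidrivers_json: str) -> list:
    """Return (storage_class, provisioner) pairs whose CSI driver is not registered."""
    registered = {
        driver["metadata"]["name"]
        for driver in json.loads(csidrivers_json).get("items", [])
    }
    missing = []
    for sc in json.loads(storageclasses_json).get("items", []):
        provisioner = sc.get("provisioner", "")
        if provisioner.startswith("kubernetes.io/"):
            continue  # in-tree provisioner, no CSIDriver object expected
        if provisioner not in registered:
            missing.append((sc["metadata"]["name"], provisioner))
    return missing


# Hypothetical cluster state: efs-sc exists, but only the EBS driver is registered.
storageclasses = json.dumps({"items": [
    {"metadata": {"name": "gp2"}, "provisioner": "kubernetes.io/aws-ebs"},
    {"metadata": {"name": "efs-sc"}, "provisioner": "efs.csi.aws.com"},
]})
csidrivers = json.dumps({"items": [
    {"metadata": {"name": "ebs.csi.aws.com"}},
]})

print(unregistered_csi_provisioners(storageclasses, csidrivers))
```

An empty result only shows that the driver is registered. The controller can still fail to provision, for example because of missing IAM permissions, so the driver logs remain worth checking.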

The following worked (snippet from my Dockerfile):

RUN OIDC_ID=$(aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d "/" -f5) \
    && AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text) \
    && aws iam tag-open-id-connect-provider \
        --open-id-connect-provider-arn "arn:aws:iam::$AWS_ACCOUNT_ID:oidc-provider/oidc.eks.$CLUSTER_REGION.amazonaws.com/id/$OIDC_ID" \
        --tags Key="alpha.eksctl.io/cluster-name",Value="${CLUSTER_NAME}" \
    && python utils/auto-efs-setup.py \
        --region $CLUSTER_REGION \
        --cluster $CLUSTER_NAME \
        --efs_file_system_name $EFS_FILE_SYSTEM_NAME \
        --efs_security_group_name $EFS_SECURITY_GROUP_NAME \
        --efs_throughput_mode elastic \
    && kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}' \
    && kubectl patch storageclass efs-sc -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
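
The OIDC tagging part of the snippet above could also be done from Python with boto3, which the auto script already imports. This is a rough equivalent under assumptions, not code from the repository: the function names are hypothetical, the ARN is built for the standard `aws` partition only, and the provider ARN is derived by dropping the scheme from the issuer URL instead of cutting out the ID.

```python
def oidc_provider_arn(account_id: str, issuer_url: str) -> str:
    """Build the IAM OIDC provider ARN from the cluster's issuer URL."""
    # The provider path is the issuer URL without its "https://" scheme.
    host_and_path = issuer_url.split("://", 1)[-1]
    return f"arn:aws:iam::{account_id}:oidc-provider/{host_and_path}"


def tag_oidc_provider(cluster_name: str, region: str) -> str:
    """Tag the cluster's OIDC provider with alpha.eksctl.io/cluster-name."""
    import boto3  # imported here so the pure helper above works without boto3

    eks = boto3.client("eks", region_name=region)
    issuer = eks.describe_cluster(name=cluster_name)["cluster"]["identity"]["oidc"]["issuer"]
    account_id = boto3.client("sts", region_name=region).get_caller_identity()["Account"]
    arn = oidc_provider_arn(account_id, issuer)
    boto3.client("iam").tag_open_id_connect_provider(
        OpenIDConnectProviderArn=arn,
        Tags=[{"Key": "alpha.eksctl.io/cluster-name", "Value": cluster_name}],
    )
    return arn


# Placeholder account ID and issuer URL for illustration only.
print(oidc_provider_arn(
    "123456789012",
    "https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE",
))
```

Calling `tag_oidc_provider(CLUSTER_NAME, CLUSTER_REGION)` needs AWS credentials with permission to describe the cluster and tag the provider. The StorageClass patches would still be applied with kubectl as in the snippet.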