GoogleCloudPlatform / container-engine-accelerators

Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine
Apache License 2.0

Downloading driver fails on a K8S 1.18 GKE Cluster #177

Open sbrunk opened 3 years ago

sbrunk commented 3 years ago

Using daemonset-nvidia-v450.yaml fails with a 403 error in a cluster running version 1.18.14-gke.1200. daemonset-preloaded.yaml works fine in a 1.17 cluster, but it also fails in a 1.18 cluster.

I've only captured the log of the v450 installer:

+ COS_KERNEL_INFO_FILENAME=kernel_info
+ COS_KERNEL_SRC_HEADER=kernel-headers.tgz
+ TOOLCHAIN_URL_FILENAME=toolchain_url
+ TOOLCHAIN_ENV_FILENAME=toolchain_env
+ TOOLCHAIN_PKG_DIR=/build/cos-tools
+ CHROMIUMOS_SDK_GCS=https://storage.googleapis.com/chromiumos-sdk
+ ROOT_OS_RELEASE=/root/etc/os-release
+ KERNEL_SRC_HEADER=/build/usr/src/linux
+ NVIDIA_DRIVER_VERSION=450.51.06
+ NVIDIA_DRIVER_MD5SUM=
+ NVIDIA_INSTALL_DIR_HOST=/home/kubernetes/bin/nvidia
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
+ ROOT_MOUNT_DIR=/root
+ CACHE_FILE=/usr/local/nvidia/.cache
+ LOCK_FILE=/root/tmp/cos_gpu_installer_lock
+ LOCK_FILE_FD=20
+ set +x
[INFO    2021-01-22 22:11:36 UTC] PRELOAD: false
[INFO    2021-01-22 22:11:36 UTC] Running on COS build id 13310.1041.38
[INFO    2021-01-22 22:11:36 UTC] Data dependencies (e.g. kernel source) will be fetched from https://storage.googleapis.com/cos-tools/13310.1041.38
[INFO    2021-01-22 22:11:36 UTC] Checking if this is the only cos-gpu-installer that is running.
[INFO    2021-01-22 22:11:36 UTC] Checking if third party kernel modules can be installed
/tmp/esp /
/
[INFO    2021-01-22 22:11:36 UTC] Checking cached version
[INFO    2021-01-22 22:11:36 UTC] Cache file /usr/local/nvidia/.cache not found.
[INFO    2021-01-22 22:11:36 UTC] Did not find cached version, building the drivers...
[INFO    2021-01-22 22:11:36 UTC] Downloading GPU installer ...
/usr/local/nvidia /
[INFO    2021-01-22 22:11:37 UTC] Downloading from https://storage.googleapis.com/nvidia-drivers-eu-public/nvidia-cos-project/85/tesla/450_00/450.51.06/NVIDIA-Linux-x86_64-450.51.06_85-13310-1041-38.cos
[INFO    2021-01-22 22:11:37 UTC] Downloading GPU installer from https://storage.googleapis.com/nvidia-drivers-eu-public/nvidia-cos-project/85/tesla/450_00/450.51.06/NVIDIA-Linux-x86_64-450.51.06_85-13310-1041-38.cos
curl: (22) The requested URL returned error: 403
ruiwen-zhao commented 3 years ago

Hi @sbrunk, the GKE nodes have the driver installer images preloaded, so you should just use daemonset-preloaded.yaml and it should work. Can you paste the error you see when running it on a 1.18 cluster?
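
For reference, the preloaded daemonset is usually applied with a command like the one below; the raw.githubusercontent.com path is the one used in the GKE GPU documentation, shown here as a sketch:

# Deploys the NVIDIA driver installer DaemonSet that relies on the driver
# images preloaded on GKE COS nodes.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml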

pradvenkat commented 3 years ago

Also, based on https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers, on GKE 1.18, the default nvidia driver is already updated to 450.51.06, so you probably don't need this.

sbrunk commented 3 years ago

I just gave it another try using daemonset-preloaded.yaml in a fresh 1.18 cluster and it worked fine this time. So I must have done something wrong before.

Thanks for your help @ruiwen-zhao @pradvenkat and sorry for the noise.

lopeg commented 3 years ago

I have had the same problem: just using daemonset-preloaded.yaml did not help. The "fun" part is that I could download the file by its URL from a container on the node, but the driver installation still failed. You could say the download is not needed and the drivers are already there, but even deployments requesting GPU resources failed to be scheduled on the node. I had to roll back to 1.17.

ruiwen-zhao commented 3 years ago

> I have had the same problem: just using daemonset-preloaded.yaml did not help. The "fun" part is that I could download the file by its URL from a container on the node, but the driver installation still failed. You could say the download is not needed and the drivers are already there, but even deployments requesting GPU resources failed to be scheduled on the node. I had to roll back to 1.17.

Hi @lopeg, thanks for bringing up the issue. Can you paste the error messages here, i.e. what did you see when the driver installation failed? And when the GPU pods failed to be scheduled, was that because the node did not discover any GPUs?

lopeg commented 3 years ago

@ruiwen-zhao

Can you paste the error messages here?

I saw that the daemonset pods were hanging in PodInitializing. Please find the pod log error in the attached screenshot. At the same time, I was able to download the file from a busybox container on the same node.

And the GPU pods failed to be scheduled, is this because the node does not discover GPUs?

I guess so, yes: a deployment was immediately marked as unschedulable because the requested resources could not be matched. The same YAML manifest applied successfully before the upgrade to 1.18 and again after the downgrade to 1.17.
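
A quick way to confirm whether a node is advertising GPUs to the scheduler (node and pod names below are placeholders; this assumes the standard nvidia.com/gpu resource exposed by the device plugin):

# If the driver installer never completed, nvidia.com/gpu is usually missing
# (or 0) from the node's allocatable resources, and GPU pods stay Pending.
kubectl describe node gke-example-gpu-node | grep -i "nvidia.com/gpu"

# The scheduling events on the pending pod show why it cannot be placed.
kubectl describe pod example-gpu-pod | grep -A 5 "Events:"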

kakaxilyp commented 3 years ago

@ruiwen-zhao We also ran into the exact same issue as @lopeg described after upgrading the cluster to 1.18, with exactly the same symptoms. I'm wondering if there are any updates on this issue?

(Screenshots attached: Screen Shot 2021-02-23 at 12 21 52, 12 22 11, and 12 22 28.)

ruiwen-zhao commented 3 years ago

Hi @kakaxilyp-dawnlight and @lopeg, sorry for the late response. I have tried creating a GPU cluster with the same cluster version (1.18.12-gke.1210) but cannot reproduce the issue. The installer worked fine.

The GCS bucket cos-tools is public, so you should be able to access it. The only thing that might block your access is an insufficient access scope. Can you please check the access scopes of your node pools? You can do so by going to the Node pool details page and checking Access scopes under the Security section. We want to make sure Storage Read access is there.
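
If you prefer the CLI, the configured scopes can also be read with gcloud; the pool, cluster, and zone names below are placeholders:

# Prints the OAuth scopes configured on the node pool. Look for
# https://www.googleapis.com/auth/devstorage.read_only (Storage Read).
gcloud container node-pools describe example-gpu-pool \
  --cluster example-cluster --zone us-central1-c \
  --format="value(config.oauthScopes)"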

kakaxilyp commented 3 years ago

@ruiwen-zhao thanks for the quick reply. This is the service account for my GPU node pool and its scopes; it seems OK to me:

(Screenshots attached: Screen Shot 2021-02-23 at 14 21 57 and 14 22 23.)

I also quickly confirmed @lopeg's workaround, rolling back to 1.17 did make the issue disappear.

ruiwen-zhao commented 3 years ago

Yeah, I was expecting to see the Access scopes field under Security. Something like this: (screenshot attached, Screen Shot 2021-02-23 at 2 39 28 PM).

Can you try creating a new cluster, or a new node pool under the existing cluster, using the default scopes (most importantly devstorage.read_only), and checking whether the installer can be downloaded? See https://cloud.google.com/sdk/gcloud/reference/container/node-pools/create#--scopes
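
As a rough sketch (cluster name, pool name, zone, and accelerator type below are placeholders), that could look like:

# Creates a GPU node pool with the GKE default scopes; the gke-default alias
# expands to the standard defaults, which include devstorage.read_only.
gcloud container node-pools create gpu-pool-default-scopes \
  --cluster example-cluster --zone us-central1-c \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --scopes gke-default \
  --num-nodes 1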

k3rn31 commented 3 years ago

Hi, we have the same problem, also on newly created channels. There is a 403 Forbidden error when it tries to download the Nvidia drivers (which are publicly available).

failed to download GPU driver installer: failed to download GPU driver installer version 450.51.06: failed to download GPU driver installer, status: 403 Forbidden

We solved it by recreating the node pool with the storage-ro scope. It looks like the default doesn't work with the new version. Is this a bug on the GCP side?

kakaxilyp commented 3 years ago

@ruiwen-zhao Thanks for the example, but we were using the ContainerNodePool CRD to create node pools, and we didn't want to use the default Compute Engine service account for the node pools. It looks like the access scope doesn't apply to non-default service accounts. Do you have any suggestions on how to make the nvidia-driver-installer work in this case?

ruiwen-zhao commented 3 years ago

> @ruiwen-zhao Thanks for the example, but we were using the ContainerNodePool CRD to create node pools, and we didn't want to use the default Compute Engine service account for the node pools. It looks like the access scope doesn't apply to non-default service accounts. Do you have any suggestions on how to make the nvidia-driver-installer work in this case?

The driver installer needs this storage read access scope, so you will need to add the scope to the node pool to make the installer work. Is it possible for you to recreate the node pool, specifying both the service account and the scope, as documented here? https://cloud.google.com/sdk/gcloud/reference/container/node-pools/create#--scopes

Sorry, I am not familiar with the ContainerNodePool CRD, but if ContainerNodePool doesn't allow you to specify access scopes for a non-default SA, then you could probably create a support case and have the experts look into it.

ruiwen-zhao commented 3 years ago

> Hi, we have the same problem, also on newly created channels. There is a 403 Forbidden error when it tries to download the Nvidia drivers (which are publicly available).
>
> failed to download GPU driver installer: failed to download GPU driver installer version 450.51.06: failed to download GPU driver installer, status: 403 Forbidden
>
> We solved it by recreating the node pool with the storage-ro scope. It looks like the default doesn't work with the new version. Is this a bug on the GCP side?

There is a recent change to the driver installer that requires the Storage Read scope, and that change is rolled out with 1.18 clusters.
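
That would also explain why the files can be fetched manually even though the installer fails: the objects are public, but a request authenticated with a token that lacks a storage scope is rejected. A rough way to see the difference from a node, assuming the new installer authenticates its downloads with the VM's metadata-server token (the URL is taken from the installer log above):

# Anonymous request against a public object: succeeds regardless of scopes.
curl -sI https://storage.googleapis.com/cos-tools/13310.1041.38/toolchain_url | head -1

# Same request authenticated with the node's token: if the VM was created
# without the devstorage.read_only (or broader) scope, Cloud Storage will
# typically reject the token and return 403 instead of 200.
TOKEN=$(curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" \
  | sed 's/.*"access_token":"\([^"]*\)".*/\1/')
curl -sI -H "Authorization: Bearer ${TOKEN}" \
  https://storage.googleapis.com/cos-tools/13310.1041.38/toolchain_url | head -1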

kakaxilyp commented 3 years ago

@ruiwen-zhao Creating a node pool with the default service account and the storage read scope does solve the issue, but in our case we were using a non-default service account, and it looks like the access scope doesn't apply to node pools using non-default service accounts (not only with the ContainerNodePool CRD; I might be wrong, but I wasn't able to find a clear answer on this). The non-default service account we used did have IAM roles with storage read access, but it didn't work for the installer. If I understand this correctly, does it mean that the latest driver installer won't work with node pools using non-default service accounts anymore?

k3rn31 commented 3 years ago

> There is a recent change to the driver installer that requires the Storage Read scope, and that change is rolled out with 1.18 clusters.

Understood. Shouldn't this newly required scope be the default? Or at least be documented? I could not find any reference to it in the documentation, and we had quite a headache figuring this out ;) We are using the default service account, but without the --scopes flag during node pool creation it doesn't work.

ruiwen-zhao commented 3 years ago

> @ruiwen-zhao Creating a node pool with the default service account and the storage read scope does solve the issue, but in our case we were using a non-default service account, and it looks like the access scope doesn't apply to node pools using non-default service accounts (not only with the ContainerNodePool CRD; I might be wrong, but I wasn't able to find a clear answer on this). The non-default service account we used did have IAM roles with storage read access, but it didn't work for the installer. If I understand this correctly, does it mean that the latest driver installer won't work with node pools using non-default service accounts anymore?

Access scopes are set at the VM level, and a more restrictive access scope will restrict a VM's access even if your service account would allow it. (See https://cloud.google.com/compute/docs/access/service-accounts#service_account_permissions.)
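
Because the scopes live on the VM itself, one way to see what a node actually has, regardless of which service account it runs as, is to query the metadata server from the node; a minimal sketch:

# Run on (or via SSH to) a GKE node: lists the OAuth scopes granted to the
# VM's attached service account. The driver installer needs
# https://www.googleapis.com/auth/devstorage.read_only (or a broader scope
# such as cloud-platform) to show up here.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes"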

Regarding the support of access scopes on non-default SAs: access scopes do apply to non-default SAs. I have run a quick test myself:

gcloud container clusters create non-default-sa-2 \
  --zone us-central1-c \
  --image-type "COS_CONTAINERD" \
  --scopes "https://www.googleapis.com/auth/cloud-platform" \
  --num-nodes 1 \
  --service-account non-default-sa@ruiwen-gke-dev.iam.gserviceaccount.com
...
kubeconfig entry generated for non-default-sa-2.
NAME              LOCATION       MASTER_VERSION    MASTER_IP      MACHINE_TYPE  NODE_VERSION      NUM_NODES  STATUS
non-default-sa-2  us-central1-c  1.18.12-gke.1210  34.121.161.29  e2-medium     1.18.12-gke.1210  1          RUNNING

Can you paste the error you saw when you applied the access scope to the non-default SA? We might need to involve someone who's more familiar with this aspect if we can't solve the problem here.

kakaxilyp commented 3 years ago

For example, when trying to create a node pool through the UI console with a non-default service account, it seems the access scope configuration isn't supported?

(Screenshot attached: Screen Shot 2021-02-25 at 11 12 30.)

ruiwen-zhao commented 3 years ago

> For example, when trying to create a node pool through the UI console with a non-default service account, it seems the access scope configuration isn't supported?
>
> (Screenshot attached: Screen Shot 2021-02-25 at 11 12 30.)

Hi, can you try doing this through gcloud?

gcloud container node-pools create non-default-pool \
  --zone us-central1-c \
  --image-type "COS_CONTAINERD" \
  --scopes "https://www.googleapis.com/auth/cloud-platform" \
  --num-nodes 1 \
  --service-account non-default-sa@ruiwen-gke-dev.iam.gserviceaccount.com \
  --cluster non-default-sa-2
...
NAME              MACHINE_TYPE  DISK_SIZE_GB  NODE_VERSION
non-default-pool  e2-medium     100           1.18.12-gke.1210

kakaxilyp commented 3 years ago

@ruiwen-zhao Thanks for the example. I just confirmed that using gcloud to create a node pool with a non-default service account and the necessary access scopes works fine with the installer on a 1.18 cluster, and we also figured out how to configure the scopes through the CRDs. So it looks like it's only the GCP console that doesn't support it?

lopeg commented 3 years ago

I have found that this setting also differs between the UI and Terraform: the default oauth_scopes in the UI include read access to storage, while the default values in Terraform (and probably in the CLI) do not. You have to specify the following explicitly:

oauth_scopes = [
  "https://www.googleapis.com/auth/devstorage.read_only",
  "https://www.googleapis.com/auth/logging.write",
  "https://www.googleapis.com/auth/monitoring",
  "https://www.googleapis.com/auth/service.management.readonly",
  "https://www.googleapis.com/auth/servicecontrol",
  "https://www.googleapis.com/auth/trace.append"
]

ipclaudio commented 3 years ago

Same issue:

(screenshot of the installer error attached)

What is the solution?

lopeg commented 3 years ago

@ipclaudio the solution is to explicitly specify the oauth_scopes, as shown above.