sbrunk opened this issue 3 years ago
Hi @sbrunk , the GKE nodes have the driver installer images preloaded so you should just use daemonset-preloaded.yaml, and it should work. Can you paste the error you see when running it on a 1.18 cluster?
Also, based on https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers, on GKE 1.18, the default nvidia driver is already updated to 450.51.06, so you probably don't need this.
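For reference, this is roughly how the preloaded installer is applied per the GKE docs (a minimal sketch; the manifest URL is the one published for this installer, and the grep assumes the pods carry the `nvidia-driver-installer` name):

```sh
# Apply the preloaded-driver DaemonSet
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# Watch the installer pods come up on the GPU nodes
kubectl get pods -n kube-system -o wide | grep nvidia-driver-installer
```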
I just gave it another try using daemonset-preloaded.yaml in a fresh 1.18 cluster and it worked fine this time. So I must have done something wrong before.
Thanks for your help @ruiwen-zhao @pradvenkat and sorry for the noise.
I ran into the same issue: just using daemonset-preloaded.yaml did not help. The "fun" part is that I could download the file by its link from a container on the node, yet the driver installation still failed. You could argue the download isn't needed because the drivers are already there, but deployments requesting GPU resources could not even be scheduled on the node. I had to roll back to 1.17.
Hi @lopeg thanks for bringing up the issue. Can you paste the error messages here? i.e. what did you see when the driver installation failed? And the GPU pods failed to be scheduled, is this because the node does not discover GPUs?
@ruiwen-zhao
> Can you paste the error messages here?
I saw that the daemonset pods were hanging in PodInitializing; please find the pod log error in the screenshot. At the same time, I was able to download the file from a busybox container on the same node.
> And the GPU pods failed to be scheduled, is this because the node does not discover GPUs?
I guess so, yes: a deployment was immediately marked unschedulable because the requested resources could not be matched. The same YAML manifest applied successfully before the upgrade to 1.18 and after the downgrade to 1.17.
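For anyone hitting this, a quick way to check whether the node ever exposed the GPU resource (node name is a placeholder, and the label selector assumes the labels from the installer manifest):

```sh
# If nvidia.com/gpu never shows up under Capacity/Allocatable, pods requesting
# GPUs stay Pending/unschedulable.
kubectl describe node gke-my-cluster-gpu-pool-1234 | grep -i "nvidia.com/gpu"

# Inspect the installer pods stuck in PodInitializing (label per the manifest;
# adjust if your version differs)
kubectl -n kube-system describe pod -l k8s-app=nvidia-driver-installer | tail -n 20
```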
@ruiwen-zhao We also ran into the exact same issue as @lopeg described after upgrading the cluster to 1.18, and the symptom was exactly the same. I'm wondering if there have been any updates on this issue?
Hi @kakaxilyp-dawnlight and @lopeg Sorry for the late response. I have tried creating a GPU cluster with the same cluster version (1.18.12-gke.1210) but cannot reproduce the issue. The installer worked fine.
The GCS bucket cos-tools is public, so you should be able to access it. The only thing that might block your access is an insufficient access scope. Can you please check the access scope of your node pools? You can do so by going to your Node Pool details page and checking Access scopes under the Security section. We want to make sure Storage Read access is there.
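If you prefer the CLI over the console, a rough equivalent check (cluster, pool, and zone names are placeholders):

```sh
# Print the OAuth scopes attached to the node pool; devstorage.read_only (or a
# broader scope such as cloud-platform) should appear in the output.
gcloud container node-pools describe gpu-pool \
  --cluster my-cluster --zone us-central1-c \
  --format="value(config.oauthScopes)"
```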
@ruiwen-zhao thanks for the quick reply. This is the service account for my GPU node pool and its scope; it seems OK to me:
I also quickly confirmed @lopeg's workaround: rolling back to 1.17 did make the issue disappear.
Yeah, I was expecting to see the Access scopes field under Security. Something like this:
Can you try creating a new cluster or a new node pool under the existing cluster, using the default scopes (most importantly devstorage.read_only), and checking if the installer can be downloaded? See https://cloud.google.com/sdk/gcloud/reference/container/node-pools/create#--scopes
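A minimal sketch of such a node pool, assuming placeholder names and a T4 accelerator (the `gke-default` alias expands to the default scope set, which includes devstorage.read_only):

```sh
gcloud container node-pools create gpu-pool-default-scopes \
  --cluster my-cluster --zone us-central1-c \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --num-nodes 1 \
  --scopes gke-default
```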
Hi, we have the same problem, also on newly created channels. There is a 403 Forbidden error when it tries to download the Nvidia drivers (which are publicly available).
```
failed to download GPU driver installer: failed to download GPU driver installer version 450.51.06: failed to download GPU driver installer, status: 403 Forbidden
```
We solved it by recreating the nodepool with the storage-ro scope. It looks like the default doesn't work with the new version. Is this a bug on GCP side?
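One way to confirm it is a scope issue rather than a bucket permission issue: from the affected node (or a pod with host networking), read the scopes actually granted to the VM's token from the metadata server. If no storage scope is listed, the installer's authenticated GCS download can return 403 even for public objects, which would match the behavior above.

```sh
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes"
```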
@ruiwen-zhao Thanks for the example, but we were using the ContainerNodePool CRD to create node pools, and we didn't want to use the default Compute Engine service account for the node pools. It looks like the access scope doesn't apply to non-default service accounts. Do you have any suggestions on how to make the `nvidia-driver-installer` work in this case?
The driver installer needs this storage read access scope, so you will need to add the scope to the node pool to make the installer work. Is it possible for you to recreate the node pool specifying both the service account and the scope, as documented here? https://cloud.google.com/sdk/gcloud/reference/container/node-pools/create#--scopes
Sorry, I am not familiar with the ContainerNodePool CRD, but if ContainerNodePool doesn't allow you to specify an access scope for a non-default SA, then you could probably create a support case and have experts look into it.
> Hi, we have the same problem, also on newly created channels. There is a 403 Forbidden error when it tries to download the Nvidia drivers (which are publicly available).
> `failed to download GPU driver installer: failed to download GPU driver installer version 450.51.06: failed to download GPU driver installer, status: 403 Forbidden`
> We solved it by recreating the nodepool with the storage-ro scope. It looks like the default doesn't work with the new version. Is this a bug on GCP side?
There is a recent change to the driver installer that requires the Storage Read scope, and that change is rolled out with 1.18 clusters.
@ruiwen-zhao Creating a node pool with the default service account and the storage read scope does solve the issue, but in our case we were using a non-default service account, and it looks like the access scope doesn't apply to node pools using non-default service accounts (not only with the ContainerNodePool CRD; I might be wrong, but I wasn't able to find a clear answer on this). The non-default service account we used did have IAM roles with storage read access, but that didn't work for the installer. If I understand this correctly, does it mean that the latest driver installer won't work with node pools using non-default service accounts anymore?
> There is a recent change to the driver installer that requires the Storage Read scope, and that change is rolled out with 1.18 clusters.
Understood. Shouldn't this newly required scope be the default? Or at least documented? I could not find any reference to this in the documentation, and we had quite a headache figuring this out ;) We are using the default service account, but without the --scopes flag during node pool creation it doesn't work.
> @ruiwen-zhao Creating a node pool with the default service account and the storage read scope does solve the issue, but in our case we were using a non-default service account, and it looks like the access scope doesn't apply to node pools using non-default service accounts (not only with the ContainerNodePool CRD; I might be wrong, but I wasn't able to find a clear answer on this). The non-default service account we used did have IAM roles with storage read access, but that didn't work for the installer. If I understand this correctly, does it mean that the latest driver installer won't work with node pools using non-default service accounts anymore?
Access scopes are set at the VM level, and a more restrictive access scope will restrict a VM's access even if your service account would allow it. (See https://cloud.google.com/compute/docs/access/service-accounts#service_account_permissions.)
Regarding support for access scopes on non-default SAs: access scopes do apply to non-default SAs. I ran a quick test myself:
```
gcloud container clusters create non-default-sa-2 --zone us-central1-c --image-type "COS_CONTAINERD" --scopes "https://www.googleapis.com/auth/cloud-platform" --num-nodes 1 --service-account non-default-sa@ruiwen-gke-dev.iam.gserviceaccount.com
...
kubeconfig entry generated for non-default-sa-2.
NAME              LOCATION       MASTER_VERSION    MASTER_IP      MACHINE_TYPE  NODE_VERSION      NUM_NODES  STATUS
non-default-sa-2  us-central1-c  1.18.12-gke.1210  34.121.161.29  e2-medium     1.18.12-gke.1210  1          RUNNING
```
Can you paste what error you saw when you applied the access scope to a non-default SA? We might need to involve someone who's more familiar with this aspect if we can't solve the problem here.
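Since the scope is ultimately attached to the node VM, another way to check what a node actually got (instance name and zone are placeholders):

```sh
# Lists the service account email and the scopes granted to the node VM
gcloud compute instances describe gke-my-cluster-gpu-pool-abcd1234-wxyz \
  --zone us-central1-c \
  --format="yaml(serviceAccounts)"
```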
For example, when trying to create node pools through the UI console with a non-default service account, it seems it doesn't support configuring access scopes?
Hi, can you try doing this through gcloud?
```
gcloud container node-pools create non-default-pool --zone us-central1-c --image-type "COS_CONTAINERD" --scopes "https://www.googleapis.com/auth/cloud-platform" --num-nodes 1 --service-account non-default-sa@ruiwen-gke-dev.iam.gserviceaccount.com --cluster non-default-sa-2
...
NAME              MACHINE_TYPE  DISK_SIZE_GB  NODE_VERSION
non-default-pool  e2-medium     100           1.18.12-gke.1210
```
@ruiwen-zhao Thanks for the example. I just confirmed that using gcloud to create a node pool with a non-default service account and the necessary access scopes works fine with the installer on a 1.18 cluster, and we also figured out how to configure the scopes through the CRDs. So it looks like it's only the GCP console that doesn't support it?
I have found that the setting also differs between the UI and Terraform: the default oauth_scopes in the UI include read access to storage, while the default values in Terraform (and probably in the CLI) do not. You have to specify the following explicitly:
```hcl
oauth_scopes = [
  "https://www.googleapis.com/auth/devstorage.read_only",
  "https://www.googleapis.com/auth/logging.write",
  "https://www.googleapis.com/auth/monitoring",
  "https://www.googleapis.com/auth/service.management.readonly",
  "https://www.googleapis.com/auth/servicecontrol",
  "https://www.googleapis.com/auth/trace.append"
]
```
Same issue here: what is the solution?
@ipclaudio the solution is to specify the oauth_scopes explicitly, as shown above.
Using daemonset-nvidia-v450.yaml fails due to a 403 error in a cluster with version 1.18.14-gke.1200. daemonset-preloaded.yaml works fine in a 1.17 cluster but also fails when using a 1.18 cluster. I've only captured the log of the v450 installer: