drobison00 opened 3 years ago
Hmm, I'm not sure what the issue here is. In addition to running gsutil commands, can you also try a few gcloud compute commands to create machines? Perhaps your account does not have the correct permissions to create compute instances?
@quasiben Just checked, something like this works fine with gcloud compute:
gcloud compute instances create drobisontest --project "<correct-project>" --machine-type "a2-highgpu-1g" --zone "us-central1-c" --image-family tf2-ent-2-3-cu110 --image-project deeplearning-platform-release --boot-disk-size 200GB --metadata "install-nvidia-driver=True,proxy-mode=project_editors" --scopes https://www.googleapis.com/auth/cloud-platform --maintenance-policy TERMINATE --restart-on-failure
NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS
drobisontest us-central1-c a2-highgpu-1g ..... ..... RUNNING
I ran this recently on GCP. I was unable to reproduce the RefreshError: ('invalid_scope: Invalid OAuth scope or ID token audience provided.', '{"error":"invalid_scope","error_description":"Invalid OAuth scope or ID token audience provided."}')
error. Something may have fixed it.
However, I did run into issues similar to #292. Looking at cloud-init-output.log, it appears that the scheduler VM shuts down when trying to start the daskdev/dask:latest docker image, with the following error:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
Further, I tried using a custom existing NGC image like the following:
from dask.distributed import Client, wait, get_worker
from dask_cloudprovider.gcp import GCPCluster

cluster = GCPCluster(projectid="nv-ai-infra",
                     machine_type="n1-standard-4",
                     zone="us-central1-a",
                     ngpus=1,
                     gpu_type="nvidia-tesla-v100",
                     n_workers=1,
                     source_image="projects/nvidia-ngc-public/global/images/nvidia-gpu-cloud-image-pytorch-20210609",
                     debug=True,
                     bootstrap=False,
                     silence_logs=False)
This fails with the same error. I would imagine that passing a custom image with the NVIDIA drivers preinstalled would work. Is there such an image?
Any of the RAPIDS images should be ok.
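For what it's worth, a minimal sketch of what that could look like, assuming GCPCluster's docker_image argument and using a RAPIDS image tag that is purely illustrative (check the RAPIDS release selector for a current one); the other arguments are just copied from the snippets above:

```python
from dask.distributed import Client
from dask_cloudprovider.gcp import GCPCluster

# Sketch only: the RAPIDS tag below is illustrative, not a recommendation
# of a specific release. GPU-related arguments mirror the earlier snippet
# in this thread.
cluster = GCPCluster(
    projectid="<your-project>",  # placeholder
    zone="us-central1-a",
    machine_type="n1-standard-4",
    ngpus=1,
    gpu_type="nvidia-tesla-v100",
    n_workers=1,
    docker_image="rapidsai/rapidsai:21.06-cuda11.2-runtime-ubuntu18.04-py3.8",
)

client = Client(cluster)
```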
Just wanted to mention that I ran into the same issue as @drobison00. No need for me to paste the output; it's exactly the same error.
RefreshError: ('invalid_scope: Invalid OAuth scope or ID token audience provided.', '{"error":"invalid_scope","error_description":"Invalid OAuth scope or ID token audience provided."}')
This is with the example from the docs:
from dask.distributed import Client
from dask_cloudprovider.gcp import GCPCluster

cluster = GCPCluster(projectid=[PROJECT], machine_type="n1-standard-4", zone="us-east1-b")
client = Client(cluster)
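As a side note, one way to check whether the RefreshError is coming from the underlying Google auth flow rather than from dask itself might be something like this (a rough sketch using the standard google-auth package, not anything from the dask-cloudprovider docs):

```python
import google.auth
from google.auth.transport.requests import Request

# Ask for application default credentials with the cloud-platform scope,
# the same scope the gcloud command earlier in this thread requested.
credentials, project = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)

# Refreshing the token directly should surface the same RefreshError
# if the credentials or scopes are the problem.
credentials.refresh(Request())
print(project, credentials.valid)
```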
The only way I got it to work was by granting the Service Account User IAM role to myself and the account (not sure if both were needed).

I don't know if this issue is unique to dask, though. I generally have OAuth token issues with several python libraries that try to use a subset of GCP services, particularly via the REST API, e.g. Google Sheets.
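In case it helps others, another workaround sometimes suggested for this class of OAuth problem is to point the client libraries at a service account key instead of user credentials. A hedged sketch (the key path is hypothetical, and I have not confirmed this combination with GCPCluster):

```python
import os

# Hypothetical path to a service account key that has the required roles;
# google-auth picks this up via the standard GOOGLE_APPLICATION_CREDENTIALS
# environment variable.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"

from dask_cloudprovider.gcp import GCPCluster

cluster = GCPCluster(projectid="<your-project>",
                     machine_type="n1-standard-4",
                     zone="us-east1-b")
```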
What happened: RefreshError: ('invalid_scope: Invalid OAuth scope or ID token audience provided.', '{"error":"invalid_scope","error_description":"Invalid OAuth scope or ID token audience provided."}')
What you expected to happen: Cluster creation to succeed
Minimal Complete Verifiable Example:
gcloud auth login
Environment: Conda environment