FIRST-Tech-Challenge / fmltc

FIRST Machine Learning Toolchain

Can't train model, no K80 accelerators #318

Open acharraggi opened 1 year ago

acharraggi commented 1 year ago

When I click the Start Training button for a dataset, I get the error "The request for 1 K80 accelerators exceeds the allowed maximum of 0".

I've tried with my App Engine in us-east1 and in us-west1.

I've seen the Google page that lists which zones in a region have K80s, but I don't see any way to specify a zone, just a region. And no region has K80s in all zones.

So how do I make this work? Do you have this working somewhere, and if so, in which region?

Or is there a way to force the training to occur in a specific region and zone?

acharraggi commented 1 year ago

I just tried again in asia-east1 and it failed again. Does anyone have this working this year?

I think something is directing it to a training region that no longer has K80 quota; see the JSON blob below from my Logs Explorer, which shows the request with "region": "us-central1".

I never chose us-central1 for any part of this project, so this is something internal to the project. Maybe it's hidden in the Docker image.

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "code": 8,
      "message": "Quota failure for project fmltc-400415. The request for 1 K80 accelerators exceeds the allowed maximum of 0 A100, 0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V2_POD, 0 TPU_V3, 0 TPU_V3_POD, 0 V100 accelerators. To read more about Cloud ML Engine quota, see https://cloud.google.com/ml-engine/quotas."
    },
    "authenticationInfo": {
      "principalEmail": "fmltc-400415-service-account@fmltc-400415.iam.gserviceaccount.com",
      "serviceAccountKeyName": "//iam.googleapis.com/projects/fmltc-400415/serviceAccounts/fmltc-400415-service-account@fmltc-400415.iam.gserviceaccount.com/keys/e36e85906e35a04ed01dbb7f89bf9640755c149c"
    },
    "requestMetadata": {
      "callerIp": "2600:1900:2000:45::1:1c01",
      "callerSuppliedUserAgent": "(gzip),gzip(gfe)",
      "requestAttributes": {
        "time": "2023-09-29T13:47:16.789373Z",
        "auth": {}
      },
      "destinationAttributes": {}
    },
    "serviceName": "ml.googleapis.com",
    "methodName": "google.cloud.ml.v1.JobService.CreateJob",
    "authorizationInfo": [
      {
        "resource": "projects/fmltc-400415",
        "permission": "ml.jobs.create",
        "granted": true,
        "resourceAttributes": {}
      }
    ],
    "resourceName": "projects/fmltc-400415",
    "request": {
      "@type": "type.googleapis.com/google.cloud.ml.v1.CreateJobRequest",
      "job": {
        "jobId": "eval_9531170dc309491094f79d747f786dc9",
        "trainingInput": {
          "args": [
            "--model_dir",
            "gs://fmltc-400415-blobs/2023_2024/models/ca1488414f7b491ea72b2ec87b973275/9531170dc309491094f79d747f786dc9",
            "--pipeline_config_path",
            "gs://fmltc-400415-blobs/2023_2024/models/ca1488414f7b491ea72b2ec87b973275/9531170dc309491094f79d747f786dc9/pipeline.config",
            "--checkpoint_dir",
            "gs://fmltc-400415-blobs/2023_2024/models/ca1488414f7b491ea72b2ec87b973275/9531170dc309491094f79d747f786dc9"
          ],
          "jobDir": "gs://fmltc-400415-blobs/2023_2024/models/ca1488414f7b491ea72b2ec87b973275/9531170dc309491094f79d747f786dc9",
          "scaleTier": "BASIC_GPU",
          "masterConfig": {
            "imageUri": "gcr.io/fmltc-400415/object_detection:2021_11_25"
          },
          "region": "us-central1"
        }
      },
      "parent": "projects/fmltc-400415"
    },
    "resourceLocation": {
      "currentLocations": [
        "us-central1"
      ]
    }
  },
  "insertId": "1rjwqa0d1cfo",
  "resource": {
    "type": "audited_resource",
    "labels": {
      "project_id": "fmltc-400415",
      "method": "google.cloud.ml.v1.JobService.CreateJob",
      "service": "ml.googleapis.com"
    }
  },
  "timestamp": "2023-09-29T13:47:16.772148Z",
  "severity": "ERROR",
  "logName": "projects/fmltc-400415/logs/cloudaudit.googleapis.com%2Factivity",
  "receiveTimestamp": "2023-09-29T13:47:17.694699358Z"
}

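For anyone comparing their own log entries against this one, the fields that matter are buried fairly deep. The following is a minimal Python sketch, not part of fmltc, that pulls the requested region, the scale tier, and the per-accelerator limits out of an audit log entry shaped like the one above. The function name `summarize_quota_failure` is made up for illustration, and the parsing of the quota message assumes the exact wording shown in this log ("allowed maximum of ... accelerators"), which may differ in other entries.

```python
import re


def summarize_quota_failure(entry: dict) -> dict:
    """Summarize a CreateJob audit log entry that failed on accelerator quota.

    `entry` is the parsed JSON log entry (for example from json.load).
    Assumes the structure and message wording seen in the entry above.
    """
    payload = entry["protoPayload"]
    training_input = payload["request"]["job"]["trainingInput"]
    message = payload["status"]["message"]

    # The limits appear as "<count> <accelerator>" pairs between
    # "allowed maximum of" and the following " accelerators".
    allowed = {}
    _, found, tail = message.partition("allowed maximum of")
    if found:
        segment = tail.split(" accelerators", 1)[0]
        for count, name in re.findall(r"(\d+) ([A-Z0-9_]+)", segment):
            allowed[name] = int(count)

    return {
        "region": training_input.get("region"),
        "scale_tier": training_input.get("scaleTier"),
        "allowed": allowed,
    }
```

Calling it on the entry above, saved to a file and loaded with `json.load`, would report the region the job was actually submitted to and the quota limits the service reported for that request, regardless of which region App Engine is deployed in.
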
acharraggi commented 11 months ago

Still no success. I replaced all the us-central references I could find anywhere in the fmltc files with asia-east1, and a couple of zone references with asia-east1-b, since that region and that particular zone have lots of K80s. But it still fails.
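
A manual search like this is easy to get wrong, so here is a minimal Python sketch for listing every remaining occurrence of a region string in a local checkout. It is only an illustration: the function name `find_region_references` and the default search string are assumptions, and it only covers files on disk, so anything baked into an already-built Docker image would not show up.

```python
import os


def find_region_references(root, needle="us-central"):
    """Return (path, line_number, line_text) for each line containing `needle`."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control metadata so only project files are searched.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # errors="ignore" lets binary files be scanned without raising.
                with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                    for line_no, line in enumerate(fh, start=1):
                        if needle in line:
                            hits.append((path, line_no, line.strip()))
            except OSError:
                # Unreadable files are skipped rather than aborting the scan.
                continue
    return hits
```

Running `find_region_references(".")` from the root of the checkout and printing each tuple gives a list to work through; an empty list means no matching string is left in the files that were searched.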

I suspect something else is trying to run the machine learning in a region/zone that doesn't have K80s. Or the K80s that it's trying to use are NOT part of the quotas that I'm seeing as available (since some seem to be dedicated to particular uses).

Probably something in the Docker image deployment, but I can't see anything else to change.

I had hoped to get this working for a season start event that's already passed, and I've got a workshop upcoming at month end. I don't want to use up any team's allocation of training hours since this is not team specific work.

Has anyone got this working for the Centerstage season?

texasdiaz commented 11 months ago

If you're doing a workshop, contact me at ftctech@firstinspires.org and we can talk. Subject to approval, we can credit your account for the time you use for the workshop (talk to us first before doing this).