Closed DaveWhitmer closed 2 years ago
I'm not sure what the problem is, but you might be able to get around it by running your jobs on GPU instead of TPU. They'll run slower, but it might work better.
To do that, go to https://console.cloud.google.com/firestore/data?project=
I'll investigate the problem with TPU.
I can't reproduce the problem on my own GCP project. Is this something that happens all the time on your GCP project? Can you post more of the logs? Is there a stack trace near the "Module raised an exception for failing to TPU node IP" error?
It did not happen when I first tried it. I successfully built a model.
Other than these errors, all I see is a warning and a deprecation notice. I'll try the GPU and let you know.
Is your repo up-to-date? On November 25th I merged a pull request that fixed some problems we were having with training jobs failing.
With that pull request, training on GPU uses a Docker image, so the README was changed to add steps for deploying the Docker image to Google's Container Registry.
That pull request also changes the code we use for training on TPU, although it doesn't use a docker image; it uses a tar.gz file.
I suggest you update your repo to the latest and then rerun all of the deploy_... scripts.
Thank you. I should have thought to check that. I'll get that done and report back.
I can open a new issue if you would like. After a completely fresh install of the latest release into a new project, I cannot upload videos to the new install. The error from the log, posted below, appears to be related to max instances. While researching it, I checked and no max instances value is set. Also, the old instance running the previous version still uploads videos just fine.
{ "textPayload": "The request was aborted because there was no available instance. Additional troubleshooting documentation can be found at: https://cloud.google.com/functions/docs/troubleshooting#scalability", "insertId": "61c71faf000625be5559db07", "httpRequest": { "requestMethod": "POST", "requestUrl": "https://bdff49c523a46334a80dbf38f8ac5420-dot-t5fce678af9622bd4p-tp.appspot.com/_ah/push-handlers/pubsub/projects/t5fce678af9622bd4p-tp/topics/cloud-functions-gip7atb43krcpn4pebqinfkkuq", "requestSize": "2422", "status": 500, "userAgent": "CloudPubSub-Google", "remoteIp": "2002:a05:6681:321a:b0:b6:249d:9acf", "latency": "0s", "protocol": "HTTP/1.1" }, "resource": { "type": "cloud_function", "labels": { "project_id": "ftc18053-ml-2", "function_name": "perform_action", "region": "us-central1" } }, "timestamp": "2021-12-25T13:42:07.402878Z", "severity": "ERROR", "logName": "projects/ftc18053-ml-2/logs/cloudfunctions.googleapis.com%2Frequests", "trace": "projects/ftc18053-ml-2/traces/97a7e1ab9517f69f7e6c787a7fc0d482", "receiveTimestamp": "2021-12-25T13:42:07.406309207Z" }
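For anyone reading a log entry like the one above, a minimal Python sketch for pulling out the fields that usually matter (severity, HTTP status, function name, region, project) might look like this. The entry is a trimmed copy of the log above with the `textPayload` shortened, and `summarize` is a hypothetical helper name, not something from the project.

```python
import json

# Trimmed copy of the log entry above. Only the fields used below are kept,
# and textPayload is shortened to its first sentence.
RAW_ENTRY = """
{
  "textPayload": "The request was aborted because there was no available instance.",
  "httpRequest": {"requestMethod": "POST", "status": 500},
  "resource": {
    "type": "cloud_function",
    "labels": {
      "project_id": "ftc18053-ml-2",
      "function_name": "perform_action",
      "region": "us-central1"
    }
  },
  "timestamp": "2021-12-25T13:42:07.402878Z",
  "severity": "ERROR"
}
"""


def summarize(entry: dict) -> dict:
    """Hypothetical helper: collect the fields worth checking first."""
    labels = entry.get("resource", {}).get("labels", {})
    return {
        "severity": entry.get("severity"),
        "status": entry.get("httpRequest", {}).get("status"),
        "function": labels.get("function_name"),
        "region": labels.get("region"),
        "project": labels.get("project_id"),
    }


summary = summarize(json.loads(RAW_ENTRY))
print(summary)
```

In this entry the `region` label reads `us-central1` and the project is `ftc18053-ml-2`, so these are the kind of fields worth comparing against the region and project you intended to deploy to.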
I'm going to reinstall and try a different region. The old instance was using us-central and the new one was using us-east4. I'll report back.
I've done 4 completely fresh tries and I cannot get another working version. I get this error on the deploy_cloud_function:
When I inspect the logs, though, there are no errors, and it says the build succeeded.
Just to make sure, I still completed the remaining instructions and I get a 502 Bad Gateway error when accessing the site.
Can you check your App Engine logs? Usually with a 502, there is a Python stack trace logged. That stack trace might also be what's preventing the cloud function from deploying.
Also, can you do git log and git status and make sure you are at commit 53ef4cb3f30f095938ea8d607dccf9dfbe4b73f9 and you don't have any local changes?
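If it helps, the same check can be scripted. This is a rough Python sketch, assuming `git` is on your PATH; `repo_matches` and `check_repo` are hypothetical helper names, and the commit hash is the one quoted above. It uses `git rev-parse HEAD` and `git status --porcelain` rather than reading the `git log` output by eye.

```python
import subprocess

# Commit hash quoted in the comment above.
EXPECTED_COMMIT = "53ef4cb3f30f095938ea8d607dccf9dfbe4b73f9"


def repo_matches(head: str, porcelain: str, expected: str = EXPECTED_COMMIT) -> bool:
    """Return True when HEAD equals the expected commit and there are no local changes.

    `porcelain` is the output of `git status --porcelain`, which is empty
    when the working tree has no modified, staged, or untracked files.
    """
    return head.strip() == expected and porcelain.strip() == ""


def check_repo(path: str = ".") -> bool:
    """Hypothetical helper: run git inside `path` and compare with the expected commit."""
    head = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=path, capture_output=True, text=True, check=True,
    ).stdout
    porcelain = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=path, capture_output=True, text=True, check=True,
    ).stdout
    return repo_matches(head, porcelain)
```

Calling `check_repo()` from the root of your checkout returns `True` only if you are on that commit with a clean working tree. Note that untracked files also count as local changes here.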
I just deployed everything (to my existing google cloud project, not a new one) and it all worked fine.
I just got it all working, and I found the error that was preventing the redeployment. I'm posting it here for anyone else who may have the issue. Even though I did it right the first time, when I was trying to deploy to a different region I kept changing the editor role to the secrets accessor role instead of adding the secrets accessor role alongside it. After deploying to Oregon instead of Virginia, I can now upload a video. I still can't upload to the Virginia project, so I'm deleting that one.
And now have gotten all the way through producing a model. Thank you so much for the help. This is such a great project. Can't wait to watch the kids with it!
Oh, that is excellent!!!
Our model training jobs are failing, with 2 noticeable errors in the logs:
{ "insertId": "op5hl1g1bhcbgu", "jsonPayload": { "lineno": 974, "message": "Module raised an exception for failing to TPU node IP: /root/.local/lib/python3.7/site-packages/tensorflow/core/kernels/libtfkernel_sobol_op.so: undefined symbol: _ZN10tensorflow14kernel_factory17OpKernelRegistrar12InitInternalEPKNS_9KernelDefEN4absl12lts_2021032411string_viewESt10unique_ptrINS0_15OpKernelFactoryESt14default_deleteIS9_EE.", "created": 1640311585.6053097, "pathname": "/runcloudml.py", "levelname": "ERROR" }, "resource": { "type": "ml_job", "labels": { "task_name": "master-replica-0", "project_id": "ftc18053-ml", "job_id": "train_32b464d7df9948a99ee821998ae95158" } }, "timestamp": "2021-12-24T02:06:25.605309724Z", "severity": "ERROR", "labels": { "compute.googleapis.com/resource_name": "cmle-training-992395368558587168", "compute.googleapis.com/resource_id": "1015924459236267751", "ml.googleapis.com/trial_id": "", "ml.googleapis.com/trial_type": "", "ml.googleapis.com/job_id/log_area": "root", "compute.googleapis.com/zone": "us-central1-b" }, "logName": "projects/ftc18053-ml/logs/master-replica-0", "receiveTimestamp": "2021-12-24T02:06:27.410491943Z" }
{ "textPayload": "Internal error occurred for the current attempt.", "insertId": "uj8mydd16b1", "resource": { "type": "ml_job", "labels": { "task_name": "service", "project_id": "ftc18053-ml", "job_id": "train_32b464d7df9948a99ee821998ae95158" } }, "timestamp": "2021-12-24T02:06:51.449631524Z", "severity": "ERROR", "labels": { "ml.googleapis.com/endpoint": "" }, "logName": "projects/ftc18053-ml/logs/ml.googleapis.com%2Ftrain_32b464d7df9948a99ee821998ae95158", "receiveTimestamp": "2021-12-24T02:06:52.203683419Z" }
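The first error message above is long, so here is a small Python sketch that splits it into the shared library that failed to load and the symbol it could not find. The regular expressions and variable names are my own choices for illustration. The sketch also pulls out the Abseil LTS tag embedded in the mangled symbol, which could be useful when comparing installed versions; whether a version mismatch is the actual cause is an assumption, not something the logs confirm.

```python
import re

# "message" field copied from the first log entry above.
MESSAGE = (
    "Module raised an exception for failing to TPU node IP: "
    "/root/.local/lib/python3.7/site-packages/tensorflow/core/kernels/"
    "libtfkernel_sobol_op.so: undefined symbol: "
    "_ZN10tensorflow14kernel_factory17OpKernelRegistrar12InitInternalEPKNS_"
    "9KernelDefEN4absl12lts_2021032411string_viewESt10unique_ptrINS0_"
    "15OpKernelFactoryESt14default_deleteIS9_EE."
)

# Capture the .so path and the mangled C++ symbol name that follows it.
match = re.search(
    r"(?P<library>/\S+\.so): undefined symbol: (?P<symbol>\w+)", MESSAGE
)
library = match.group("library")
symbol = match.group("symbol")

# The mangled name contains an Abseil inline namespace of the form lts_YYYYMMDD.
tag_match = re.search(r"lts_(\d{8})", symbol)
abseil_tag = tag_match.group(1) if tag_match else None

print("library:", library)
print("symbol starts with:", symbol[:40])
print("abseil LTS tag:", abseil_tag)
```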
Earlier in the logs I see these errors, but it seems to move past them: