FIRST-Tech-Challenge / fmltc

FIRST Machine Learning Toolchain

Internal Error for current attempt #244

Closed DaveWhitmer closed 2 years ago

DaveWhitmer commented 2 years ago

Our model training jobs are failing, with two noticeable errors in the logs:

{ "insertId": "op5hl1g1bhcbgu", "jsonPayload": { "lineno": 974, "message": "Module raised an exception for failing to TPU node IP: /root/.local/lib/python3.7/site-packages/tensorflow/core/kernels/libtfkernel_sobol_op.so: undefined symbol: _ZN10tensorflow14kernel_factory17OpKernelRegistrar12InitInternalEPKNS_9KernelDefEN4absl12lts_2021032411string_viewESt10unique_ptrINS0_15OpKernelFactoryESt14default_deleteIS9_EE.", "created": 1640311585.6053097, "pathname": "/runcloudml.py", "levelname": "ERROR" }, "resource": { "type": "ml_job", "labels": { "task_name": "master-replica-0", "project_id": "ftc18053-ml", "job_id": "train_32b464d7df9948a99ee821998ae95158" } }, "timestamp": "2021-12-24T02:06:25.605309724Z", "severity": "ERROR", "labels": { "compute.googleapis.com/resource_name": "cmle-training-992395368558587168", "compute.googleapis.com/resource_id": "1015924459236267751", "ml.googleapis.com/trial_id": "", "ml.googleapis.com/trial_type": "", "ml.googleapis.com/job_id/log_area": "root", "compute.googleapis.com/zone": "us-central1-b" }, "logName": "projects/ftc18053-ml/logs/master-replica-0", "receiveTimestamp": "2021-12-24T02:06:27.410491943Z" }

{ "textPayload": "Internal error occurred for the current attempt.", "insertId": "uj8mydd16b1", "resource": { "type": "ml_job", "labels": { "task_name": "service", "project_id": "ftc18053-ml", "job_id": "train_32b464d7df9948a99ee821998ae95158" } }, "timestamp": "2021-12-24T02:06:51.449631524Z", "severity": "ERROR", "labels": { "ml.googleapis.com/endpoint": "" }, "logName": "projects/ftc18053-ml/logs/ml.googleapis.com%2Ftrain_32b464d7df9948a99ee821998ae95158", "receiveTimestamp": "2021-12-24T02:06:52.203683419Z" }

Earlier in the logs I see these errors, but it seems to move past them:
image

lizlooney commented 2 years ago

I'm not sure what the problem is, but you might be able to get around it by running your jobs on GPU instead of TPU. They'll run slower, but it might work better.

To do that, go to https://console.cloud.google.com/firestore/data?project=

  1. In the left column (under Root), look for Configuration.
  2. If Configuration is already there:
    1. Click Configuration
    2. In the middle column, there should be only one document. Click on it.
    3. In the right column, look for the use_tpu field.
    4. If use_tpu is already there:
      1. Click on use_tpu. A dialog should appear.
      2. In the dialog, change Field value to false.
      3. Click UPDATE.
    5. If use_tpu is not already there:
      1. Click Add Field. A dialog should appear.
      2. For Field name, enter "use_tpu".
      3. For Field type, choose boolean.
      4. For Field value, choose false.
      5. Click SAVE FIELD.
  3. If Configuration is not already there:
    1. Click START COLLECTION.
    2. For Collection ID, enter "Configuration".
    3. Leave Document ID blank.
    4. For Field name, enter "use_tpu".
    5. For Field type, choose boolean.
    6. For Field value, choose false.
    7. Click SAVE.
  4. Go to <your url>/admin
  5. Under configuration, make sure it says "Use TPU: False". If it says "Use TPU: True", click Refresh Cache.
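
For anyone who prefers a script to clicking through the console, the steps above can be sketched in Python. This is only a sketch, assuming the google-cloud-firestore client library is installed and that the Configuration collection holds a single document, as described in the steps. The function name `set_use_tpu` is hypothetical and is not part of fmltc.

```python
def set_use_tpu(client, value=False):
    """Set use_tpu on the single Configuration document, creating it if needed.

    `client` is expected to behave like google.cloud.firestore.Client.
    Returns the reference of the document that was written.
    """
    config = client.collection("Configuration")
    # The steps above assume there is at most one document in the collection.
    docs = list(config.limit(1).stream())
    if docs:
        # Configuration already exists: update (or add) only the use_tpu field.
        docs[0].reference.update({"use_tpu": value})
        return docs[0].reference
    # Configuration is not there yet: add() creates a document with an
    # auto-generated ID, like leaving Document ID blank in the console.
    _, reference = config.add({"use_tpu": value})
    return reference


# Example usage (the project ID is a placeholder, not a real project):
#   from google.cloud import firestore
#   set_use_tpu(firestore.Client(project="your-project-id"))
```

Steps 4 and 5 still apply after running something like this: check the admin page and click Refresh Cache if it still says "Use TPU: True".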

I'll investigate the problem with TPU.

lizlooney commented 2 years ago

I can't reproduce the problem on my own GCP project. Is this something that happens all the time on your GCP project? Can you post more of the logs? Is there a stack trace near the "Module raised an exception for failing to TPU node IP" error?
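
If it helps with gathering more of the logs, here is a rough Python sketch that pulls out every entry for one training job, oldest first, so a stack trace near the error is easier to spot. It assumes the entries were exported to a file as a JSON array of objects shaped like the two posted above; the file name and the function names are hypothetical.

```python
import json


def load_entries(path):
    """Load exported log entries from a JSON array file (hypothetical export)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_entries(entries, job_id):
    """Return (timestamp, severity, message) tuples for one job, oldest first."""
    rows = []
    for entry in entries:
        labels = entry.get("resource", {}).get("labels", {})
        if labels.get("job_id") != job_id:
            continue
        # Entries carry either jsonPayload.message or textPayload.
        message = entry.get("jsonPayload", {}).get("message") or entry.get("textPayload", "")
        rows.append((entry.get("timestamp", ""), entry.get("severity", ""), message))
    # Sorting the timestamp strings is only approximate when the number of
    # fractional-second digits differs between entries.
    return sorted(rows, key=lambda row: row[0])


# Example usage (the file name is a placeholder):
#   for row in summarize_entries(load_entries("logs.json"),
#                                "train_32b464d7df9948a99ee821998ae95158"):
#       print(*row)
```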

DaveWhitmer commented 2 years ago

It did not happen when I first tried it; I successfully built a model then.
Other than these errors, all I see is a warning and a deprecation notice. I'll try the GPU and let you know. image image

lizlooney commented 2 years ago

Is your repo up-to-date? On November 25th I merged a pull request that fixed some problems we were having with training jobs failing.

With that pull request, training on GPU uses a Docker image, so the README was changed to add steps for deploying the Docker image to Google's Container Registry.

That pull request also changes the code we use for training on TPU, although it doesn't use a docker image; it uses a tar.gz file.

I suggest you update your repo to the latest and then rerun all of the deploy_... scripts.

DaveWhitmer commented 2 years ago

Thank you. I should have thought to check that. I'll get that done and report back.

DaveWhitmer commented 2 years ago

I can open a new issue if you would like, but after a completely fresh install of the latest release to a new project, I cannot upload videos to the new install. I get the following error: image I posted the log entry for the error below; it appears to be related to max instances. In researching this error, I checked, and no max instances limit is set. Also, the old instance, which runs the previous version, still uploads videos just fine. image

{ "textPayload": "The request was aborted because there was no available instance. Additional troubleshooting documentation can be found at: https://cloud.google.com/functions/docs/troubleshooting#scalability", "insertId": "61c71faf000625be5559db07", "httpRequest": { "requestMethod": "POST", "requestUrl": "https://bdff49c523a46334a80dbf38f8ac5420-dot-t5fce678af9622bd4p-tp.appspot.com/_ah/push-handlers/pubsub/projects/t5fce678af9622bd4p-tp/topics/cloud-functions-gip7atb43krcpn4pebqinfkkuq", "requestSize": "2422", "status": 500, "userAgent": "CloudPubSub-Google", "remoteIp": "2002:a05:6681:321a:b0:b6:249d:9acf", "latency": "0s", "protocol": "HTTP/1.1" }, "resource": { "type": "cloud_function", "labels": { "project_id": "ftc18053-ml-2", "function_name": "perform_action", "region": "us-central1" } }, "timestamp": "2021-12-25T13:42:07.402878Z", "severity": "ERROR", "logName": "projects/ftc18053-ml-2/logs/cloudfunctions.googleapis.com%2Frequests", "trace": "projects/ftc18053-ml-2/traces/97a7e1ab9517f69f7e6c787a7fc0d482", "receiveTimestamp": "2021-12-25T13:42:07.406309207Z" }

DaveWhitmer commented 2 years ago

I'm going to reinstall and try a different region. The old instance was using us-central and the new one was using us-east4. I'll report back.

DaveWhitmer commented 2 years ago

I've made 4 completely fresh attempts and I cannot get another working version. I get this error on the deploy_cloud_function:
image When I inspect the logs, though, there are no errors and it says the build succeeded. image Just to make sure, I still completed the remaining instructions, and I get a 502 Bad Gateway error when accessing the site.

lizlooney commented 2 years ago

Can you check your app engine logs? Usually with a 502, there is a python stack trace logged. That stack trace might also be what's preventing the cloud function from deploying.

Also, can you run git log and git status and make sure you are at commit 53ef4cb3f30f095938ea8d607dccf9dfbe4b73f9 and you don't have any local changes?
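
As a scripted version of that check, here is a minimal Python sketch. It assumes git is on the PATH and uses `git rev-parse HEAD` and `git status --porcelain` in place of reading the `git log` and `git status` output by eye; the helper names are hypothetical.

```python
import subprocess

EXPECTED_COMMIT = "53ef4cb3f30f095938ea8d607dccf9dfbe4b73f9"


def git_output(args, cwd="."):
    """Run a git command in cwd and return its stdout with the trailing newline removed."""
    result = subprocess.run(["git", *args], cwd=cwd, check=True,
                            capture_output=True, text=True)
    return result.stdout.rstrip("\n")


def describe_repo_state(head, porcelain, expected=EXPECTED_COMMIT):
    """Return a list of problems; an empty list means the checkout looks right."""
    problems = []
    if head != expected:
        problems.append(f"HEAD is {head}, expected {expected}")
    if porcelain.strip():
        # --porcelain prints one line per changed or untracked file.
        problems.append("working tree has local changes:\n" + porcelain)
    return problems


def check_checkout(cwd="."):
    """Inspect the repository in cwd and return the list of problems found."""
    head = git_output(["rev-parse", "HEAD"], cwd)
    porcelain = git_output(["status", "--porcelain"], cwd)
    return describe_repo_state(head, porcelain)
```

Calling `check_checkout()` from the root of the fmltc clone returns an empty list when the checkout is at the expected commit with no local changes. Note that `--porcelain` also lists untracked files, which is stricter than "no local changes" strictly requires.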

I just deployed everything (to my existing google cloud project, not a new one) and it all worked fine.

DaveWhitmer commented 2 years ago

Just got it all working. I found the error that was preventing the redeployment, and I'm posting it here for anyone else who may have the issue. Even though I did it right the first time, when I was trying to deploy to a different region I kept changing the editor role to the secrets accessor role instead of adding the secrets accessor role alongside it. After deploying to Oregon instead of Virginia, I can now upload a video. I still can't upload to the Virginia project, so I'm deleting that one.
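
To illustrate the difference between adding a role and changing one, here is a small Python sketch that works on a plain dictionary shaped like an IAM policy (a list of bindings, each with a role and its members). It does not call any Google Cloud API. The member name is a made-up placeholder, and the role IDs `roles/editor` and `roles/secretmanager.secretAccessor` are my assumption for the "editor" and "secrets accessor" roles mentioned above.

```python
import copy


def add_role_member(policy, role, member):
    """Return a copy of policy in which member also holds role.

    Existing bindings are left untouched, so the member keeps every role
    it already had.
    """
    updated = copy.deepcopy(policy)
    bindings = updated.setdefault("bindings", [])
    for binding in bindings:
        if binding["role"] == role:
            if member not in binding["members"]:
                binding["members"].append(member)
            return updated
    bindings.append({"role": role, "members": [member]})
    return updated


# Hypothetical member; substitute the account the README tells you to use.
member = "serviceAccount:example@example-project.iam.gserviceaccount.com"
policy = {"bindings": [{"role": "roles/editor", "members": [member]}]}

fixed = add_role_member(policy, "roles/secretmanager.secretAccessor", member)
roles = sorted(b["role"] for b in fixed["bindings"] if member in b["members"])
print(roles)  # the member now holds both roles instead of only one
```

The mistake described above corresponds to overwriting the role in the existing binding, which leaves the account without the editor role.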

And I have now gotten all the way through producing a model. Thank you so much for the help. This is such a great project. Can't wait to watch the kids with it!

lizlooney commented 2 years ago

Oh, that is excellent!!!