FIRST-Tech-Challenge / fmltc

FIRST Machine Learning Toolchain

FAILED job help #203

Closed DaveWhitmer closed 2 years ago

DaveWhitmer commented 2 years ago

I was asked to come to the repo for help rather than use the forum. The forum response pointed me to where the jobs are listed and suggested that the most likely cause was running out of memory due to high video resolution. The uploaded video was 1280x720.

Examining the logs, I don't see any out-of-memory (OOM) indications. Instead, I see multiple references to an integer error on the value 256.0. I've included screenshots of the logs filtered by error. The dataset files I downloaded aren't in a format I can search for 256.0, so I'm not sure how to troubleshoot this from here.

[Two screenshots of the logs, filtered by error]

lizlooney commented 2 years ago

It appears to have failed while parsing the pipeline.config file. You can find that file in Cloud Storage at https://console.cloud.google.com/storage/browser/-blobs//models//pipeline.config (You'll need to replace the project id, team_uuid, and model_uuid.)

I suspect it might be the batch_size. Can you let me know if the value of batch_size is 256.0? If so, it's definitely a bug and I can fix it ASAP.
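If it helps, here is a minimal sketch of one way to check for that once pipeline.config has been downloaded. It assumes the file is plain text with entries of the form `batch_size: <value>`. The helper name `find_float_batch_sizes` and the sample text are hypothetical, made up for illustration, and are not part of fmltc.

```python
import re

# Hypothetical helper (not part of fmltc): report batch_size entries that
# were written as floats (e.g. 256.0) instead of integers (e.g. 256).
FLOAT_BATCH_SIZE = re.compile(r"\s*batch_size\s*:\s*(\d+\.\d*)\s*$")


def find_float_batch_sizes(config_text):
    """Return (line_number, value) pairs for float-valued batch_size lines."""
    hits = []
    for line_number, line in enumerate(config_text.splitlines(), start=1):
        match = FLOAT_BATCH_SIZE.match(line)
        if match:
            hits.append((line_number, match.group(1)))
    return hits


# Illustrative sample only; a real pipeline.config has many more fields.
sample = """train_config {
  batch_size: 256.0
}
eval_config {
  batch_size: 1
}
"""

print(find_float_batch_sizes(sample))  # [(2, '256.0')]
```

To run it against a real file, read the downloaded pipeline.config into a string and pass that string in place of `sample`. An empty list would suggest that batch_size is not the field carrying the float value.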

lizlooney commented 2 years ago

Yep. I can reproduce the issue.

lizlooney commented 2 years ago

Thank you very much for reporting this issue.