FIRST-Tech-Challenge / fmltc

FIRST Machine Learning Toolchain

Build Failed: flask 2.2.5 has requirement Werkzeug>=2.2.2, but you have werkzeug 2.0.1. #316

Closed acharraggi closed 9 months ago

acharraggi commented 9 months ago

I'm trying to set up a server for the Centerstage season. I did this last year successfully, but deleted the project when I was done.

I got to the "Deploy everything" step and was running scripts/deploy_cloud_function.sh when I hit a fatal error. I don't think I can work around it: neither Flask nor Werkzeug is installed locally, so I'm assuming they're internal to the project.

It looks like a new version of Flask was pulled in without a corresponding change to Werkzeug. Here's the command I ran and its output. I didn't get the warnings last time, but they don't look fatal.

I checked the detailed build log, and it looks like Werkzeug was initially installed with a good version, but that was then uninstalled and replaced with an older version. See the attached CSV log file.

mike@mike-XPS-15-7590:~/fmltc$ source env_setup.sh scripts/deploy_cloud_function.sh
FMLTC_GCLOUD_PROJECT_ID is ftcml-399316
FMLTC_CLOSURE_COMPILER_JAR is ../closure-compiler/closure-compiler-v20200406.jar
FMLTC_CLOSURE_LIBRARY_FOLDER is ../closure-library/closure-library-20200406
~/fmltc/server ~/fmltc
WARNING: Effective May 15, 2023, Container Registry (used by default by Cloud Functions 1st gen for storing build artifacts) is deprecated: https://cloud.google.com/artifact-registry/docs/transition/transition-from-gcr. Artifact Registry is the recommended successor that you can use by adding the '--docker_registry=artifact-registry' flag.

WARNING: Secuirty check for Container Registry repository that stores this function's image has not succeeded. To mitigate risks of disclosing sensitive data, it is recommended to keep your repositories private. This setting can be verified in Google Container Registry.

Deploying function (may take a while - up to 2 minutes)...⠹
For Cloud Build Logs, visit: https://console.cloud.google.com/cloud-build/builds;region=us-central1/52481132-ecc8-495a-aab9-ca481293589e?project=663946592916
Deploying function (may take a while - up to 2 minutes)...failed.
ERROR: (gcloud.functions.deploy) OperationError: code=3, message=Build failed: found incompatible dependencies: "flask 2.2.5 has requirement Werkzeug>=2.2.2, but you have werkzeug 2.0.1."; Error ID: 78df1cfa
~/fmltc

file:///home/mike/Downloads/downloaded-logs-20230917-123024.csv
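
The incompatibility named in the build error can be checked by hand with a minimal sketch like the one below. It compares the two version numbers from the error message using plain tuple comparison. The helper names `parse_version` and `satisfies_minimum` are hypothetical, and a real resolver such as pip understands far more specifier syntax than this.

```python
def parse_version(text):
    """Turn a dotted version such as '2.0.1' into a tuple of ints: (2, 0, 1)."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(installed, minimum):
    """Return True when `installed` is at least `minimum` (a '>=' requirement)."""
    return parse_version(installed) >= parse_version(minimum)


# Values taken from the build error: flask 2.2.5 requires Werkzeug>=2.2.2,
# but the build ended up with werkzeug 2.0.1.
print(satisfies_minimum("2.0.1", "2.2.2"))  # False: the installed Werkzeug is too old
print(satisfies_minimum("2.2.2", "2.2.2"))  # True: the lowest version Flask 2.2.5 accepts
```

Tuple comparison is used instead of string comparison so that, for example, 2.10.0 sorts after 2.9.0.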

acharraggi commented 9 months ago

downloaded-logs-20230917-123024.csv

cmacfarl commented 9 months ago

This...

https://github.com/FIRST-Tech-Challenge/fmltc/blob/main/server/app_engine/requirements.txt

...specifies flask 2.0.2. So I'm wondering where it's thinking it needs to pull 2.2.5 from.

acharraggi commented 9 months ago

No idea, but I see flask 2.2.5 being requested on line 945 of the log.

There's a different server/requirements.txt file that mentions Werkzeug but not Flask. Maybe if it did, it would downgrade Flask before the final step 2 build. https://github.com/FIRST-Tech-Challenge/fmltc/blob/main/server/requirements.txt

acharraggi commented 9 months ago

Hey, that may have worked. At least the build finished for deploy_cloud_function.sh. I'll keep going and see how far I get.

From the log: Step #2 - "build": Uninstalling Flask-2.2.5: 2023-09-17 16:52:34.899 PDT Step #2 - "build": Successfully uninstalled Flask-2.2.5 2023-09-17 16:52:49.349 PDT Step #2 - "build": Successfully installed Cython-3.0.2 Flask-2.0.2 ...

cmacfarl commented 9 months ago

I'm glad that was enough of a clue to trigger a possible solution for you.

The cloud functions definitely use Python, and the Werkzeug dependency likely induces a dependency on Flask. It could be that the default Flask version was bumped, so that if it's not specified explicitly, the build chooses a Werkzeug-incompatible version. If you want to create a PR for this change, I'll go ahead and merge. In any case, Flask is required for the cloud functions, and relying on some other software to choose the Flask version appears to have broken.
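
The reasoning above (a package that is required but not pinned lets the build pick its own version) can be turned into a quick check. This is only a sketch: the sample file contents are hypothetical and are not the actual contents of server/requirements.txt, the helper names are made up for illustration, and the regex handles only simple `name==version` lines.

```python
import re

# Matches lines such as "Flask==2.0.2": a package name pinned to one exact version.
PINNED = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*\S+")


def pinned_packages(requirements_text):
    """Return the lower-cased names of packages pinned with '=='."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = PINNED.match(line)
        if match:
            names.add(match.group(1).lower())
    return names


def missing_pins(requirements_text, required):
    """List the names in `required` that have no exact pin in the file."""
    pinned = pinned_packages(requirements_text)
    return [name for name in required if name.lower() not in pinned]


# Hypothetical file contents, for illustration only.
sample = "Werkzeug==2.0.1\n"
print(missing_pins(sample, ["Flask", "Werkzeug"]))                     # ['Flask']
print(missing_pins(sample + "Flask==2.0.2\n", ["Flask", "Werkzeug"]))  # []
```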

acharraggi commented 9 months ago

I'll do the pull request if I get things working, but there might be other things to fix.

I had another issue where the videos failed to upload. I had to rebuild (wrong Origin in the env_variables.yaml file). I was able to upload videos, label them and create a dataset. So far so good.

But I was unable to start the training; it just said "Failed!".
The gcloud app log shows that it was a quota error:

I was not able to even see a quota until I enabled the Vertex AI API (new as of May 2023), but when I list the quotas for custom_model_training_nvidia_k80_gpus, the quota is zero in all regions. I did see mention of K80 end of life, but not until May 2024. In theory, I could request a quota increase, but since it is zero everywhere, it might be an EOL issue. Do we need to switch to a Vertex AI API processor?

Is this some underlying change to Google services? Do I need to enable something else? Do we need a deployment config change?

Anyone else got this working lately?
Thanks.

cmacfarl commented 9 months ago

Which zone are you hosting in?

cmacfarl commented 9 months ago

In addition to the above, do you have the Compute Engine API enabled? ftc-ml does not use Vertex AI, so it's unclear why that would be required.

If you go to https://console.cloud.google.com/apis/api/compute.googleapis.com/quotas and filter for k80, do you see that the API is enabled and that you have quota?

acharraggi commented 9 months ago

My zone is us-west1.

Compute Engine API is enabled: Compute Engine API Creates and runs virtual machines on Google Cloud Platform. ... Status Enabled

I used that link to go to my quotas. No training model quotas, but here are the K80 quotas I could see

But I'm not sure any of these apply. My error message says my quota is zero.

What API or quota am I looking for?

acharraggi commented 9 months ago

here's the log message:

2023-09-19 00:40:28 default[v1] CRITICAL:root:model_trainer.start_training_model - creating eval job - except Traceback (most recent call last):
  File "/workspace/model_trainer.py", line 308, in start_training_model
    eval_job_response = ml.projects().jobs().create(parent=parent, body=eval_job).execute()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/googleapiclient/http.py", line 907, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 429 when requesting https://ml.googleapis.com/v1/projects/ftcml-399316/jobs?alt=json returned "Quota failure for project ftcml-399316. The request for 1 K80 accelerators exceeds the allowed maximum of 0 A100, 0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V2_POD, 0 TPU_V3, 0 TPU_V3_POD, 0 V100 accelerators. To read more about Cloud ML Engine quota, see https://cloud.google.com/ml-engine/quotas.". Details: "[{'@type': 'type.googleapis.com/google.rpc.QuotaFailure', 'violations': [{'subject': 'ftcml-399316', 'description': 'The request for 1 K80 accelerators exceeds the allowed maximum of 0 A100, 0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V2_POD, 0 TPU_V3, 0 TPU_V3_POD, 0 V100 accelerators.'}]}]">

2023-09-19 00:40:28 default[v1] CRITICAL:root:capture_exception traceback: Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/workspace/wrappers.py", line 85, in wrapper
    return func(*args, **kwargs)
  File "/workspace/wrappers.py", line 45, in wrapper
    return func(*args, **kwargs)
  File "/workspace/app_engine.py", line 1294, in start_training_model
    model_entity = model_trainer.start_training_model(team_uuid, description, dataset_uuids,
  File "/workspace/model_trainer.py", line 308, in start_training_model
    eval_job_response = ml.projects().jobs().create(parent=parent, body=eval_job).execute()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/googleapiclient/http.py", line 907, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 429 when requesting https://ml.googleapis.com/v1/projects/ftcml-399316/jobs?alt=json returned "Quota failure for project ftcml-399316. The request for 1 K80 accelerators exceeds the allowed maximum of 0 A100, 0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V2_POD, 0 TPU_V3, 0 TPU_V3_POD, 0 V100 accelerators. To read more about Cloud ML Engine quota, see https://cloud.google.com/ml-engine/quotas.". Details: "[{'@type': 'type.googleapis.com/google.rpc.QuotaFailure', 'violations': [{'subject': 'ftcml-399316', 'description': 'The request for 1 K80 accelerators exceeds the allowed maximum of 0 A100, 0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V2_POD, 0 TPU_V3, 0 TPU_V3_POD, 0 V100 accelerators.'}]}]">
...
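
The quota message in the log lists an allowed maximum for every accelerator type, which makes it easy to see that the limit is zero across the board and not just for the K80. The sketch below pulls those limits out of the `description` string copied from the log. The function name `parse_accelerator_limits` is hypothetical, and the regex assumes the message keeps the "allowed maximum of N NAME, ..." wording seen here.

```python
import re


def parse_accelerator_limits(description):
    """Extract {accelerator: allowed maximum} from a quota-failure description."""
    # Only look at the part after "allowed maximum of" so the requested
    # amount ("1 K80") is not mistaken for a limit.
    _, _, limits_part = description.partition("allowed maximum of")
    pairs = re.findall(r"(\d+) ([A-Z0-9_]+)", limits_part)
    return {name: int(count) for count, name in pairs}


# Description text copied from the QuotaFailure violation in the log.
description = (
    "The request for 1 K80 accelerators exceeds the allowed maximum of "
    "0 A100, 0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V2_POD, 0 TPU_V3, "
    "0 TPU_V3_POD, 0 V100 accelerators."
)
limits = parse_accelerator_limits(description)
print(limits["K80"])                                  # 0
print(all(limit == 0 for limit in limits.values()))  # True
```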

cmacfarl commented 9 months ago

Which us-west1 region? There are three, and only one, us-west1-b, supports the K80.

See the table here: https://cloud.google.com/compute/docs/gpus/gpu-regions-zones

acharraggi commented 9 months ago

us-west1 IS a region; it has three zones. us-west1-b is a zone in that region. As far as I can tell, App Engine is not zone-specific; it just says Region: us-west1.
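
The region/zone naming described above can be illustrated with a small sketch. It assumes that a zone name is a region name followed by a hyphen and a single letter, as in us-west1-b; the helper name `region_of` is hypothetical.

```python
def region_of(zone):
    """Return the region part of a zone name, e.g. 'us-west1-b' -> 'us-west1'."""
    region, _, suffix = zone.rpartition("-")
    # A zone ends in a one-letter suffix; anything else is treated as not a zone.
    if not region or len(suffix) != 1 or not suffix.isalpha():
        raise ValueError(f"not a zone name: {zone!r}")
    return region


print(region_of("us-west1-b"))  # us-west1
```

Passing a bare region such as us-west1 raises `ValueError`, which mirrors the point being made: us-west1 on its own does not identify a zone.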

I tried creating a new project this morning, and it wouldn't let me set us-west1-b. Google only lists regions as options when creating an App Engine.

Can I tell the machine learning stuff to run in a particular zone?

Do you have it working in a particular region? I can try using that region instead.

acharraggi commented 9 months ago

I've created the pull request.

I'm still not able to make this project run properly. I've rebuilt in us-east1, and it still fails with no K80s available.