GoogleCloudPlatform / data-science-on-gcp

Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Apache License 2.0

Ch10 - can't run model.py - error with hypertune #166

Open jgammerman opened 1 year ago

jgammerman commented 1 year ago

Hello,

So on p.338 of the book it says:

[screenshot of the command from p. 338 of the book]

But when I run this I get the following error:

Traceback (most recent call last):
  File "/home/jgammerman/data-science-on-gcp/10_mlops/model.py", line 331, in <module>
    train_and_evaluate(TRAIN_DATA_PATTERN, EVAL_DATA_PATTERN, TEST_DATA_PATTERN, OUTPUT_MODEL_DIR, OUTPUT_DIR)
  File "/home/jgammerman/data-science-on-gcp/10_mlops/model.py", line 180, in train_and_evaluate
    hpt = hypertune.HyperTune()
AttributeError: module 'hypertune' has no attribute 'HyperTune'

I have pip installed hypertune on my VM so I know it's there:

jgammerman@cloudshell:~/data-science-on-gcp/.......$ pip show hypertune
Name: hypertune
Version: 1.0.3
Summary: A library for performing hyperparameter optimization with Polyaxon.
Home-page: https://github.com/polyaxon/hypertune
Author: Polyaxon, Inc.
Author-email: contact@polyaxon.com
License: Apache 2.0
Location: /home/jgammerman/.local/lib/python3.9/site-packages
lakshmanok commented 1 year ago

Make sure you are pip-installing into the right Python installation:

python3 -m pip install hypertune and then python3 -m pip show hypertune
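
As a quick sanity check (a rough sketch; run it with the same interpreter that runs model.py), you can confirm which hypertune module Python is actually importing:

import hypertune

print(hypertune.__file__)               # where the module was loaded from
print(hasattr(hypertune, 'HyperTune'))  # should be True for the package model.py expects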

thanks Lak


jgammerman commented 1 year ago

Tried that - same error as before

lakshmanok commented 1 year ago

Ah, wrong hypertune! Please uninstall that one; this is the one you need to pip install:

https://pypi.org/project/cloudml-hypertune/
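
For reference, cloudml-hypertune also installs a module named hypertune, but it is the one with the HyperTune class that model.py expects. A minimal sketch of how it is typically used (the metric tag and values below are illustrative, not the exact code from model.py):

# after: python3 -m pip install cloudml-hypertune
import hypertune

hpt = hypertune.HyperTune()

# Report an evaluation metric back to the Vertex AI hyperparameter tuning service.
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='val_rmse',  # must match the metric named in the tuning job
    metric_value=0.123,
    global_step=1,
)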

thanks Lak


jgammerman commented 1 year ago

That made it work, thanks Lak.

(By the way, for anyone else at this stage, you might need to run export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python to get it to work.)
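
If you would rather set it from Python than the shell, a small sketch (assumption: the variable has to be set before protobuf or TensorFlow is imported):

import os

# Force the pure-Python protobuf implementation; must happen before any
# module that imports protobuf (e.g. tensorflow) is loaded.
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

import tensorflow as tf  # imported only after the environment variable is set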

I then tried running the pipeline train_on_vertexai.py and it spent about 10 mins training before failing due to a memory error:

RuntimeError: Training failed with: code: 8 message: "The following quota metrics exceed quota limits: aiplatform.googleapis.com/custom_model_training_nvidia_t4_gpus"

I'm currently running the same pipeline using AutoML and it's been training for 2 hours so far - should it take that long?

lakshmanok commented 1 year ago

(1) Could you add the "export" instruction to the README instructions on GitHub?

https://github.com/GoogleCloudPlatform/data-science-on-gcp/tree/main/10_mlops/README.md

(2) I believe that's a quota error you are getting. You don't have quota for one Nvidia T4 GPU, so you may need to request it. If you have a different GPU available, change this line in train_on_vertexai.py appropriately (see the sketch after this list): accelerator_type=aip.AcceleratorType.NVIDIA_TESLA_T4.name, accelerator_count=1,

(3) It should finish in a little over 2 hours: budget_milli_node_hours=(300 if develop_mode else 2000),
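
To illustrate point (2), here is a hypothetical Vertex AI custom-training call (not the exact code in train_on_vertexai.py) showing where the accelerator settings that drive the T4 quota live:

from google.cloud import aiplatform

# Hypothetical job definition; the real one is in train_on_vertexai.py.
job = aiplatform.CustomTrainingJob(
    display_name='flights-model',
    script_path='model.py',
    container_uri='us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-8:latest',  # illustrative
)

# Each replica with accelerator_count=1 consumes one unit of the
# custom_model_training_nvidia_t4_gpus quota. Dropping the two accelerator
# arguments runs the job CPU-only; changing accelerator_type switches GPUs.
job.run(
    machine_type='n1-standard-4',
    accelerator_type='NVIDIA_TESLA_T4',
    accelerator_count=1,
)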


jgammerman commented 1 year ago

1) Done, submitted a pull request

2) I do have an NVIDIA T4 GPU attached but it's still failing with the same error:

[screenshot: the same quota error from the training job]

3) AutoML pipeline ended up completing after 3 hours 40 mins.

jgammerman commented 1 year ago

I'm getting the same error when trying to run hyperparameter tuning:

google.api_core.exceptions.ResourceExhausted: 429 The following quota metrics exceed quota limits: aiplatform.googleapis.com/custom_model_training_nvidia_t4_gpus

lakshmanok commented 1 year ago

Note the name of the quota that you need increased: aiplatform.googleapis.com/custom_model_training_nvidia_t4_gpus. Please look in console.cloud.google.com/quotas for a quota with this name.

thanks Lak


lakshmanok commented 1 year ago

Also, remember that if your quota is 1 GPU and you already have a VM created with that GPU, you have exhausted the quota, and the Vertex AI managed service won't have a GPU available. Similarly, if you are doing hyperparameter tuning with 4 parallel workers, you need 4 GPUs in your quota.
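
The arithmetic, roughly (a sketch with illustrative numbers):

# GPUs needed at any one time = parallel trials x GPUs per trial, on top of any
# GPUs already consumed by VMs or notebooks in the same project and region.
parallel_trial_count = 4        # trials running concurrently
accelerators_per_trial = 1      # T4s attached to each trial's worker

gpus_needed = parallel_trial_count * accelerators_per_trial
print(f'custom_model_training_nvidia_t4_gpus quota needed: {gpus_needed}')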


jgammerman commented 1 year ago

When I navigate to console.cloud.google.com/quotas it says that I need to upgrade to a paid account:

[screenshot: prompt to upgrade to a paid account]

I'm still using a managed notebook (I tried creating a user-managed one a few days ago but it didn't work; something about not enough GPUs currently being available. I put it down to bad timing and decided to try again later). I guess that's the root of the problem. Will try again.

jgammerman commented 1 year ago

So I still can't create a user-managed notebook with a GPU:

[screenshot: GPU availability error when creating the notebook]

I have tried the us-west, us-east, and us-central regions. Sometimes I also get this error:

[screenshot: additional error message]