GoogleCloudPlatform / cloudml-samples

Cloud ML Engine repo. Please visit the new Vertex AI samples repo at https://github.com/GoogleCloudPlatform/vertex-ai-samples
https://cloud.google.com/ai-platform/docs/
Apache License 2.0

"Serving Text Classification Using PyTorch and AI Platform" notebook doesn't seem to work on Colab #410

Closed bolaft closed 2 years ago

bolaft commented 5 years ago

Describe the bug

The notebook cloudml-samples/notebooks/pytorch/Text Classification Using PyTorch and CMLE.ipynb does not seem to work on Google Colab.

In the following command:

!gcloud alpha ml-engine versions create {VERSION_NAME} --model {MODEL_NAME} \
--origin=gs://{BUCKET}/{MODEL_DIR}/ \
--python-version=3.5 \
--runtime-version={RUNTIME_VERSION} \
--framework='SCIKIT_LEARN' \
--package-uris=gs://{BUCKET}/{PACKAGES_DIR}/my_package-0.1.tar.gz \
--machine-type=mls1-c4-m4 \
--model-class=model_prediction.CustomModelPrediction

It fails because "model-class" is not a recognized argument. I believe it should be "prediction-class".

After changing "model-class" to "prediction-class", the command fails again with the following error message:

ERROR: (gcloud.alpha.ml-engine.versions.create) Bad model detected with error:  "Error loading the model"

What sample is this bug related to?

cloudml-samples/notebooks/pytorch/Text Classification Using PyTorch and CMLE.ipynb

Source code / logs

ERROR: (gcloud.alpha.ml-engine.versions.create) Bad model detected with error:  "Error loading the model"

When adding the --verbosity debug argument:

WARNING: The `gcloud ml-engine` commands have been renamed and will soon be removed. Please use `gcloud ai-platform` instead.
DEBUG: Running [gcloud.alpha.ml-engine.versions.create] with arguments: [--framework: "scikit-learn", --machine-type: "mls1-c4-m4", --model: "torch_text_classification_us", --origin: "gs://b4nlp_bucket/torch_text_classification/models/", --package-uris: "[u'gs://b4nlp_bucket/torch_text_classification/packages/my_package-0.1.tar.gz']", --prediction-class: "model_prediction.CustomModelPrediction", --python-version: "3.5", --runtime-version: "1.13", --verbosity: "debug", VERSION: "v201903"]
DEBUG: (gcloud.alpha.ml-engine.versions.create) Bad model detected with error:  "Error loading the model"
Traceback (most recent call last):
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 985, in Execute
    resources = calliope_command.Run(cli=self, args=args)
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 795, in Run
    resources = command_instance.Run(args)
  File "/tools/google-cloud-sdk/lib/surface/ai_platform/versions/create.py", line 201, in Run
    accelerator_config=accelerator)
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine/versions_util.py", line 113, in Create
    message='Creating version (this might take a few minutes)...')
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine/versions_util.py", line 74, in WaitForOpMaybe
    return operations_client.WaitForOperation(op, message=message).response
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/api_lib/ml_engine/operations.py", line 114, in WaitForOperation
    sleep_ms=5000)
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 264, in WaitFor
    sleep_ms, _StatusUpdate)
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 326, in PollUntilDone
    sleep_ms=sleep_ms)
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 229, in RetryOnResult
    if not should_retry(result, state):
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 320, in _IsNotDone
    return not poller.IsDone(operation)
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 122, in IsDone
    raise OperationError(operation.error.message)
OperationError: Bad model detected with error:  "Error loading the model"
ERROR: (gcloud.alpha.ml-engine.versions.create) Bad model detected with error:  "Error loading the model"

To Reproduce

Steps to reproduce the behavior:

  1. Log in to a Google account with Google Cloud AI Platform and Storage permissions & billing enabled
  2. Open the cloudml-samples/notebooks/pytorch/Text Classification Using PyTorch and CMLE.ipynb notebook
  3. Click on 'Open in Google Colab'
  4. Follow the instructions
  5. See the error at the model version deployment step

Expected behavior

The model version should be correctly deployed.

System Information

ksalama commented 5 years ago

@bolaft - I have updated the notebook. Basically, we need to use Python 2.7 instead of 3.5

Could you please try the updated notebook?
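(Editor's note: one plausible reason the Python version matters here is pickle protocol compatibility: artifacts serialized under Python 3 may not load on a Python 2.7 serving runtime. A minimal stdlib-only sketch of the idea, where the preprocessor object and filename are hypothetical, not from the notebook:)

```python
import os
import pickle
import tempfile

# Hypothetical preprocessor state saved alongside the model. Protocol 2 is
# the highest pickle protocol Python 2.7 can read, so using it keeps the
# artifact loadable on a Python 2.7 serving runtime.
preprocessor_state = {"vocab": ["<pad>", "the", "cat"], "max_sequence_length": 50}

path = os.path.join(tempfile.mkdtemp(), "preprocessor_state.pkl")
with open(path, "wb") as f:
    pickle.dump(preprocessor_state, f, protocol=2)

# Sanity-check that the artifact round-trips.
with open(path, "rb") as f:
    restored = pickle.load(f)
```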

bolaft commented 5 years ago

@ksalama I tried the updated notebook; I still get an error at the same step, but a different one:

DEBUG: Running [gcloud.beta.ai-platform.versions.create] with arguments: [--framework: "scikit-learn", --machine-type: "mls1-c4-m4", --model: "torch_text_classification", --origin: "gs://b4nlp_bucket/torch_text_classification/models/", --package-uris: "[u'gs://b4nlp_bucket/torch_text_classification/packages/my_package-0.1.tar.gz']", --prediction-class: "model_prediction.CustomModelPrediction", --python-version: "2.7", --runtime-version: "1.12", --verbosity: "debug", VERSION: "v201903"]
DEBUG: (gcloud.beta.ai-platform.versions.create) Internal error.
Traceback (most recent call last):
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 985, in Execute
    resources = calliope_command.Run(cli=self, args=args)
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 795, in Run
    resources = command_instance.Run(args)
  File "/tools/google-cloud-sdk/lib/surface/ai_platform/versions/create.py", line 158, in Run
    package_uris=args.package_uris)
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine/versions_util.py", line 113, in Create
    message='Creating version (this might take a few minutes)...')
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine/versions_util.py", line 74, in WaitForOpMaybe
    return operations_client.WaitForOperation(op, message=message).response
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/api_lib/ml_engine/operations.py", line 114, in WaitForOperation
    sleep_ms=5000)
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 264, in WaitFor
    sleep_ms, _StatusUpdate)
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 326, in PollUntilDone
    sleep_ms=sleep_ms)
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 229, in RetryOnResult
    if not should_retry(result, state):
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 320, in _IsNotDone
    return not poller.IsDone(operation)
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 122, in IsDone
    raise OperationError(operation.error.message)
OperationError: Internal error.
ERROR: (gcloud.beta.ai-platform.versions.create) Internal error.
bolaft commented 5 years ago

Any news? I tried it again and consistently get an "internal error". Did the updated notebook work on your end?

nnegrey commented 5 years ago

Hi, sorry about this and the miscommunication.

We are working to resolve this issue and clear things up.

The top-level notebooks folder is intended for samples to be run on AI Platform Notebooks, and we do not guarantee that these samples will work on Colab.

I've added this PR to help clarify this and will be looking at ways to make this clearer across the repo. https://github.com/GoogleCloudPlatform/cloudml-samples/pull/416

@ksalama if you intend for it to work with both, we should talk about opening a top-level colab directory or moving the samples to ML on GCP

ksalama commented 5 years ago

@nnegrey In fact, it is an issue with the Beta feature of Custom Prediction Routines (rather than running it in Colab). I raised an internal bug and the engineering team is looking into it.

StanislawSmyl commented 5 years ago

Hi, any updates on this? I am facing the same issue as OP and cannot serve a PyTorch model with AI Platform.

bolaft commented 5 years ago

I noticed some changes were made in the backend: you can no longer use both the "--prediction-class" and "--framework" options when deploying a version.

Even after removing the "--framework" option from the notebook, the sample still fails, both on AI Platform and on Colab, with an "internal error occurred" message.

gogasca commented 5 years ago

We are investigating this issue internally, will provide an update soon.

mkraice1 commented 5 years ago

Did anyone find a work-around for this?

gilbblig commented 5 years ago

I noticed some changes were made in the backend: you can no longer use both the "--prediction-class" and "--framework" options when deploying a version.

Even after removing the "--framework" option from the notebook, the sample still fails, both on AI Platform and on Colab, with an "internal error occurred" message.

Facing the very same issue here.

nnegrey commented 5 years ago

Hi folks, still no update on why this is happening or what is causing it.

However, you can also jump to this issue here: https://issuetracker.google.com/issues/132823509 (UPDATED to correct link) Add your +1 and similar experience there.

gilbblig commented 5 years ago

I don't have access to this site. Is there any news yet, or maybe someone found a workaround?

Thanks in advance!

mkraice1 commented 5 years ago

Hi folks, still no update on why this is happening or what is causing it.

However, you can also jump to this issue here: https://b.corp.google.com/issues/132823509 Add your +1 and similar experience there.

That site seems blocked, or is for Google employees only.

nnegrey commented 5 years ago

Oops sorry. It should be https://issuetracker.google.com/issues/132823509

From this page: https://cloud.google.com/support/docs/issue-trackers

thedriftofwords commented 5 years ago

I tried running this sample notebook in Colab.

I made some updates to this model deployment command:

!gcloud beta ai-platform versions create {VERSION_NAME} --model {MODEL_NAME} \
--origin=gs://{BUCKET}/{MODEL_DIR}/ \
--python-version=3.5 \
--runtime-version={RUNTIME_VERSION} \
--package-uris=gs://{BUCKET}/{PACKAGES_DIR}/my_package-0.1.tar.gz \
--machine-type=mls1-c4-m2 \
--prediction-class=model_prediction.CustomModelPrediction

After taking a moment to update my VERSION_NAME variable for another attempt, I tried running it without the --machine-type flag, in which case it defaults to using the single-core CPU:

!gcloud beta ai-platform versions create {VERSION_NAME} --model {MODEL_NAME} \
--origin=gs://{BUCKET}/{MODEL_DIR}/ \
--python-version=3.5 \
--runtime-version={RUNTIME_VERSION} \
--package-uris=gs://{BUCKET}/{PACKAGES_DIR}/my_package-0.1.tar.gz \
--prediction-class=model_prediction.CustomModelPrediction

In both cases I got the following error:

ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to have error, please contact Cloud ML.
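(Editor's note: since this error points at model size rather than code, a quick local check of the total artifact size before deploying can save a round trip. A stdlib-only sketch; the directory path is hypothetical, and the exact per-machine-type memory limits are documented by AI Platform:)

```python
import os
import tempfile

def dir_size_mb(path):
    """Return the total size in MB of all files under path."""
    total_bytes = 0
    for root, _, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / (1024.0 * 1024.0)

# Quick self-check with a throwaway 1 MB file; in practice, point this at the
# local copy of the artifacts you are about to upload to gs://{BUCKET}/{MODEL_DIR}/.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "model.bin"), "wb") as f:
    f.write(b"\0" * (1024 * 1024))
size = dir_size_mb(demo_dir)
```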
andrewferlitsch commented 5 years ago

When issue 456 is closed, this issue can be closed as well.

gogasca commented 4 years ago

You need to use compiled packages compatible with Cloud AI Platform (package information here).

This bucket contains compiled packages for PyTorch that are compatible with Cloud AI Platform prediction. The files are mirrored from the official builds at https://download.pytorch.org/whl/cpu/torch_stable.html

In order to deploy a PyTorch model on Cloud AI Platform Online Predictions, you must add one of these packages to the packageUris field of the version you deploy. Pick the package matching your Python and PyTorch version. The package names follow this template:

Package name = torch-{TORCH_VERSION_NUMBER}-{PYTHON_VERSION}-linux_x86_64.whl, where PYTHON_VERSION is:

  - cp35-cp35m for Python 3 with runtime versions < 1.15
  - cp37-cp37m for Python 3 with runtime versions >= 1.15
  - cp27-cp27mu for Python 2
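(Editor's note: the naming rule above can be captured in a few lines. A sketch only; the function name is ours, not part of any SDK, and it assumes runtime versions parse as dotted integers:)

```python
def torch_wheel_name(torch_version, runtime_version, python_major=3):
    """Build the wheel filename per the template above.

    torch_version e.g. "1.1.0" or "1.3.1+cpu"; runtime_version e.g. "1.13".
    """
    runtime = tuple(int(part) for part in runtime_version.split("."))
    if python_major == 2:
        python_tag = "cp27-cp27mu"
    elif runtime < (1, 15):
        python_tag = "cp35-cp35m"
    else:
        python_tag = "cp37-cp37m"
    return "torch-{}-{}-linux_x86_64.whl".format(torch_version, python_tag)
```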

For example, if I were to deploy a PyTorch model based on PyTorch 1.1.0 and Python 3, my gcloud command would look like:

  1. Remove torch from setup.py.
  2. Include the torch package when creating your model version:

    gcloud beta ai-platform versions create {VERSION_NAME} --model {MODEL_NAME} \
    ...
    --package-uris=gs://{MY_PACKAGE_BUCKET}/my_package-0.1.tar.gz,gs://cloud-ai-pytorch/torch-1.1.0-cp35-cp35m-linux_x86_64.whl

A complete example:

    !gcloud beta ai-platform versions create {VERSION_NAME} --model {MODEL_NAME} \
    --origin=gs://{BUCKET}/{MODEL_DIR}/ \
    --python-version=3.7 \
    --runtime-version={RUNTIME_VERSION} \
    --package-uris=gs://{BUCKET}/{PACKAGES_DIR}/text_classification-0.1.tar.gz,gs://cloud-ai-pytorch/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl \
    --machine-type=mls1-c4-m4 \
    --prediction-class=model_prediction.CustomModelPrediction
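(Editor's note: step 1 above, in concrete terms, means the package you upload should not list torch as a dependency, since the compatible pre-built wheel is supplied separately via --package-uris. A hypothetical minimal setup.py sketch; the names echo the notebook's my_package and model_prediction module but should be treated as assumptions:)

```python
# setup.py -- hypothetical sketch; torch is deliberately absent from
# install_requires because the compatible wheel is passed to
# --package-uris alongside this package.
from setuptools import setup

setup(
    name="my_package",
    version="0.1",
    py_modules=["model_prediction"],  # module providing CustomModelPrediction
    install_requires=[],              # no "torch" here
)
```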
gogasca commented 4 years ago

@bolaft we updated the notebooks with changes, PTAL

andrewferlitsch commented 4 years ago

@bolaft - please review change

kweinmeister commented 2 years ago

This issue may no longer be relevant due to its age. Please feel free to re-open.