GoogleCloudPlatform / data-science-on-gcp

Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Apache License 2.0

CH10 code to set TensorFlow level is broken now that current TensorFlow has two digits in the second level #168

Closed ryanmark1867 closed 1 year ago

ryanmark1867 commented 1 year ago

First, great book and great examples. Your examples of training in a Vertex AI pipeline are gold. However, there is a problem with this code to set the image names in https://github.com/GoogleCloudPlatform/data-science-on-gcp/blob/edition2/10_mlops/train_on_vertexai.py

    tf_version = '2-' + tf.__version__[2:3]
    train_image = "us-docker.pkg.dev/vertex-ai/training/tf-gpu.{}:latest".format(tf_version)
    deploy_image = "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.{}:latest".format(tf_version)

Now that current TensorFlow has two digits in the second level (e.g. "2.12.0"), this code takes the current TF version and chops off the second digit of the second level, so "2.12.0" becomes "2-1", and you get an ancient TF level in the container.
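To show the failure mode on plain strings (no TensorFlow install needed), the slice keeps only one character of the second level:

```python
# The slice [2:3] grabs a single character after the "2." prefix,
# so any two-digit minor version is silently truncated.
for version in ["2.9.0", "2.12.0"]:
    print(version, '->', '2-' + version[2:3])
# 2.9.0 -> 2-9
# 2.12.0 -> 2-1   (should be 2-12)
```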

lakshmanok commented 1 year ago

Thanks! This should fix it:

    tf_version = '2-' + tf.__version__.split('.')[1]

Could you try, and file a pull-request?
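For what it's worth, checking this split-based parsing against plain version strings suggests it handles both one- and two-digit second levels:

```python
# split('.')[1] returns the whole minor component, however many digits it has
for version in ["2.9.0", "2.12.0"]:
    print(version, '->', '2-' + version.split('.')[1])
# 2.9.0 -> 2-9
# 2.12.0 -> 2-12
```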

ryanmark1867 commented 1 year ago

Thanks very much for getting back so quickly. Unfortunately, this change won't work in cases where the TensorFlow level of the system where you are running the pipeline script is higher than what's available in pre-built images (https://cloud.google.com/vertex-ai/docs/training/pre-built-containers for training and https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers for prediction).

I am running the pipeline script from Cloud Shell which has TensorFlow 2.12.0, which means that when I update my pipeline script to use your recommended fix, as follows:

    import tensorflow as tf

    tf_version_string = tf.__version__
    print("tf.__version__ is: ", tf_version_string)
    tf_version = '2-' + tf_version_string.split('.')[1]
    train_image = "us-docker.pkg.dev/vertex-ai/training/tf-gpu.{}:latest".format(tf_version)
    deploy_image = "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.{}:latest".format(tf_version)
    print("train_image is: ", train_image)
    print("deploy_image: ", deploy_image)

I get the following output:

    tf.__version__ is:  2.12.0
    train_image is:  us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12:latest
    deploy_image:  us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest

And because pre-built images aren't currently available beyond TF 2.11, my pipeline script fails with the following message:

    details = "The image 'us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12:latest' is not supported. Please use an image offered by Vertex AI for python package training."

So, while the fix you recommended should work in general, it won't work for my particular use case, where the TF level on the system running the pipeline script is a bit ahead of the TF levels for which pre-built images exist. I will use hardcoded container names in a config file (https://github.com/ryanmark1867/deep_learning_ml_pipeline/blob/master/pipeline_config.yml) instead. Thanks again for recommending the fix, and thanks especially for your excellent book.
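In case it helps anyone else who hits this, here is a rough sketch of a clamping approach instead of hardcoding. The `max_prebuilt_minor` value is my assumption and would have to be kept in sync with the pre-built container pages linked above; in the real script you would pass `tf.__version__` as the version string:

```python
def vertex_images(tf_version_string, max_prebuilt_minor=11):
    """Build Vertex AI image names from a TF version string, clamping the
    minor version to the newest assumed pre-built container (2.11 here).

    In train_on_vertexai.py you would call this as
    vertex_images(tf.__version__).
    """
    major, minor = (int(p) for p in tf_version_string.split('.')[:2])
    minor = min(minor, max_prebuilt_minor)  # clamp to an available image
    tf_version = '{}-{}'.format(major, minor)
    train_image = "us-docker.pkg.dev/vertex-ai/training/tf-gpu.{}:latest".format(tf_version)
    deploy_image = "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.{}:latest".format(tf_version)
    return train_image, deploy_image

print(vertex_images("2.12.0"))  # clamps 2.12 down to the 2-11 images
```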