deep-diver / Model-Training-as-a-CI-CD-System

Demonstration of the Model Training as a CI/CD System in Vertex AI

Creating pipeline #4

Closed hugoferrero closed 2 years ago

hugoferrero commented 2 years ago

Hi, I created a project in Vertex AI using the Taxi template.

$ tfx pipeline create \
      --pipeline-path=kubeflow_v2_runner.py \
      --engine=vertex \
      --build-image

...then I got this error:

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.41/distribution/gcr.io/teco-prod-adam-dev-826c/tfx-pipeline/json

Any clue about that? Thanks in advance.

deep-diver commented 2 years ago

@hugoferrero

Please share configs.py. I guess something is configured wrong.

hugoferrero commented 2 years ago

Hi, here it goes. Another question: where do you set the hardware for training in Vertex pipelines?

# Copyright 2020 Google LLC. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""TFX taxi template configurations.

This file defines environments for a TFX taxi pipeline.
"""

import os  # pylint: disable=unused-import

# TODO(b/149347293): Move more TFX CLI flags into python configuration.

# Pipeline name will be used to identify this pipeline.
PIPELINE_NAME = 'tfx-pipeline'

# GCP related configs.

# The following code will retrieve your GCP project. You can choose which
# project to use by setting the GOOGLE_CLOUD_PROJECT environment variable.
try:
  import google.auth  # pylint: disable=g-import-not-at-top  # pytype: disable=import-error
  try:
    _, GOOGLE_CLOUD_PROJECT = google.auth.default()
  except google.auth.exceptions.DefaultCredentialsError:
    GOOGLE_CLOUD_PROJECT = 'teco-prod-adam-dev-826c'
except ImportError:
  GOOGLE_CLOUD_PROJECT = 'teco-prod-adam-dev-826c'

# Specify your GCS bucket name here. You have to use GCS to store output files
# when running a pipeline with Kubeflow Pipeline on GCP or when running a job
# using Dataflow. Default is '<gcp_project_name>-kubeflowpipelines-default'.
# This bucket is created automatically when you deploy KFP from marketplace.
GCS_BUCKET_NAME = 'hf-exp/vpoc/taxi'
GCS_OUTPUTS = 'hf-exp/vpoc/taxi/outputs'
# TODO(step 8,step 9): (Optional) Set your region to use GCP services including
#                      BigQuery, Dataflow and Cloud AI Platform.
GOOGLE_CLOUD_REGION = 'us-east4'  # ex) 'us-central1'

# The following image will be used to run pipeline components if Kubeflow
# Pipelines is used.
# This image will be built automatically by the CLI if we use the --build-image flag.
PIPELINE_IMAGE = f'gcr.io/{GOOGLE_CLOUD_PROJECT}/{PIPELINE_NAME}'

PREPROCESSING_FN = 'models.preprocessing.preprocessing_fn'
RUN_FN = 'models.keras_model.model.run_fn'
# NOTE: Uncomment below to use an estimator based model.
# RUN_FN = 'models.estimator_model.model.run_fn'

TRAIN_NUM_STEPS = 1000
EVAL_NUM_STEPS = 150

# Change this value according to your use cases.
EVAL_ACCURACY_THRESHOLD = 0.6

# Beam args to use BigQueryExampleGen with Beam DirectRunner.
# TODO(step 7): (Optional) Uncomment here to provide GCP related configs for
#               BigQuery.
# BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS = [
#    '--project=' + GOOGLE_CLOUD_PROJECT,
#    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
#    ]

# The rate at which to sample rows from the Chicago Taxi dataset using BigQuery.
# The full taxi dataset is > 120M records.  In the interest of resource
# savings and time, we've set the default for this example to be much smaller.
# Feel free to crank it up and process the full dataset!
_query_sample_rate = 0.0001  # Generate a 0.01% random sample.

# The query that extracts the examples from BigQuery.  The Chicago Taxi dataset
# used for this example is a public dataset available on Google AI Platform.
# https://console.cloud.google.com/marketplace/details/city-of-chicago-public-data/chicago-taxi-trips
# TODO(step 7): (Optional) Uncomment here to use BigQuery.
# BIG_QUERY_QUERY = """
#         SELECT
#           pickup_community_area,
#           fare,
#           EXTRACT(MONTH FROM trip_start_timestamp) AS trip_start_month,
#           EXTRACT(HOUR FROM trip_start_timestamp) AS trip_start_hour,
#           EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS trip_start_day,
#           UNIX_SECONDS(trip_start_timestamp) AS trip_start_timestamp,
#           pickup_latitude,
#           pickup_longitude,
#           dropoff_latitude,
#           dropoff_longitude,
#           trip_miles,
#           pickup_census_tract,
#           dropoff_census_tract,
#           payment_type,
#           company,
#           trip_seconds,
#           dropoff_community_area,
#           tips,
#           IF(tips > fare * 0.2, 1, 0) AS big_tipper
#         FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
#         WHERE (ABS(FARM_FINGERPRINT(unique_key)) / 0x7FFFFFFFFFFFFFFF)
#           < {query_sample_rate}""".format(
#    query_sample_rate=_query_sample_rate)

# Beam args to run data processing on DataflowRunner.
#
# TODO(b/151114974): Remove `disk_size_gb` flag after default is increased.
# TODO(b/156874687): Remove `machine_type` after IP addresses are no longer a
#                    scaling bottleneck.
# TODO(b/171733562): Remove `use_runner_v2` once it is the default for Dataflow.
# TODO(step 8): (Optional) Uncomment below to use Dataflow.
# DATAFLOW_BEAM_PIPELINE_ARGS = [
#    '--project=' + GOOGLE_CLOUD_PROJECT,
#    '--runner=DataflowRunner',
#    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
#    '--region=' + GOOGLE_CLOUD_REGION,
#
#    # Temporary overrides of defaults.
#    '--disk_size_gb=50',
#    '--machine_type=e2-standard-8',
#    '--experiments=use_runner_v2',
# ]

# A dict which contains the training job parameters to be passed to Google
# Cloud AI Platform. For the full set of parameters supported by Google Cloud AI
# Platform, refer to
# https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#Job
# TODO(step 9): (Optional) Uncomment below to use AI Platform training.
# GCP_AI_PLATFORM_TRAINING_ARGS = {
#     'project': GOOGLE_CLOUD_PROJECT,
#     'region': GOOGLE_CLOUD_REGION,
#     # Starting from TFX 0.14, training on AI Platform uses custom containers:
#     # https://cloud.google.com/ml-engine/docs/containers-overview
#     # You can specify a custom container here. If not specified, TFX will use
#     # a public container image matching the installed version of TFX.
#     # TODO(step 9): (Optional) Set your container name below.
#     'masterConfig': {
#       'imageUri': PIPELINE_IMAGE
#     },
#     # Note that if you do specify a custom container, ensure the entrypoint
#     # calls into TFX's run_executor script (tfx/scripts/run_executor.py)
# }

# A dict which contains the serving job parameters to be passed to Google
# Cloud AI Platform. For the full set of parameters supported by Google Cloud AI
# Platform, refer to
# https://cloud.google.com/ml-engine/reference/rest/v1/projects.models
# TODO(step 9): (Optional) Uncomment below to use AI Platform serving.
# GCP_AI_PLATFORM_SERVING_ARGS = {
#     'model_name': PIPELINE_NAME.replace('-','_'),  # '-' is not allowed.
#     'project_id': GOOGLE_CLOUD_PROJECT,
#     # The region to use when serving the model. See available regions here:
#     # https://cloud.google.com/ml-engine/docs/regions
#     # Note that serving currently only supports a single region:
#     # https://cloud.google.com/ml-engine/reference/rest/v1/projects.models#Model  # pylint: disable=line-too-long
#     'regions': [GOOGLE_CLOUD_REGION],
# }

deep-diver commented 2 years ago

@hugoferrero

Another question: where do you set the hardware for training in Vertex pipelines?

I am not sure what you mean by this. Do you mean the GCP zone?

Seems like configs.py is OK. Your error is probably caused by the Docker daemon on your local machine. Please check that Docker is available and running in your local setup.

hugoferrero commented 2 years ago

I checked Docker and the daemon is running (I'm using Google Cloud Shell as my local machine). Let me describe the issue in more detail:

(base) heferrero@cloudshell:~/tfx-pipelines (teco-prod-adam-dev-826c)$ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

(base) heferrero@cloudshell:~/tfx-pipelines (teco-prod-adam-dev-826c)$ tfx pipeline create --pipeline-path=kubeflow_v2_runner.py --engine=vertex --build-image
2022-05-19 15:37:28.681665: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-05-19 15:37:28.681712: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
CLI
Creating pipeline
INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Excluding no splits because exclude_splits is not set.
[Docker] Step 1/4 : FROM tensorflow/tfx:1.7.1
[Docker] The push refers to repository [gcr.io/teco-prod-adam-dev-826c/tfx-pipeline]
Traceback (most recent call last):
  File "/home/heferrero/miniconda3/lib/python3.8/site-packages/docker/api/client.py", line 268, in _raise_for_status
    response.raise_for_status()
  File "/home/heferrero/miniconda3/lib/python3.8/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.41/distribution/gcr.io/teco-prod-adam-dev-826c/tfx-pipeline/json

@hugoferrero

Another question: where do you set the hardware for training in Vertex pipelines?

I am not sure what you mean by this. Do you mean the GCP zone?

Sorry, I did not express myself correctly. I read this article: https://cloud.google.com/blog/topics/developers-practitioners/model-training-cicd-system-part-i. In the "Cost" section you said "For this project, we chose n1-standard-4 machine type whose price is $0.19 per hour and NVIDIA_TESLA_K80 accelerator type whose price is $0.45 per hour." That's what I meant to ask: in which part of the code did you choose the hardware for the training?

deep-diver commented 2 years ago

Please check this out https://github.com/deep-diver/Model-Training-as-a-CI-CD-System/blob/1496cfdf2f7c7d37cf4727fe7b1a8176352072b7/tfx-pipeline/pipeline/configs.py#L89
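
In Vertex pipelines the training hardware is set in the worker pool spec of the training-args dict that gets passed to the Trainer. As a rough sketch only (the variable names below mirror the configs.py style but are illustrative, not necessarily the exact names used in the linked file):

# Illustrative sketch -- names like vertex_training_job_spec and PIPELINE_IMAGE
# are placeholders in the style of configs.py, not the repo's exact variables.
GOOGLE_CLOUD_PROJECT = 'my-gcp-project'
PIPELINE_IMAGE = f'gcr.io/{GOOGLE_CLOUD_PROJECT}/tfx-pipeline'

vertex_training_job_spec = {
    'project': GOOGLE_CLOUD_PROJECT,
    'worker_pool_specs': [{
        'machine_spec': {
            'machine_type': 'n1-standard-4',         # CPU/RAM of each worker
            'accelerator_type': 'NVIDIA_TESLA_K80',  # GPU type; drop for CPU-only
            'accelerator_count': 1,
        },
        'replica_count': 1,
        'container_spec': {
            'image_uri': PIPELINE_IMAGE,             # custom training container
        },
    }],
}

This dict is handed to the Trainer through its custom_config (e.g. under the training-args key of tfx.extensions.google_cloud_ai_platform with the Vertex flag enabled), so editing machine_type and accelerator_type there is what selects the hardware; those are the n1-standard-4 and NVIDIA_TESLA_K80 values mentioned in the blog post's "Cost" section.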

deep-diver commented 2 years ago

I am not sure what this Not Found for url: http+docker://... error message means. Maybe the Docker image was not created properly?
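
For what it's worth, the /distribution/<image>/json path in that URL is the Docker Engine's endpoint for inspecting an image in a remote registry. A small check with the docker Python SDK (a sketch only; it assumes the docker package from your traceback is installed and reuses the image name shown there) can help tell a missing local image apart from a registry/permission problem:

import docker

client = docker.from_env()
client.ping()  # raises if the local Docker daemon is unreachable

# Image name taken from the traceback above.
image = 'gcr.io/teco-prod-adam-dev-826c/tfx-pipeline'

# An empty list here means the image was never built locally.
print(client.images.list(name=image))

try:
    # Queries the same /distribution/<image>/json endpoint as in the error.
    client.images.get_registry_data(image)
    print('registry lookup OK')
except docker.errors.APIError as err:
    print('registry lookup failed:', err)

If the image shows up locally but the registry lookup fails, the problem is on the push/registry side (credentials or permissions for gcr.io) rather than the build itself.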

Are you working on a project open to anyone? If so, let me know the repo URL.

hugoferrero commented 2 years ago

I can see that the image is created but it can't be pushed, so I'm going to check the permissions. I'll keep you posted next week. I'm working on a project for the company I work for: this is a PoC to evaluate TFX for building pipelines. At the moment the repo is not open to anyone; if necessary I will create a public repo so you can check the files.

hugoferrero commented 2 years ago

Hello @deep-diver. The problem is resolved. The error requests.exceptions.HTTPError: 404 Client Error: Not Found f... occurred due to authentication problems: I had to authenticate my local Docker with Container Registry. I followed these instructions: https://cloud.google.com/container-registry/docs/advanced-authentication. I was able to run the pipeline in Vertex AI! Thank you very much for the support and for your time.