This repository shows how to build a machine learning pipeline for semantic segmentation with TensorFlow Extended (TFX) and various GCP products such as Vertex Pipeline, Vertex Training, and Vertex Endpoint. The ML pipeline also contains a custom TFX component, HFPusher, that is integrated with the Hugging Face 🤗 Hub. HFPusher pushes a trained model to the 🤗 Model Hub and, optionally, a Gradio application to the 🤗 Space Hub with the latest model out of the box.
NOTE: We use a U-Net based TensorFlow model from the official tutorial. Since our goal is to implement an ML pipeline, a U-Net-like model is a good starting point. Other SOTA models such as SegFormer from 🤗 Transformers or DeepLabv3+ will be explored later.
NOTE: The aim of this project is not to serve the most SOTA segmentation model. Our main focus is to demonstrate how to build an end-to-end ML pipeline for a semantic segmentation task.
Update 17/02/2023: This project received the #TFCommunitySpotlight award.
Update 18/01/2023: We published a blogpost on the TensorFlow blog discussing this project: End-to-End Pipeline for Segmentation with TFX, Google Cloud, and Hugging Face.
project
│
└───notebooks
│   │   gradio_demo.ipynb
│   │   inference_from_SavedModel.ipynb  # test inference w/ Vertex Endpoint
│   │   parse_tfrecords_pets.ipynb       # test TFRecord parsing
│   │   tfx_pipeline.ipynb               # build TFX pipeline within a notebook
│
└───tfrecords
│   │   create_tfrecords_pets.py         # script to create TFRecords of PETS dataset
│
└───training_pipeline
    └───apps                             # Gradio app template codebase
    └───models                           # contains files related to model
    └───pipeline                         # definition of TFX pipeline
Inside training_pipeline, the entrypoints for the pipeline runners are defined in kubeflow_runner.py and local_runner.py.
The TFX pipeline is designed to run in both local and GCP environments.
$ cd training_pipeline
$ tfx pipeline create --pipeline-path=local_runner.py \
--engine=local
$ tfx pipeline compile --pipeline-path=local_runner.py \
--engine=local
$ tfx run create --pipeline-name=segformer-training-pipeline \
--engine=local
There are two ways to run the TFX pipeline in a GCP environment (Vertex AI).
First, you can run it manually with the following CLI commands. In this case, you should replace GOOGLE_CLOUD_PROJECT with your GCP project ID in training_pipeline/pipeline/configs.py beforehand.
$ cd training_pipeline
$ tfx pipeline create --pipeline-path=kubeflow_runner.py \
--engine=vertex
$ tfx pipeline compile --pipeline-path=kubeflow_runner.py \
--engine=vertex
$ tfx run create --pipeline-name=segformer-training-pipeline \
--engine=vertex \
--project=$GCP_PROJECT_ID \
--region=$GCP_REGION
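The local and Vertex AI invocations above differ only in the runner file, the engine, and the extra project and region flags. As a minimal sketch, assuming you want to script these steps, the commands can be assembled in Python. The helper name `build_tfx_commands` and the `RUNNERS` mapping are hypothetical and not part of this repository; the flags are copied from the commands shown in this README.

```python
# Hypothetical helper (not part of this repository): assembles the three
# tfx CLI invocations shown in this README as argv lists.

# Runner file per engine, as named in this README.
RUNNERS = {"local": "local_runner.py", "vertex": "kubeflow_runner.py"}


def build_tfx_commands(engine, pipeline_name, project=None, region=None):
    """Return [create, compile, run] argv lists for the given engine."""
    runner = RUNNERS[engine]
    create = ["tfx", "pipeline", "create",
              f"--pipeline-path={runner}", f"--engine={engine}"]
    compile_ = ["tfx", "pipeline", "compile",
                f"--pipeline-path={runner}", f"--engine={engine}"]
    run = ["tfx", "run", "create",
           f"--pipeline-name={pipeline_name}", f"--engine={engine}"]
    if engine == "vertex":
        # The Vertex AI run additionally needs a GCP project and region.
        if not project or not region:
            raise ValueError("vertex runs need both project and region")
        run += [f"--project={project}", f"--region={region}"]
    return [create, compile_, run]
```

To actually execute the steps, each list could be passed to `subprocess.run(argv, check=True, cwd="training_pipeline")`, which mirrors the `cd training_pipeline` step above.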
Second, you can use the workflow_dispatch feature of GitHub Actions to run the pipeline on Vertex AI. In this case, go to the Actions tab, select Trigger Training Pipeline in the left pane, then click Run workflow on the branch of your choice. The GCP project ID given in the input parameters automatically replaces GOOGLE_CLOUD_PROJECT in training_pipeline/pipeline/configs.py. It is also injected into the tfx run create CLI command.
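The substitution of GOOGLE_CLOUD_PROJECT can be pictured with a short sketch. This is not the code used by the GitHub Action. It assumes, without confirmation from the source, that configs.py contains a single line of the form `GOOGLE_CLOUD_PROJECT = "<value>"`, and the function name `set_gcp_project` is hypothetical.

```python
import re
from pathlib import Path


def set_gcp_project(config_path, project_id):
    """Rewrite the GOOGLE_CLOUD_PROJECT assignment in a configs.py-style file.

    Assumption: the file has exactly one top-level line such as
    GOOGLE_CLOUD_PROJECT = "some-value" (single or double quotes).
    """
    path = Path(config_path)
    text = path.read_text()
    # Group 1 keeps the left-hand side, group 2 matches the opening quote,
    # and \2 requires the same quote character to close the value.
    pattern = re.compile(
        r'^(GOOGLE_CLOUD_PROJECT\s*=\s*)(["\']).*?\2', flags=re.MULTILINE
    )
    new_text, count = pattern.subn(
        lambda m: f'{m.group(1)}"{project_id}"', text
    )
    if count != 1:
        raise ValueError(
            f"expected exactly one GOOGLE_CLOUD_PROJECT assignment, found {count}"
        )
    path.write_text(new_text)
    return new_text
```

For the manual route, the same effect is achieved by editing the value in training_pipeline/pipeline/configs.py by hand before running the CLI commands.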
For further details on how the GitHub Action is implemented, please refer to its README document.
TFRecord format
ExampleGen, SchemaGen, Resolver, Trainer, Evaluator, and Pusher components
HFPusher component to the TFX pipeline
SchemaGen with ImportSchemaGen for better TFRecords parsing capability
Dataflow in ImportExampleGen to handle a large amount of data. This feature is included in the code as a reference, but it has not been used since we switched from the Sidewalks dataset to the PETS dataset.

Initially, we started our work with the Sidewalks dataset. This dataset contains different stuff and things and is also very high-resolution in nature. To keep the runtime of our pipeline short and to experiment quickly, we settled on a shallow UNet architecture (from this tutorial). For the same reason, we also downsampled the Sidewalks dataset quite a bit (128x128, 256x256, etc.), but this led to poor-quality models.
To circumvent this, we used the PETS dataset, where the effects of downsampling were much less visible than with Sidewalks.
Note, however, that the approaches showcased in our pipeline can easily be extended to high-resolution segmentation datasets and different model architectures (as long as they can be serialized as a SavedModel).
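To give a feel for how heavy downsampling can affect segmentation labels, here is a minimal, dependency-free sketch. It is an illustration only, not the authors' analysis and not the preprocessing used in this repository. The toy mask, the nearest-neighbour strategy, and the helper names `downsample_nearest` and `class_ids` are all assumptions made for the example.

```python
def downsample_nearest(mask, factor):
    """Nearest-neighbour downsampling: keep every `factor`-th pixel per axis."""
    return [row[::factor] for row in mask[::factor]]


def class_ids(mask):
    """Return the set of class IDs present in a 2-D label mask."""
    return {pixel for row in mask for pixel in row}


# Toy 8x8 label mask (hypothetical data):
#   class 0 = background
#   class 1 = a large region (rows 2-7, columns 0-3)
#   class 2 = a one-pixel-wide vertical structure in column 5
mask = [[0] * 8 for _ in range(8)]
for r in range(2, 8):
    for c in range(0, 4):
        mask[r][c] = 1
for r in range(8):
    mask[r][5] = 2

# Downsample 8x8 -> 2x2 and compare which classes survive.
small = downsample_nearest(mask, 4)
print("classes before:", sorted(class_ids(mask)))
print("classes after: ", sorted(class_ids(small)))
```

In this toy case the thin class 2 structure is absent from the downsampled mask while the large class 1 region survives, which is one plausible way fine-grained labels in a high-resolution dataset can be lost.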
We are thankful to the ML Developer Programs team at Google that provided GCP support.