deep-diver / semantic-segmentation-ml-pipeline

Machine Learning Pipeline for Semantic Segmentation with TensorFlow Extended (TFX) and various GCP products
https://blog.tensorflow.org/2023/01/end-to-end-pipeline-for-segmentation-tfx-google-cloud-hugging-face.html
Apache License 2.0

ImportExampleGen integrated with Dataflow #20

Closed: deep-diver closed this issue 2 years ago

deep-diver commented 2 years ago

The default VM running the ImportExampleGen step of the Vertex Pipeline does not have sufficient resources to handle the entire raw Sidewalk dataset. Hence, we need to integrate Dataflow into the ImportExampleGen component.

To do this, we can call the with_beam_pipeline_args() method on ImportExampleGen with an appropriate Dataflow configuration:

from tfx.components import ImportExampleGen

# Beam pipeline options that make ImportExampleGen run on Dataflow
# instead of the local DirectRunner.
beam_pipeline_args = [
    "--runner=DataflowRunner",
    "--region=" + GOOGLE_CLOUD_REGION,
    "--service_account_email=" + DATAFLOW_SERVICE_ACCOUNT,
    "--machine_type=" + MACHINE_TYPE,
    "--experiments=use_runner_v2",
    "--max_num_workers=" + str(max_num_workers),
    "--disk_size_gb=" + str(disk_size),
]

example_gen = ImportExampleGen(...)
example_gen.with_beam_pipeline_args(beam_pipeline_args)
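
Note that with_beam_pipeline_args() attaches these options to this component only, so the Dataflow settings apply to ImportExampleGen while any other Beam-powered components keep the pipeline-level defaults.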
deep-diver commented 2 years ago

Progress update: added the missing --project option, which DataflowRunner needs in order to submit the job under the right GCP project:

beam_pipeline_args = [
    "--runner=DataflowRunner",
+   "--project=" + GOOGLE_CLOUD_PROJECT,
    "--region=" + GOOGLE_CLOUD_REGION,
    "--service_account_email=" + DATAFLOW_SERVICE_ACCOUNT,
    "--machine_type=" + MACHINE_TYPE,
    "--experiments=use_runner_v2",  
    "--max_num_workers=" + str(max_num_workers),
    "--disk_size_gb=" + str(disk_size),
]

example_gen = ImportExampleGen(...)
example_gen.with_beam_pipeline_args(beam_pipeline_args)
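
For context, the configuration values referenced above would be defined along these lines; all values below are illustrative placeholders, not this repo's actual settings:

GOOGLE_CLOUD_PROJECT = "my-gcp-project"          # illustrative project ID
GOOGLE_CLOUD_REGION = "us-central1"              # region to run Dataflow workers in
DATAFLOW_SERVICE_ACCOUNT = "dataflow-sa@my-gcp-project.iam.gserviceaccount.com"
MACHINE_TYPE = "e2-standard-8"                   # Dataflow worker machine type
max_num_workers = 4                              # autoscaling cap for workers
disk_size = 100                                  # per-worker disk size in GB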
deep-diver commented 2 years ago

Since the full-resolution data is large, the training job went OOM with the current setup. The following configuration did work, though:

# Imports assumed from the aliases used below (the standard TFX
# Vertex AI extension modules):
from tfx.extensions.google_cloud_ai_platform import constants as vertex_const
from tfx.extensions.google_cloud_ai_platform.trainer import executor as vertex_training_const

GCP_AI_PLATFORM_TRAINING_ARGS = {
    vertex_const.ENABLE_VERTEX_KEY: True,                 # run the Trainer on Vertex AI
    vertex_const.VERTEX_REGION_KEY: GOOGLE_CLOUD_REGION,
    vertex_training_const.TRAINING_ARGS_KEY: {            # CustomJob spec for Vertex AI Training
        "project": GOOGLE_CLOUD_PROJECT,
        "worker_pool_specs": [
            {
                "machine_spec": {
-                   "machine_type": "n1-standard-4",
+                   "machine_type": "n1-standard-8",
-                   "accelerator_type": "NVIDIA_TESLA_K80",
+                   "accelerator_type": "NVIDIA_TESLA_V100",
                    "accelerator_count": 1,
                },
                "replica_count": 1,
                "container_spec": {
                    "image_uri": PIPELINE_IMAGE,
                },
            }
        ],
    },
    "use_gpu": True,  # custom key, presumably read by the training code via custom_config
}
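
For reference, a dict like this is typically handed to the Vertex AI Trainer through its custom_config argument. A minimal sketch of that wiring, assuming the TFX public v1 API; the module file path, step counts, and upstream example_gen are illustrative, not taken from this repo:

from tfx import v1 as tfx

trainer = tfx.extensions.google_cloud_ai_platform.Trainer(
    module_file="pipeline/train.py",  # hypothetical training module path
    examples=example_gen.outputs["examples"],
    train_args=tfx.proto.TrainArgs(num_steps=1000),  # illustrative step counts
    eval_args=tfx.proto.EvalArgs(num_steps=100),
    custom_config=GCP_AI_PLATFORM_TRAINING_ARGS,
)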