deep-diver / semantic-segmentation-ml-pipeline

Machine Learning Pipeline for Semantic Segmentation with TensorFlow Extended (TFX) and various GCP products
https://blog.tensorflow.org/2023/01/end-to-end-pipeline-for-segmentation-tfx-google-cloud-hugging-face.html
Apache License 2.0

ImportExampleGen integrated with Dataflow #20

Closed: deep-diver closed this issue 2 years ago

deep-diver commented 2 years ago

The default VM running the ImportExampleGen step of the Vertex Pipeline does not have sufficient resources to handle the entire raw Sidewalk dataset. Hence, we need to integrate Dataflow into the ImportExampleGen component.

To do this, we can call the with_beam_pipeline_args() method on ImportExampleGen with an appropriate Dataflow configuration:

from tfx.components import ImportExampleGen

# Beam pipeline options that make ImportExampleGen run on Dataflow
# instead of the local DirectRunner.
beam_pipeline_args = [
    "--runner=DataflowRunner",
    "--region=" + GOOGLE_CLOUD_REGION,
    "--service_account_email=" + DATAFLOW_SERVICE_ACCOUNT,
    "--machine_type=" + MACHINE_TYPE,
    "--experiments=use_runner_v2",
    "--max_num_workers=" + str(max_num_workers),
    "--disk_size_gb=" + str(disk_size),
]

example_gen = ImportExampleGen(...)
example_gen.with_beam_pipeline_args(beam_pipeline_args)
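
Note that with_beam_pipeline_args() attaches these options to this component only, so the Dataflow settings apply to ImportExampleGen while any other Beam-powered components keep the pipeline-level defaults.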
deep-diver commented 2 years ago

Progress update: added the missing --project option, which DataflowRunner needs in order to submit the job under the right GCP project:

beam_pipeline_args = [
    "--runner=DataflowRunner",
+   "--project=" + GOOGLE_CLOUD_PROJECT,
    "--region=" + GOOGLE_CLOUD_REGION,
    "--service_account_email=" + DATAFLOW_SERVICE_ACCOUNT,
    "--machine_type=" + MACHINE_TYPE,
    "--experiments=use_runner_v2",  
    "--max_num_workers=" + str(max_num_workers),
    "--disk_size_gb=" + str(disk_size),
]

example_gen = ImportExampleGen(...)
example_gen.with_beam_pipeline_args(beam_pipeline_args)
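
For context, the configuration values referenced above would be defined along these lines; all values below are illustrative placeholders, not this repo's actual settings:

GOOGLE_CLOUD_PROJECT = "my-gcp-project"          # illustrative project ID
GOOGLE_CLOUD_REGION = "us-central1"              # region to run Dataflow workers in
DATAFLOW_SERVICE_ACCOUNT = "dataflow-sa@my-gcp-project.iam.gserviceaccount.com"
MACHINE_TYPE = "e2-standard-8"                   # Dataflow worker machine type
max_num_workers = 4                              # autoscaling cap for workers
disk_size = 100                                  # per-worker disk size in GB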
deep-diver commented 2 years ago

Since the full-resolution data is large, the training job went OOM with the current setup. The following configuration did work, though:

# Imports assumed from the aliases used below (the standard TFX
# Vertex AI extension modules):
from tfx.extensions.google_cloud_ai_platform import constants as vertex_const
from tfx.extensions.google_cloud_ai_platform.trainer import executor as vertex_training_const

GCP_AI_PLATFORM_TRAINING_ARGS = {
    vertex_const.ENABLE_VERTEX_KEY: True,                 # run the Trainer on Vertex AI
    vertex_const.VERTEX_REGION_KEY: GOOGLE_CLOUD_REGION,
    vertex_training_const.TRAINING_ARGS_KEY: {            # CustomJob spec for Vertex AI Training
        "project": GOOGLE_CLOUD_PROJECT,
        "worker_pool_specs": [
            {
                "machine_spec": {
-                   "machine_type": "n1-standard-4",
+                   "machine_type": "n1-standard-8",
-                   "accelerator_type": "NVIDIA_TESLA_K80",
+                   "accelerator_type": "NVIDIA_TESLA_V100",
                    "accelerator_count": 1,
                },
                "replica_count": 1,
                "container_spec": {
                    "image_uri": PIPELINE_IMAGE,
                },
            }
        ],
    },
    "use_gpu": True,  # custom key, presumably read by the training code via custom_config
}
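
For reference, a dict like this is typically handed to the Vertex AI Trainer through its custom_config argument. A minimal sketch of that wiring, assuming the TFX public v1 API; the module file path, step counts, and upstream example_gen are illustrative, not taken from this repo:

from tfx import v1 as tfx

trainer = tfx.extensions.google_cloud_ai_platform.Trainer(
    module_file="pipeline/train.py",  # hypothetical training module path
    examples=example_gen.outputs["examples"],
    train_args=tfx.proto.TrainArgs(num_steps=1000),  # illustrative step counts
    eval_args=tfx.proto.EvalArgs(num_steps=100),
    custom_config=GCP_AI_PLATFORM_TRAINING_ARGS,
)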