GoogleCloudPlatform / vertex-pipelines-end-to-end-samples

Difficulty Debugging the Vertex AI Training Pipeline #49

Open clopezhrimac opened 1 year ago

clopezhrimac commented 1 year ago

I am facing difficulties in debugging the Vertex AI training pipeline. I cannot run the pipeline locally for testing and debugging; instead, I have to submit the pipeline to Vertex AI and wait for it to execute before I get any debugging information.

My current debugging process involves submitting the pipeline with multiple print statements or logging messages to trace the execution flow and pinpoint the exact location of the error. This is a slow and tedious cycle, as it requires resubmitting the pipeline every time I make an adjustment or need to identify an error.

Steps to Reproduce the Problem:

What would be the best way to handle this training component development cycle?

felix-datatonic commented 1 year ago

Hi @clopezhrimac, thanks for your question! KubeFlow limits the execution of pipelines locally; the alternatives suggested by the community are:

  1. creating a local Kubernetes cluster and submitting your pipeline to that cluster (instead of Vertex AI)
  2. isolating individual KubeFlow operations into Python- or container-based components so that you can test your business logic locally, although only one component at a time (see the sketch after this list)
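
For illustration, here is a minimal sketch of option (2), assuming KFP v2 lightweight Python components (the component and test below are hypothetical, not taken from this repo). The decorated component exposes the original function via .python_func, so the business logic can be unit-tested with pytest without submitting anything to Vertex AI:

# test_scale_component.py -- hypothetical example, not part of this repo
from kfp import dsl

@dsl.component(base_image="python:3.10")
def scale_value(value: float, factor: float) -> float:
    # All logic lives inside the function body, as required for
    # lightweight Python components.
    return value * factor

def test_scale_value_locally():
    # Call the wrapped Python function directly instead of running a pipeline.
    assert scale_value.python_func(value=2.0, factor=3.0) == 6.0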

In this project, we've optimised components to be in line with option (2). Commands that help you with local testing are:

make setup-all-components
make test-all-components

or

make setup-component GROUP=<e.g. vertex-components>
make test-component GROUP=<e.g. vertex-components>

Further, we've recently replaced the Python-based training component in the pipelines with a CustomTrainingJob, which allows you to run your training script locally before submitting it to Vertex AI.
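
For example, a rough local-run sketch (the script path, flags and sample data are hypothetical, not this repo's actual entrypoint). Vertex AI passes the output location to a CustomTrainingJob via the AIP_MODEL_DIR environment variable, so setting it locally lets the same script run unchanged on your machine:

# run_training_locally.py -- hypothetical helper, not part of this repo
import os
import subprocess

# Vertex AI sets AIP_MODEL_DIR inside a CustomTrainingJob; set it locally
# so the training script writes its model artifact to a local directory.
env = dict(os.environ, AIP_MODEL_DIR="/tmp/local_model")

subprocess.run(
    ["python", "training/train.py", "--train-data", "data/sample_train.csv"],
    env=env,
    check=True,
)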

While these don't provide full parity between local pipeline runs and pipelines submitted to Vertex AI, they will help you iterate locally on any changes related to custom Python-based components and your training code.

We're currently evaluating the use of CustomPythonPackageTrainingJob, too, and are open to any suggestions you might have!

clopezhrimac commented 1 year ago

What is the difference between CustomTrainingJob and CustomPythonPackageTrainingJob?

felix-datatonic commented 10 months ago

Hi @clopezhrimac,

Thanks for this issue. Please check out the most recent PR and release.

We've moved away from CustomTrainingJob and CustomPythonPackageTrainingJob since KubeFlow 2.0 supports container components now.
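
For reference, a minimal sketch of a KFP 2.x container component (the image, command and arguments below are hypothetical, not this repo's actual definitions):

from kfp import dsl

@dsl.container_component
def train(data_path: str, model_dir: dsl.OutputPath(str)):
    # The component only describes the container to run; the training code
    # itself lives in the image and can also be executed locally.
    return dsl.ContainerSpec(
        image="europe-docker.pkg.dev/my-project/my-repo/training:latest",
        command=["python", "-m", "training.train"],
        args=["--data-path", data_path, "--model-dir", model_dir],
    )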

You can cd into the model folder and run your training and prediction code locally before triggering a pipeline in Vertex AI. However, this will only test the training and prediction steps, not the pipeline end-to-end.
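
As an illustration only (the module names and entrypoints below are hypothetical and will differ from the actual layout of the model folder), a local smoke test of the training and prediction code might look like:

# smoke_test.py -- hypothetical, adjust to the actual model folder layout
from pathlib import Path

from training.train import train        # assumed entrypoint
from prediction.predict import predict  # assumed entrypoint

model_dir = Path("/tmp/local_model")
model_dir.mkdir(parents=True, exist_ok=True)

# Train on a small local sample and write the model artifact locally.
train(data_path="data/sample_train.csv", model_dir=str(model_dir))

# Score a handful of rows against the freshly trained model as a sanity check.
predictions = predict(model_dir=str(model_dir), data_path="data/sample_test.csv")
print(predictions[:5])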