GoogleCloudPlatform / vertex-pipelines-end-to-end-samples

Apache License 2.0

Why was data validation removed from the XGBoost training pipeline? #52

Closed clopezhrimac closed 1 year ago

clopezhrimac commented 1 year ago

I just noticed that, as of the latest changes on the master branch, there are no data validation components. Why was it decided to remove the data validation components from the training pipeline?

felix-datatonic commented 1 year ago

Hi @clopezhrimac,

Having used this solution on a variety of projects, we've noticed that TensorFlow Data Validation (TFDV) comes with a significant learning curve. While TFDV is a great fit for TensorFlow Extended (TFX), it can be a challenge for end users of this project, since we're aiming for a template that is easy to set up, adapt, and productionise. Using TFDV in a Kubeflow pipeline requires a fair amount of custom code, installing the TensorFlow dependencies slows down pipeline runs, and performant execution calls for Dataflow, which adds another layer of complexity for users.

We're aiming to close the data validation gap in the current version soon with the following features:

  1. adding another data validation tool with a gentler learning curve (e.g. Great Expectations)
  2. allowing users of the template to optionally use TFDV components (and other components)
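
To give a rough idea of what a lightweight validation step could look like, here is a minimal sketch in plain Python. It is not the template's implementation and it does not use the Great Expectations or TFDV APIs; the `Expectation` class, the `validate_rows` function, and the column names are all hypothetical and only illustrate the "declare rules, then check the data against them" idea.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Expectation:
    """A named rule that every value in one column must satisfy (hypothetical helper)."""
    column: str
    description: str
    check: Callable[[Any], bool]


def validate_rows(rows: List[Dict[str, Any]], expectations: List[Expectation]) -> List[str]:
    """Return human-readable failure messages; an empty list means the data passed."""
    failures = []
    for exp in expectations:
        bad = 0
        for row in rows:
            # A row fails if the column is missing or its value breaks the rule.
            if exp.column not in row or not exp.check(row[exp.column]):
                bad += 1
        if bad:
            failures.append(
                f"{exp.column}: {exp.description} ({bad} of {len(rows)} rows failed)"
            )
    return failures


# Hypothetical column names, chosen only for illustration.
EXPECTATIONS = [
    Expectation(
        "trip_miles",
        "must be a non-negative number",
        lambda v: isinstance(v, (int, float)) and v >= 0,
    ),
    Expectation("payment_type", "must not be null", lambda v: v is not None),
]

sample = [
    {"trip_miles": 1.2, "payment_type": "Cash"},
    {"trip_miles": -3.0, "payment_type": None},
]

for failure in validate_rows(sample, EXPECTATIONS):
    print(failure)
```

In a pipeline, a function like this could be wrapped in a component that fails the run when the returned list is non-empty, without pulling in TensorFlow dependencies.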

Let us know if you have further ideas about adding data validation to the template!