Building-ML-Pipelines / building-machine-learning-pipelines

Code repository for the O'Reilly publication "Building Machine Learning Pipelines" by Hannes Hapke & Catherine Nelson
MIT License

Chapter 3 Data Ingestion #23

Closed dzlab closed 3 years ago

dzlab commented 3 years ago

In the book data splitting was mentioned briefly, e.g.

import os

from tfx.components import CsvExampleGen
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
from tfx.proto import example_gen_pb2
from tfx.utils import dsl_utils

base_dir = os.getcwd()

# Route each record into one of 10 hash buckets:
# 6 for train, 2 for eval, 2 for test.
output = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(splits=[
        example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=6),
        example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=2),
        example_gen_pb2.SplitConfig.Split(name='test', hash_buckets=2)
    ]))

examples = dsl_utils.external_input(os.path.join(base_dir, 'data'))
example_gen = CsvExampleGen(input=examples, output_config=output)

context = InteractiveContext()
context.run(example_gen)

But how would one handle unbalanced datasets when generating samples for the train/eval/test splits? I can't find any examples or documentation for the example_gen_pb2.SplitConfig.Split class.
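For context on what the hash_buckets config above does: ExampleGen hashes each record into one of the configured buckets (6 + 2 + 2 = 10 here), which is why the split is random but deterministic per record. A toy stdlib sketch of the bucketing idea (assign_split is illustrative only; TFX's actual hash function and record keying differ):

```python
import hashlib

def assign_split(record_bytes, splits):
    """Map a serialized record to a split name by hash bucket.

    splits: ordered list of (name, num_buckets) pairs, e.g.
    [('train', 6), ('eval', 2), ('test', 2)].
    """
    total = sum(buckets for _, buckets in splits)
    # Deterministic hash of the record content, reduced to a bucket index.
    digest = hashlib.sha256(record_bytes).digest()
    bucket = int.from_bytes(digest[:8], 'big') % total
    # Walk the splits until the bucket index falls inside one of them.
    for name, buckets in splits:
        if bucket < buckets:
            return name
        bucket -= buckets

splits = [('train', 6), ('eval', 2), ('test', 2)]
print(assign_split(b'some,csv,row', splits))
```

Because the assignment depends only on the record content, re-running ingestion reproduces the same split, but class proportions are only preserved approximately, not enforced.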

catherinenelson1 commented 3 years ago

Hi @dzlab,

This code works fine with unbalanced datasets (the example we use in the book is highly unbalanced!). The split is random, so you should get approximately the same class proportions in the train, eval and test splits. If you need to enforce stratified sampling, I don't believe this is supported in TFX. But you could prepare separate files for each split and pass them into your TFX pipeline as detailed here: https://www.tensorflow.org/tfx/guide/examplegen#custom_inputoutput_split
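If you go the pre-split-files route, the stratification itself can be done with the standard library before the data reaches TFX. A minimal sketch (stratified_split and its arguments are hypothetical helpers, not a TFX API): group rows by label, shuffle each group, and slice it according to the split fractions so every split keeps the original class proportions.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_idx, fractions, seed=42):
    """Split rows into named subsets while preserving per-class proportions.

    rows: list of records (e.g. parsed CSV rows).
    label_idx: index of the class label within each row.
    fractions: ordered list of (split_name, fraction) pairs summing to 1.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_idx]].append(row)

    splits = {name: [] for name, _ in fractions}
    for members in by_class.values():
        rng.shuffle(members)
        start, cum = 0, 0.0
        for i, (name, frac) in enumerate(fractions):
            cum += frac
            # Give the last split the remainder to avoid rounding loss.
            end = len(members) if i == len(fractions) - 1 else int(round(cum * len(members)))
            splits[name].extend(members[start:end])
            start = end
    return splits

# 90/10 unbalanced toy dataset: both classes appear in every split.
rows = [['a', i] for i in range(90)] + [['b', i] for i in range(10)]
splits = stratified_split(rows, 0, [('train', 0.6), ('eval', 0.2), ('test', 0.2)])
```

Each resulting subset could then be written to its own directory (train/, eval/, test/) and wired into ExampleGen via the input-split config from the guide linked above.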

Hope that's helpful!

dzlab commented 3 years ago

Yeah, stratified sampling is what I was looking for. I think it will be tricky to do outside TFX. If I have a large dataset, I would need yet another piece of infrastructure (e.g. Spark) just to prepare the data for TFX. What would you recommend?

hanneshapke commented 3 years ago

Hi @dzlab, you can extend the executor of the ExampleGen component to override the split workflow according to your needs. The benefit is that the component will then be executed via Apache Beam (a big advantage of TFX over, for example, the Kubeflow Pipelines SDK). Apache Beam can process the data itself or delegate the work to Spark or Flink, and all steps are tracked in the metadata store.

dzlab commented 3 years ago

Cool, that answers my question.

hanneshapke commented 3 years ago

@dzlab You can find an example of how to overwrite the executor in the example: https://github.com/Building-ML-Pipelines/building-machine-learning-pipelines/blob/master/chapters/adv_tfx/Custom_TFX_Components.ipynb (check out the 2nd component)