Closed dzlab closed 3 years ago
Hi @dzlab,
This code works fine with unbalanced datasets (the example we use in the book is highly unbalanced!). The split is random, so you should get approximately the same class proportions in the train, eval and test splits. If you need to enforce stratified sampling, I don't believe this is supported in TFX. But you could prepare separate files for each split and pass them into your TFX pipeline as detailed here: https://www.tensorflow.org/tfx/guide/examplegen#custom_inputoutput_split
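As a rough illustration of the "prepare separate files for each split" option, here is a minimal sketch that does a stratified split in plain Python and writes one CSV per split. It assumes the data fits in memory and is available as a list of dicts; the function names, the `label` column, the ratios and the `train/`, `eval/`, `test/` directory layout are all hypothetical choices, not something taken from the book or from TFX.

```python
import csv
import random
from collections import defaultdict
from pathlib import Path


def stratified_split(rows, label_key, ratios, seed=42):
    """Split rows so every class keeps about the same share in each split.

    rows: list of dicts; label_key: column holding the class label;
    ratios: dict such as {"train": 0.6, "eval": 0.2, "test": 0.2}.
    """
    rng = random.Random(seed)

    # Group the rows by class label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    names = list(ratios)
    total = sum(ratios.values())
    splits = {name: [] for name in names}

    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n = len(label_rows)
        start = 0
        cumulative = 0.0
        for i, name in enumerate(names):
            cumulative += ratios[name] / total
            # The last split takes the remainder so that no row is dropped.
            end = n if i == len(names) - 1 else round(cumulative * n)
            splits[name].extend(label_rows[start:end])
            start = end
    return splits


def write_splits(splits, fieldnames, out_dir):
    """Write one CSV per split, e.g. out_dir/train/data.csv."""
    for name, rows in splits.items():
        split_dir = Path(out_dir) / name
        split_dir.mkdir(parents=True, exist_ok=True)
        with open(split_dir / "data.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
```

With files laid out like this, each split directory can be matched by its own pattern (for example `train/*`) in the custom input configuration described in the guide linked above.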
Hope that's helpful!
Yeah, stratified sampling is what I was looking for. I think it will be tricky to do it outside TFX. If I have a large dataset, I would need yet another infra (e.g. Spark) just to prepare data for TFX. What would you recommend?
Hi @dzlab, you can extend the executor of the ExampleGen component to override the split workflow according to your needs. The benefit is that the component will then be executed via Apache Beam (a big advantage of TFX over, for example, the Kubeflow Pipelines SDK). Apache Beam can process the data itself or outsource it to Spark or Flink. And all steps are tracked in the metadata store.
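To make the idea a bit more concrete, here is a minimal sketch of the per-class split logic such a custom executor could apply. It is plain Python rather than a real executor: it assumes the examples have already been grouped by label (in a Beam pipeline that would typically be a `GroupByKey` step), and the function names and bucket weights are hypothetical. The executor subclass and the Beam wiring are not shown.

```python
from collections import defaultdict
from itertools import accumulate


def split_class(records, buckets):
    """Yield (split_name, record) for the records of ONE class.

    buckets: dict of integer weights, e.g. {"train": 6, "eval": 2, "test": 2}.
    Records are assigned in a repeating cycle of sum(buckets) positions, so
    the class size does not need to be known in advance.
    """
    names = list(buckets)
    bounds = list(accumulate(buckets[name] for name in names))  # e.g. [6, 8, 10]
    cycle = bounds[-1]
    for i, record in enumerate(records):
        position = i % cycle
        for name, bound in zip(names, bounds):
            if position < bound:
                yield name, record
                break


def stratified_assign(labeled_records, buckets):
    """Plain-Python stand-in for 'group by label, then split each class'.

    labeled_records: iterable of (label, record) pairs.
    Yields (split_name, label, record) triples.
    """
    grouped = defaultdict(list)
    for label, record in labeled_records:
        grouped[label].append(record)
    for label, records in grouped.items():
        for split_name, record in split_class(records, buckets):
            yield split_name, label, record
```

One caveat of this cycle-based assignment: the first positions of each cycle go to the first split, so a class with fewer records than one full cycle may not reach the later splits at all.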
Cool, that answers my question.
@dzlab You can find an example of how to override the executor here: https://github.com/Building-ML-Pipelines/building-machine-learning-pipelines/blob/master/chapters/adv_tfx/Custom_TFX_Components.ipynb (check out the 2nd component)
In the book data splitting was mentioned briefly, e.g.
But how would one handle unbalanced datasets when generating samples for train/eval and test? I'm not able to find any examples or documentation of this example_gen_pb2.SplitConfig.Split class.