HoloClean / holoclean

A Machine Learning System for Data Enrichment.
http://www.holoclean.io
Apache License 2.0

Add batched featurization logic to reduce memory usage in repair model #59

Closed jonmio closed 5 years ago

jonmio commented 5 years ago

WIP - PR is missing documentation.

Currently, featurization is done on all cells up front, and the resulting feature tensors are stored in memory. These tensors are very large relative to the size of the original dataset, which can lead to memory issues.

This PR introduces changes to address these memory concerns. A Torch DataLoader is used to iterate over batches of the training/inference data, and featurization now happens lazily, only when an example is actually needed for training or inference.
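The lazy-featurization approach described above can be sketched with a custom `Dataset` whose `__getitem__` featurizes a single cell on demand; the `DataLoader` then assembles batches without the full feature tensor ever being materialized. The class and featurizer names below are hypothetical illustrations, not the PR's actual code:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LazyFeaturizedDataset(Dataset):
    """Featurizes one cell at a time instead of materializing
    all feature tensors in memory up front."""

    def __init__(self, cells, featurizers):
        self.cells = cells            # raw cells from the dataset
        self.featurizers = featurizers  # callables: cell -> 1-D tensor

    def __len__(self):
        return len(self.cells)

    def __getitem__(self, idx):
        cell = self.cells[idx]
        # Featurization happens here, at access time, not at load time.
        feats = [f(cell) for f in self.featurizers]
        return torch.cat(feats)

# Toy featurizers for illustration only.
cells = list(range(100))
featurizers = [
    lambda c: torch.tensor([float(c)]),
    lambda c: torch.tensor([float(c % 2), float(c % 3)]),
]
ds = LazyFeaturizedDataset(cells, featurizers)
loader = DataLoader(ds, batch_size=10, shuffle=True)
batch = next(iter(loader))  # shape (10, 3): 10 cells x 3 features
```

The memory/runtime trade-off follows directly: only `batch_size` featurized examples exist at once, but each epoch pays the featurization cost again because nothing is cached.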

jonmio commented 5 years ago

Added documentation that addresses most of Richard's comments.

Also ran some memory and runtime profiling experiments. On the full hospital dataset, using an iterator with a featurization_batch_size of 10 reduces memory usage during training from 900MB to 230MB. However, each training epoch takes about 20s longer because we re-featurize on every epoch.

Precision and recall haven't been affected.

Regarding caching after the first epoch, I've also discussed this with Mina and we plan on experimenting with it. One workaround he suggested for the shuffling issue is to shuffle within batches.