HoloClean / holoclean

A Machine Learning System for Data Enrichment.
http://www.holoclean.io
Apache License 2.0

Add batched featurization logic to reduce memory usage in repair model #59

Closed jonmio closed 5 years ago

jonmio commented 5 years ago

WIP - PR is missing documentation.

Currently, featurization is done on all cells up front, and the resulting feature tensors are stored in memory. These tensors are very large relative to the size of the original dataset, which can lead to memory issues.

This PR introduces changes to address these memory concerns. A Torch DataLoader is used to iterate over batches of the training/inference data, and featurization now happens lazily, only when an example is actually needed for training or inference.
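The lazy-featurization approach described above can be sketched with a custom `Dataset` whose `__getitem__` featurizes a single cell on demand; the `DataLoader` then assembles batches without the full feature tensor ever being materialized. The class and featurizer names below are hypothetical illustrations, not the PR's actual code:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LazyFeaturizedDataset(Dataset):
    """Featurizes one cell at a time instead of materializing
    all feature tensors in memory up front."""

    def __init__(self, cells, featurizers):
        self.cells = cells            # raw cells from the dataset
        self.featurizers = featurizers  # callables: cell -> 1-D tensor

    def __len__(self):
        return len(self.cells)

    def __getitem__(self, idx):
        cell = self.cells[idx]
        # Featurization happens here, at access time, not at load time.
        feats = [f(cell) for f in self.featurizers]
        return torch.cat(feats)

# Toy featurizers for illustration only.
cells = list(range(100))
featurizers = [
    lambda c: torch.tensor([float(c)]),
    lambda c: torch.tensor([float(c % 2), float(c % 3)]),
]
ds = LazyFeaturizedDataset(cells, featurizers)
loader = DataLoader(ds, batch_size=10, shuffle=True)
batch = next(iter(loader))  # shape (10, 3): 10 cells x 3 features
```

The memory/runtime trade-off follows directly: only `batch_size` featurized examples exist at once, but each epoch pays the featurization cost again because nothing is cached.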

jonmio commented 5 years ago

Added documentation that addresses most of Richard's comments.

Also ran some memory and runtime profiling experiments. On the full hospital dataset, using an iterator with a featurization_batch_size of 10 reduces memory usage during training from 900MB to 230MB. However, each training epoch takes about 20s longer because we re-featurize on every epoch.

Precision and recall haven't been affected.

Regarding caching after the first epoch, I've also discussed this with Mina and we plan on experimenting with it. One workaround he suggested for the shuffling issue is to shuffle within batches.