HoloClean / holoclean

A Machine Learning System for Data Enrichment.
http://www.holoclean.io
Apache License 2.0

Remove NULLs from domain: always predict a non-NULL value. #73

Closed. richardwu closed this 5 years ago

richardwu commented 5 years ago

We remove NULL from each cell's domain. As a result, a cell that is initially NULL always receives a non-NULL prediction, unless we cannot generate a non-trivial domain for it from values that co-occur with correlated attributes.
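The behaviour described above can be illustrated with a minimal sketch. It is not HoloClean's code: `build_domain` and `predict` are hypothetical helpers, NULL is represented here by Python's `None`, and the candidate list and scores stand in for whatever the co-occurrence statistics and the learned model would actually produce.

```python
from typing import Dict, Hashable, Iterable, List, Optional

Value = Optional[Hashable]  # None plays the role of NULL in this sketch


def build_domain(init_value: Value, candidates: Iterable[Value]) -> List[Hashable]:
    """Collect candidate repair values for one cell, never including NULL.

    `candidates` is assumed to hold values that co-occur with the cell's
    correlated attributes (hypothetical input).
    """
    domain: List[Hashable] = []
    for value in candidates:
        # Drop NULL and duplicates, preserving first-seen order.
        if value is not None and value not in domain:
            domain.append(value)
    # A non-NULL initial value remains a valid candidate.
    if init_value is not None and init_value not in domain:
        domain.append(init_value)
    return domain


def predict(init_value: Value, domain: List[Hashable],
            scores: Dict[Hashable, float]) -> Value:
    """Pick the highest-scoring domain value.

    If no domain could be generated, the cell keeps its initial value,
    which is the only case where a NULL cell stays NULL.
    """
    if not domain:
        return init_value
    return max(domain, key=lambda v: scores.get(v, 0.0))
```

For an initially NULL cell with candidates `["a", None, "b", "a"]`, the domain becomes `["a", "b"]`, so the prediction is always one of those two values. With no candidates at all, the domain is empty and the cell is left as NULL.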

As desired, precision and recall match our previous results (with the settings in holoclean_repair_example.py):

23:01:27 - [ INFO] - Precision = 1.00, Recall = 0.46, Repairing Recall = 0.53, F1 = 0.63, Repairing F1 = 0.70, Detected Errors = 435, Total Errors = 509, Correct Repairs = 232, Total Repairs = 459, Total Repairs on correct cells (Grdth present) = 0, Total Repairs on incorrect cells (Grdth present) = 232

Same settings as above but without InitAttrFeaturizer:

23:12:57 - [ INFO] - Precision = 0.95, Recall = 0.85, Repairing Recall = 1.00, F1 = 0.90, Repairing F1 = 0.97, Detected Errors = 435, Total Errors = 509, Correct Repairs = 434, Total Repairs = 683, Total Repairs on correct cells (Grdth present) = 22, Total Repairs on incorrect cells (Grdth present) = 434
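As a reading aid, the sketch below recomputes the reported scores from the raw counts in the two log lines. The formulas are inferred from those numbers and are an assumption, not taken from HoloClean's evaluation code: precision is assumed to be measured only over repairs on cells with ground truth, recall over all errors, and repairing recall over detected errors. `repair_metrics` is a hypothetical helper name.

```python
def _f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def repair_metrics(detected_errors: int, total_errors: int, correct_repairs: int,
                   repairs_on_correct_cells: int, repairs_on_incorrect_cells: int) -> dict:
    """Recompute the logged scores from the logged counts (inferred definitions)."""
    # Only repairs on cells with ground truth can be judged right or wrong.
    judged_repairs = repairs_on_correct_cells + repairs_on_incorrect_cells
    precision = correct_repairs / judged_repairs if judged_repairs else 0.0
    recall = correct_repairs / total_errors if total_errors else 0.0
    repairing_recall = correct_repairs / detected_errors if detected_errors else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "repairing_recall": repairing_recall,
        "f1": _f1(precision, recall),
        "repairing_f1": _f1(precision, repairing_recall),
    }


# Counts copied from the two log lines above.
with_init_attr = repair_metrics(435, 509, 232, 0, 232)
without_init_attr = repair_metrics(435, 509, 434, 22, 434)

for name, metrics in [("with InitAttrFeaturizer", with_init_attr),
                      ("without InitAttrFeaturizer", without_init_attr)]:
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

Under these assumed definitions, the rounded values agree with both log lines. They also make the trade-off explicit: dropping InitAttrFeaturizer raises correct repairs from 232 to 434 of the 509 errors, at the cost of 22 repairs on cells that were already correct.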