HoloClean / holoclean

A Machine Learning System for Data Enrichment.
http://www.holoclean.io
Apache License 2.0

Fixed encoding issue where dataframes were not encoded as unicode #31

Closed richardwu closed 5 years ago

richardwu commented 5 years ago

Closes #26.

While we were encoding values properly as UTF-8 in Postgres, we did not maintain them as unicode strings in the in-memory dataframes (which we generate stats from).

Some lookups with values from Postgres were raising a KeyError because the stats dictionaries had Python byte strings as keys rather than unicode strings.
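To illustrate the mismatch, here is a minimal sketch, not the actual HoloClean code. The values and variable names (raw_values, byte_stats, db_value) are hypothetical. It assumes stats were counted over UTF-8 byte strings while the value used for the lookup is a unicode string, as described above.

```python
# -*- coding: utf-8 -*-
# Hypothetical cell values held as UTF-8 encoded byte strings,
# standing in for a dataframe column that was never decoded.
raw_values = [u"caf\u00e9".encode("utf-8"), u"caf\u00e9".encode("utf-8"), b"tea"]

# Frequency stats keyed by byte strings.
byte_stats = {}
for value in raw_values:
    byte_stats[value] = byte_stats.get(value, 0) + 1

# Hypothetical value coming back from the database as a unicode string.
db_value = u"caf\u00e9"

# A unicode key does not match the byte-string key, so the lookup fails.
try:
    byte_stats[db_value]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Decoding the keys so both sides are unicode makes the lookup succeed.
unicode_stats = {key.decode("utf-8"): count for key, count in byte_stats.items()}
fixed_count = unicode_stats[db_value]

print(lookup_failed, fixed_count)
```

Keeping every string as unicode from the moment the data is loaded avoids having to decode keys after the fact.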

Also simplified some of the code where we were repeatedly converting stats dataframes to dictionaries rather than doing it once at the beginning.
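A rough sketch of the "convert once" idea, under stated assumptions: the class name FreqStats and the rows data are hypothetical, and a plain list of records stands in for the stats dataframe. Only the names single_stats and total mirror the attributes visible in the traceback in this thread.

```python
from collections import Counter


class FreqStats(object):
    """Hypothetical helper: builds the lookup dictionaries a single time."""

    def __init__(self, rows, attributes):
        self.total = len(rows)
        # Converted to plain dicts once here, instead of on every lookup.
        self.single_stats = {
            attr: dict(Counter(row[attr] for row in rows))
            for attr in attributes
        }

    def prob(self, attribute, val):
        # Reuses the precomputed dictionaries; no per-call conversion.
        return float(self.single_stats[attribute][val]) / float(self.total)


# Hypothetical records standing in for the in-memory dataframe.
rows = [
    {"city": u"boston"},
    {"city": u"boston"},
    {"city": u"chicago"},
    {"city": u"madison"},
]
stats = FreqStats(rows, ["city"])
print(stats.prob("city", u"boston"))
```

Every later lookup then reads from the same dictionaries, so the conversion cost is paid once no matter how many values are featurized.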

richardwu commented 5 years ago

Before this PR (reproduced by removing encoding='utf8' from read_csv):

======================================================================
ERROR: test_hospital (test_holoclean_repair.TestHolocleanRepair)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_holoclean_repair.py", line 31, in test_hospital
    hc.repair_errors(featurizers)
  File "/Users/rwu1997/Programming/holoclean/holoclean.py", line 228, in repair_errors
    status, feat_time = self.repair_engine.setup_featurized_ds(featurizers)
  File "/Users/rwu1997/Programming/holoclean/repair/repair.py", line 16, in setup_featurized_ds
    self.feat_dataset = FeaturizedDataset(self.ds, self.env, featurizers)
  File "/Users/rwu1997/Programming/holoclean/repair/featurize/featurize.py", line 16, in __init__
    tensors = [f.create_tensor() for f in featurizers]
  File "/Users/rwu1997/Programming/holoclean/repair/featurize/freqfeat.py", line 31, in create_tensor
    tensors = [self.gen_feat_tensor(res, self.classes) for res in results]
  File "/Users/rwu1997/Programming/holoclean/repair/featurize/freqfeat.py", line 24, in gen_feat_tensor
    prob = float(self.single_stats[attribute][val])/float(self.total)
KeyError: u'surgeryxpatientsxneedingxhairxremovedxfromxthexsurgicalxareaxbeforexsurgery& xwhoxhadxhairxremovedxusingxaxsaferxmethodx(electricxclippersxorxhairxremovalxcreamx\xef\xbf\xbdcxnotxaxrazor)'

After this PR

INFO:root:Precision = 0.94, Recall = 0.69, Repairing Recall = 0.80, F1 = 0.80, Repairing F1 = 0.87, Detected Errors = 438, Total Errors = 509, Correct Repairs = 350, Total Repairs = 371, Total Repairs (Grdth present) = 371
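As a sanity check on the log line above, the reported scores can be recomputed from the reported counts. The formulas below are inferred from the numbers and are an assumption, not taken from HoloClean's evaluation code. In particular, it is assumed that Repairing Recall divides by Detected Errors rather than Total Errors.

```python
# Counts copied from the log line above.
correct_repairs = 350
total_repairs = 371
total_errors = 509
detected_errors = 438


def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)


# Assumed definitions, inferred from the reported values.
precision = correct_repairs / float(total_repairs)
recall = correct_repairs / float(total_errors)
repairing_recall = correct_repairs / float(detected_errors)
overall_f1 = f1(precision, recall)
repairing_f1 = f1(precision, repairing_recall)

print("Precision = %.2f, Recall = %.2f, Repairing Recall = %.2f, "
      "F1 = %.2f, Repairing F1 = %.2f"
      % (precision, recall, repairing_recall, overall_f1, repairing_f1))
```

Under these assumed definitions the rounded values agree with the logged ones.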