Before this PR (removing `encoding='utf8'` from `read_csv`):
```
======================================================================
ERROR: test_hospital (test_holoclean_repair.TestHolocleanRepair)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_holoclean_repair.py", line 31, in test_hospital
hc.repair_errors(featurizers)
File "/Users/rwu1997/Programming/holoclean/holoclean.py", line 228, in repair_errors
status, feat_time = self.repair_engine.setup_featurized_ds(featurizers)
File "/Users/rwu1997/Programming/holoclean/repair/repair.py", line 16, in setup_featurized_ds
self.feat_dataset = FeaturizedDataset(self.ds, self.env, featurizers)
File "/Users/rwu1997/Programming/holoclean/repair/featurize/featurize.py", line 16, in __init__
tensors = [f.create_tensor() for f in featurizers]
File "/Users/rwu1997/Programming/holoclean/repair/featurize/freqfeat.py", line 31, in create_tensor
tensors = [self.gen_feat_tensor(res, self.classes) for res in results]
File "/Users/rwu1997/Programming/holoclean/repair/featurize/freqfeat.py", line 24, in gen_feat_tensor
prob = float(self.single_stats[attribute][val])/float(self.total)
KeyError: u'surgeryxpatientsxneedingxhairxremovedxfromxthexsurgicalxareaxbeforexsurgery& xwhoxhadxhairxremovedxusingxaxsaferxmethodx(electricxclippersxorxhairxremovalxcreamx\xef\xbf\xbdcxnotxaxrazor)'
```
After this PR:
```
INFO:root:Precision = 0.94, Recall = 0.69, Repairing Recall = 0.80, F1 = 0.80, Repairing F1 = 0.87, Detected Errors = 438, Total Errors = 509, Correct Repairs = 350, Total Repairs = 371, Total Repairs (Grdth present) = 371
```
Closes #26.
While we were encoding values properly as UTF-8 in Postgres, we did not maintain them as unicode strings in the in-memory dataframes (which we generate stats from). As a result, some lookups with values coming back from Postgres were raising a `KeyError`, since the stats dictionaries had Python byte strings as keys rather than unicode strings.
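A minimal sketch of the mismatch (illustrative values, not HoloClean's actual code): a dict keyed by raw byte strings from an undecoded dataframe will miss lookups done with decoded (unicode) strings, even when both represent the same UTF-8 text.

```python
# Byte-string key: the UTF-8 bytes for "café", as stored in the
# in-memory dataframe before this PR.
stats = {b'caf\xc3\xa9': 1}

# Decoded string, as returned from Postgres.
lookup = 'caf\xe9'

print(lookup in stats)    # False -> this is the KeyError above

# Decoding the in-memory keys to unicode up front fixes the lookup:
stats_fixed = {k.decode('utf-8'): v for k, v in stats.items()}
print(stats_fixed[lookup])  # 1
```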
Also simplified some of the code where we were repeatedly converting the stats dataframes to dicts, rather than doing it once at the beginning.
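The dict-once cleanup can be sketched roughly like this (hypothetical names, not the actual HoloClean API): compute the per-attribute frequency stats as a plain dict a single time, so featurizers only do cheap dict lookups.

```python
import pandas as pd

# Build the frequency stats once, up front...
df = pd.DataFrame({'attr': ['a', 'a', 'b']})
single_stats = df['attr'].value_counts().to_dict()  # {'a': 2, 'b': 1}
total = len(df)

# ...so each featurizer call is just a dict lookup, not a
# dataframe-to-dict conversion.
prob = float(single_stats['a']) / float(total)
print(prob)
```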