fgregg closed this issue 11 years ago
@nilesh-c, this is the right place to report bugs. Could you run dedupe with verbose output:
python examples/csv_example/csv_example.py -vv
and paste the trace here?
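For readers wondering what the repeated -v does: a common pattern is to count the flags and map the count to a logging level, which would be consistent with the INFO and DEBUG lines that show up in the output below. The following is only a sketch of that pattern; the helper names are hypothetical and the option handling in csv_example.py may differ.

```python
import argparse
import logging


def parse_verbosity(argv):
    """Count how many times -v / --verbose was given (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbose", action="count", default=0)
    return parser.parse_args(argv).verbose


def log_level_from_verbosity(count):
    """Map the flag count to a logging level: none=WARNING, -v=INFO, -vv=DEBUG."""
    if count >= 2:
        return logging.DEBUG
    if count == 1:
        return logging.INFO
    return logging.WARNING


# "-vv" is parsed as two occurrences of -v, so DEBUG messages are shown.
level = log_level_from_verbosity(parse_verbosity(["-vv"]))
logging.basicConfig(level=level)
logging.getLogger().debug("debug output enabled")
```

With this mapping, running with no flag would show only warnings and errors, which is why the verbose run is the useful one for a bug report.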
@fgregg Thanks, I was confused about whether I should open a new issue for this.
Here is the output you asked for:
python examples/csv_example/csv_example.py -vv
importing data ...
reading labeled examples from csv_example_training.json
INFO:root:reading training from file
INFO:root:using cross validation to find optimum alpha...
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:Average Score: 0.300000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:Average Score: 0.300000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:Average Score: 0.300000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:Average Score: 0.300000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:Average Score: 0.300000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:Average Score: 0.300000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:Average Score: 0.300000
INFO:root:optimum alpha: 1.000000
INFO:root:Learned Weights
INFO:root:('Phone', -0.34075412154197693)
INFO:root:('Address', -0.10351292788982391)
INFO:root:('Zip', 0.033473193645477295)
INFO:root:('Site name', -0.2907579243183136)
INFO:root:('Phone: not_missing', 0.0)
INFO:root:('Zip: not_missing', 0.0)
INFO:root:('bias', 1.360727474729842)
starting active labeling...
INFO:root:calculated fieldDistances in 2.80804491043 seconds
INFO:root:finding the next uncertain pair ...
Phone : 6452300
Address : 1343 n. california
Zip :
Site name : casa central - csc development program
Phone : 6452300
Address : 1343 n. california ave
Zip : 60622
Site name : community service center
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
INFO:root:finding the next uncertain pair ...
Phone :
Address : 2718 w 59th st
Zip :
Site name : easter seals society of metropolitan chicago - the keeper's inst.
Phone : 5211600
Address : 2929 w 19th street
Zip : 60623
Site name : carole robertson center for learning
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
INFO:root:finding the next uncertain pair ...
Phone : 5341804
Address : 1326 s. avers ave
Zip :
Site name : henson
Phone : 5341665
Address : 1616 s. avers
Zip : 60623
Site name : penn (blended)
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
f
Finished labeling
INFO:root:using cross validation to find optimum alpha...
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:Average Score: 0.350000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:Average Score: 0.350000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:Average Score: 0.350000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:Average Score: 0.350000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:Average Score: 0.350000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:0 true predicted dupes in training set
DEBUG:root:Recall 0.000000
DEBUG:root:F-Score 0.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:Average Score: 0.300000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:0 true predicted dupes in training set
DEBUG:root:Recall 0.000000
DEBUG:root:F-Score 0.000000
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
DEBUG:root:1 duplicates in validation set
DEBUG:root:1 true predicted dupes in training set
DEBUG:root:Recall 1.000000
DEBUG:root:1 predicted duplicates
DEBUG:root:Precision 1.000000
DEBUG:root:F-Score 1.000000
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
DEBUG:root:Average Score: 0.300000
INFO:root:optimum alpha: 0.010000
INFO:root:Learned Weights
INFO:root:('Phone', -1.3635646104812622)
INFO:root:('Address', -0.12422920018434525)
INFO:root:('Zip', 1.5542700290679932)
INFO:root:('Site name', -0.5060533285140991)
INFO:root:('Phone: not_missing', 0.0)
INFO:root:('Zip: not_missing', 0.0)
INFO:root:('bias', 2.8320163776556644)
blocking...
INFO:root:num chunks 2
INFO:root:all scores 4000
Traceback (most recent call last):
File "examples/csv_example/csv_example.py", line 156, in
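For readers following the log above: the Recall, Precision and F-Score lines are consistent with the standard definitions of those metrics. The sketch below is only an illustration of those definitions, not dedupe's cross-validation code; the function name `precision_recall_f` and its arguments are hypothetical.

```python
def precision_recall_f(true_positives, predicted, actual):
    """Standard precision, recall and F1 computed from raw counts.

    true_positives: predicted duplicates that really are duplicates
    predicted:      total number of predicted duplicates
    actual:         total number of real duplicates in the validation set
    """
    # Guard against empty denominators (no predictions / no real duplicates).
    recall = true_positives / float(actual) if actual else 0.0
    precision = true_positives / float(predicted) if predicted else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# One real duplicate, one prediction, and it is correct.
print(precision_recall_f(true_positives=1, predicted=1, actual=1))
# One real duplicate, nothing predicted correctly.
print(precision_recall_f(true_positives=0, predicted=0, actual=1))
```

The first call mirrors the folds that log Recall 1.000000 and Precision 1.000000; the second mirrors the fold that logs Recall 0.000000 and F-Score 0.000000.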
hmm... okay. can you switch over to the dev branch and try it?
I rewrote the semiSupervisedNonDuplicates function over there, but haven't brought the changes back over to master.
@fgregg It worked. But it gets stuck during clustering with the same type of error. This time it is:
File "./examples/csv_example/csv_example.py", line 184, in
I tried to make sense of what you changed by comparing the master and dev versions of semiSupervisedNonDuplicates but didn't really succeed. Could you give me a brief explanation? Perhaps I'll learn something. I am pasting the verbose output again:
$ python ./examples/csv_example/csv_example.py -vv
importing data ...
reading from csv_example_learned_settings
blocking...
INFO:root:Stop word threshold: 500
INFO:root:Stop word: Address, n, 611
INFO:root:Stop word: Address, st, 654
INFO:root:Stop word: Address, s, 1162
INFO:root:Stop word: Address, w, 1122
INFO:root:Stop word: Address, ave, 556
INFO:root:creating TF/IDF canopies
INFO:root:1/2 field 0.40 Phone
INFO:root:2/2 field 0.20 Address
INFO:root:Maximum expected recall and precision
INFO:root:recall: 1.000
INFO:root:precision: 0.916
INFO:root:With threshold: 0.199
clustering...
INFO:root:num chunks 3
INFO:root:all scores 5006
Traceback (most recent call last):
File "./examples/csv_example/csv_example.py", line 184, in
That's very strange. Can you e-mail me your csv_example_training.json file?
On Mon, Apr 29, 2013 at 12:26 AM, nilesh-c wrote:
@fgregg It worked. But it gets stuck during clustering with the same type of error. This time it is:
Traceback (most recent call last):
File "./examples/csv_example/csv_example.py", line 184, in
clustered_dupes = deduper.duplicateClusters(blocked_data, threshold)
File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/api.py", line 400, in duplicateClusters
threshold)
File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/core.py", line 171, in scoreDuplicates
scored_pairs = numpy.unique(scored_pairs)
File "/usr/lib/pymodules/python2.7/numpy/lib/arraysetops.py", line 197, in unique
flag = np.concatenate(([True], ar[1:] != ar[:-1]))
ValueError: shape mismatch: objects cannot be broadcast to a single shape
OK. I've sent the file to your gmail address.
@fgregg I removed my python-numpy package from ubuntu and downloaded the numpy-1.7.0.tar.gz (source), built and installed it. And now this works with the expected output. I guess dedupe doesn't like numpy 1.5.1. :smile:
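A quick way to check which numpy is installed before running the example is sketched below. The 1.7.0 minimum is an assumption taken only from the report above (1.5.1 failed, 1.7.0 worked); the thread does not say which release first fixed the problem, and the helper names `version_tuple` and `numpy_is_recent_enough` are hypothetical.

```python
import re

# Assumption: 1.7.0 is simply the version reported to work in this thread.
MIN_NUMPY = (1, 7, 0)


def version_tuple(version):
    """Turn a version string such as '1.5.1' or '1.7.0rc1' into a tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def numpy_is_recent_enough(version, minimum=MIN_NUMPY):
    """Return True if the given version string is at least `minimum`."""
    return version_tuple(version) >= minimum


if __name__ == "__main__":
    try:
        import numpy
    except ImportError:
        print("numpy is not installed")
    else:
        ok = numpy_is_recent_enough(numpy.__version__)
        print("numpy", numpy.__version__, "ok" if ok else "older than 1.7.0")
```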
Great, you already found an important bug! Closed by 3f9595e1ec32e1c4f91ccd82d21691c9fcb85c92
python examples/csv_example/csv_example.py
This gives the following output:
importing data ...
starting active labeling...
Phone : 5348980
Address : 4820 w walton
Zip : 60651
Site name : chicago public schools mcnair academy center, ronald e.

Phone : 7853940
Address : 1 e 113th st
Zip : 60628
Site name : chicago commons association v & j day care center

Do these records refer to the same thing? (y)es / (n)o / (u)nsure / (f)inished
n
Phone :
Address : 2434 s kildare ave
Zip :
Site name : el valor - carlos cantu

Phone :
Address : 2434 s kildare ave
Zip :
Site name : el valor - carlos cantu

Do these records refer to the same thing? (y)es / (n)o / (u)nsure / (f)inished
y
Phone : 5353035
Address : 7240 s. wabash
Zip :
Site name : deneen

Phone :
Address : 4647 w. washington
Zip : 60644
Site name : home of life community dev. corp. home of life just for you (773)-626-8655

Do these records refer to the same thing? (y)es / (n)o / (u)nsure / (f)inished
n
------------------------- < many more of these until I entered 'f' > ...
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
------------------------- < a whole lot of the above, and then: > ...
blocking...
Traceback (most recent call last):
File "./examples/csv_example/csv_example.py", line 156, in
blocker = deduper.blockingFunction()
File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/api.py", line 261, in blockingFunction
self.predicates = self._learnBlocking(ppc, uncovered_dupes)
File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/api.py", line 343, in _learnBlocking
self.data_model)
File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/training.py", line 165, in semiSupervisedNonDuplicates
threshold=0)
File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/core.py", line 163, in scoreDuplicates
scored_pairs = numpy.unique(scored_pairs)
File "/usr/lib/pymodules/python2.7/numpy/lib/arraysetops.py", line 197, in unique
flag = np.concatenate(([True], ar[1:] != ar[:-1]))
ValueError: shape mismatch: objects cannot be broadcast to a single shape
Running python test/test_dedupe.py gives:
.....
Ran 5 tests in 0.020s
I have Python 2.7.2+, networkx 1.7, fastcluster 1.1.9, hcluster 0.2.0, numpy 1.5.1.
Please help me out. I'm looking forward to hearing from you soon. :)
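The line that fails in the traceback above is `scored_pairs = numpy.unique(scored_pairs)`, and numpy.unique returns the sorted unique elements of its input. As a rough illustration of what that step is meant to achieve, here is a pure-Python sketch that does not depend on the numpy version. The `(id_a, id_b, score)` tuple layout and the function name `unique_scored_pairs` are assumptions for the example; the thread does not show how dedupe actually lays out `scored_pairs`.

```python
def unique_scored_pairs(scored_pairs):
    """Return the distinct (id_a, id_b, score) tuples in sorted order.

    This mimics the sorted-and-unique result of numpy.unique, using a plain
    set so it behaves the same regardless of which numpy is installed.
    """
    return sorted(set(scored_pairs))


# Hypothetical scored pairs; (1, 2, 0.9) appears twice on purpose.
pairs = [(1, 2, 0.9), (3, 4, 0.4), (1, 2, 0.9), (2, 5, 0.7)]
print(unique_scored_pairs(pairs))
# prints [(1, 2, 0.9), (2, 5, 0.7), (3, 4, 0.4)]
```

This is not a substitute for the fix referenced in the thread; it only shows the intended effect of the deduplication step on a small input.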