dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License

Error in semi-supervised nonduplicates #122

Closed fgregg closed 11 years ago

fgregg commented 11 years ago

python examples/csv_example/csv_example.py

This gives the following output:

importing data ...
starting active labeling...
Phone : 5348980
Address : 4820 w walton
Zip : 60651
Site name : chicago public schools mcnair academy center, ronald e.

Phone : 7853940
Address : 1 e 113th st
Zip : 60628
Site name : chicago commons association v & j day care center

Do these records refer to the same thing? (y)es / (n)o / (u)nsure / (f)inished n
Phone :
Address : 2434 s kildare ave
Zip :
Site name : el valor - carlos cantu

Phone :
Address : 2434 s kildare ave
Zip :
Site name : el valor - carlos cantu

Do these records refer to the same thing? (y)es / (n)o / (u)nsure / (f)inished y
Phone : 5353035
Address : 7240 s. wabash
Zip :
Site name : deneen

Phone :
Address : 4647 w. washington
Zip : 60644
Site name : home of life community dev. corp. home of life just for you (773)-626-8655

Do these records refer to the same thing? (y)es / (n)o / (u)nsure / (f)inished n

------------------------- < many more of these until I entered 'f' > ...

WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds
WARNING:root:not real positives, change size of folds

------------------------- < a whole lot of the above, and then: > ...

blocking...
Traceback (most recent call last):
  File "./examples/csv_example/csv_example.py", line 156, in <module>
    blocker = deduper.blockingFunction()
  File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/api.py", line 261, in blockingFunction
    self.predicates = self._learnBlocking(ppc, uncovered_dupes)
  File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/api.py", line 343, in _learnBlocking
    self.data_model)
  File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/training.py", line 165, in semiSupervisedNonDuplicates
    threshold=0)
  File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/core.py", line 163, in scoreDuplicates
    scored_pairs = numpy.unique(scored_pairs)
  File "/usr/lib/pymodules/python2.7/numpy/lib/arraysetops.py", line 197, in unique
    flag = np.concatenate(([True], ar[1:] != ar[:-1]))
ValueError: shape mismatch: objects cannot be broadcast to a single shape
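The traceback ends inside numpy.unique, at the comparison ar[1:] != ar[:-1]. One way to check whether that call is the culprit is to deduplicate the scored pairs without NumPy. A minimal sketch, assuming each scored pair can be read as a (pair, score) tuple; unique_scored_pairs is a hypothetical helper for illustration, not part of dedupe:

```python
def unique_scored_pairs(scored_pairs):
    """Return the distinct (pair, score) entries in sorted order.

    Hypothetical stand-in for a sorted-unique step; it assumes every
    entry is a (pair, score) tuple where pair is a sequence of record ids.
    """
    # Tuples are hashable, so a set drops exact duplicates.
    distinct = {(tuple(pair), float(score)) for pair, score in scored_pairs}
    return sorted(distinct)


if __name__ == "__main__":
    scored = [((1, 2), 0.9), ((3, 4), 0.1), ((1, 2), 0.9)]
    for pair, score in unique_scored_pairs(scored):
        print(pair, score)
```

If this runs cleanly on the same data, the problem is more likely in how the installed NumPy compares the array elements than in the scores themselves.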


Running python test/test_dedupe.py gives:

.....

Ran 5 tests in 0.020s


I have Python 2.7.2+, networkx 1.7, fastcluster 1.1.9, hcluster 0.2.0, numpy 1.5.1.

Please help me out. I'm looking forward to hearing from you soon. :)
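For bug reports like this one, the version list can be gathered in one step. A small sketch, assuming a modern Python (3.8+, where importlib.metadata is available) rather than the Python 2.7 used above; the package list is taken from this report and installed_versions is a hypothetical helper:

```python
import platform
from importlib import metadata

# Package names taken from the report above; adjust as needed.
PACKAGES = ["numpy", "networkx", "fastcluster", "hcluster", "dedupe"]


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version}")
```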

fgregg commented 11 years ago

@nilesh-c, this is the right place to report bugs. Could you run dedupe with verbose output:

python examples/csv_example/csv_example.py -vv

and paste the trace here?
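For context, a repeated -v flag is commonly mapped to logging levels. The sketch below shows one way a script could do that with argparse; it is an assumption for illustration and not necessarily how csv_example.py is written (log_level and parse_verbosity are hypothetical names):

```python
import argparse
import logging


def log_level(verbosity):
    """Translate a -v count into a logging level."""
    if verbosity >= 2:
        return logging.DEBUG    # -vv: show DEBUG and above
    if verbosity == 1:
        return logging.INFO     # -v: show INFO and above
    return logging.WARNING      # default: warnings and errors only


def parse_verbosity(argv):
    """Count how many times -v / --verbose appears in argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbose", action="count", default=0)
    return parser.parse_args(argv).verbose


if __name__ == "__main__":
    logging.basicConfig(level=log_level(parse_verbosity(["-vv"])))
    logging.debug("visible only with -vv")
```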

nilesh-c commented 11 years ago

@fgregg Thanks, I was confused about whether I should open a new issue for this.

Here is the output you asked for:

python examples/csv_example/csv_example.py -vv importing data ... reading labeled examples from csv_example_training.json INFO:root:reading training from file INFO:root:using cross validation to find optimum alpha... DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real 
positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:Average Score: 0.300000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not 
real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:Average Score: 0.300000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds 
WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:Average Score: 0.300000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of 
folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:Average Score: 0.300000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 
1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:Average Score: 0.300000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted 
duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:Average Score: 0.300000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in 
training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:Average Score: 0.300000 INFO:root:optimum alpha: 1.000000 INFO:root:Learned Weights INFO:root:('Phone', -0.34075412154197693) INFO:root:('Address', -0.10351292788982391) INFO:root:('Zip', 0.033473193645477295) INFO:root:('Site name', -0.2907579243183136) INFO:root:('Phone: not_missing', 0.0) INFO:root:('Zip: not_missing', 0.0) INFO:root:('bias', 1.360727474729842) starting active labeling... INFO:root:calculated fieldDistances in 2.80804491043 seconds INFO:root:finding the next uncertain pair ... Phone : 6452300 Address : 1343 n. california Zip :
Site name : casa central - csc development program

Phone : 6452300
Address : 1343 n. california ave
Zip : 60622
Site name : community service center

Do these records refer to the same thing? (y)es / (n)o / (u)nsure / (f)inished y
INFO:root:finding the next uncertain pair ...
Phone :
Address : 2718 w 59th st
Zip :
Site name : easter seals society of metropolitan chicago - the keeper's inst.

Phone : 5211600
Address : 2929 w 19th street
Zip : 60623
Site name : carole robertson center for learning

Do these records refer to the same thing? (y)es / (n)o / (u)nsure / (f)inished n
INFO:root:finding the next uncertain pair ...
Phone : 5341804
Address : 1326 s. avers ave
Zip :
Site name : henson

Phone : 5341665
Address : 1616 s. avers
Zip : 60623
Site name : penn (blended)

Do these records refer to the same thing? (y)es / (n)o / (u)nsure / (f)inished f Finished labeling INFO:root:using cross validation to find optimum alpha... DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change 
size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:Average Score: 0.350000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training 
set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:Average Score: 0.350000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in 
training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:Average Score: 0.350000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real 
positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:Average Score: 0.350000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 
duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:Average Score: 0.350000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:0 true predicted dupes in training set DEBUG:root:Recall 0.000000 DEBUG:root:F-Score 0.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted 
dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:Average Score: 0.300000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:0 true predicted dupes in training set DEBUG:root:Recall 0.000000 
DEBUG:root:F-Score 0.000000 WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 DEBUG:root:1 duplicates in validation set DEBUG:root:1 true predicted dupes in training set DEBUG:root:Recall 1.000000 DEBUG:root:1 predicted duplicates DEBUG:root:Precision 1.000000 DEBUG:root:F-Score 1.000000 WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds WARNING:root:not real positives, change size of folds DEBUG:root:Average Score: 0.300000 INFO:root:optimum alpha: 0.010000 INFO:root:Learned Weights INFO:root:('Phone', -1.3635646104812622) INFO:root:('Address', -0.12422920018434525) INFO:root:('Zip', 1.5542700290679932) INFO:root:('Site name', -0.5060533285140991) INFO:root:('Phone: not_missing', 0.0) INFO:root:('Zip: not_missing', 0.0) INFO:root:('bias', 2.8320163776556644) blocking... 
INFO:root:num chunks 2
INFO:root:all scores 4000
Traceback (most recent call last):
  File "examples/csv_example/csv_example.py", line 156, in
    blocker = deduper.blockingFunction()
  File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/api.py", line 261, in blockingFunction
    self.predicates = self._learnBlocking(ppc, uncovered_dupes)
  File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/api.py", line 343, in _learnBlocking
    self.data_model)
  File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/training.py", line 165, in semiSupervisedNonDuplicates
    threshold=0)
  File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/core.py", line 163, in scoreDuplicates
    scored_pairs = numpy.unique(scored_pairs)
  File "/usr/lib/pymodules/python2.7/numpy/lib/arraysetops.py", line 197, in unique
    flag = np.concatenate(([True], ar[1:] != ar[:-1]))
ValueError: shape mismatch: objects cannot be broadcast to a single shape

fgregg commented 11 years ago

hmm... okay. can you switch over to the dev branch and try it?

I rewrote the semiSupervisedNonDuplicates function over there, but haven't brought the changes back over to master.

nilesh-c commented 11 years ago

@fgregg It worked. But it gets stuck during clustering with the same type of error. This time it is:

  File "./examples/csv_example/csv_example.py", line 184, in
    clustered_dupes = deduper.duplicateClusters(blocked_data, threshold)
  File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/api.py", line 400, in duplicateClusters
    threshold)
  File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/core.py", line 171, in scoreDuplicates
    scored_pairs = numpy.unique(scored_pairs)

I tried to make sense of what you changed by comparing the master and dev versions of semiSupervisedNonDuplicates but didn't really succeed. Could you give me a brief explanation? Perhaps I'll learn something. I am pasting the verbose output again:

$ python ./examples/csv_example/csv_example.py -vv

importing data ...
reading from csv_example_learned_settings
blocking...
INFO:root:Stop word threshold: 500
INFO:root:Stop word: Address, n, 611
INFO:root:Stop word: Address, st, 654
INFO:root:Stop word: Address, s, 1162
INFO:root:Stop word: Address, w, 1122
INFO:root:Stop word: Address, ave, 556
INFO:root:creating TF/IDF canopies
INFO:root:1/2 field 0.40 Phone
INFO:root:2/2 field 0.20 Address
INFO:root:Maximum expected recall and precision
INFO:root:recall: 1.000
INFO:root:precision: 0.916
INFO:root:With threshold: 0.199
clustering...
INFO:root:num chunks 3
INFO:root:all scores 5006
Traceback (most recent call last):
  File "./examples/csv_example/csv_example.py", line 184, in
    clustered_dupes = deduper.duplicateClusters(blocked_data, threshold)
  File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/api.py", line 400, in duplicateClusters
    threshold)
  File "/usr/local/lib/python2.7/dist-packages/Dedupe-0.3-py2.7-linux-i686.egg/dedupe/core.py", line 171, in scoreDuplicates
    scored_pairs = numpy.unique(scored_pairs)
  File "/usr/lib/pymodules/python2.7/numpy/lib/arraysetops.py", line 197, in unique
    flag = np.concatenate(([True], ar[1:] != ar[:-1]))
ValueError: shape mismatch: objects cannot be broadcast to a single shape
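
For readers following the traceback: the failing line inside numpy's `unique` is a sort-then-compare-neighbours idiom. Below is a minimal sketch of that idiom on a plain 1-D array. The function name `unique_sorted` is hypothetical, this is not numpy's actual implementation, and it does not reproduce the error above. The thread does not say why the comparison failed on `scored_pairs`; presumably the older numpy could not compare that array's elements element-wise, but that is an assumption.

```python
import numpy as np


def unique_sorted(values):
    """Return the sorted unique values of a 1-D sequence.

    Simplified illustration of the idiom in the traceback:
    sort, then keep each element that differs from its predecessor.
    """
    ar = np.sort(np.asarray(values).ravel())
    if ar.size == 0:
        # Nothing to compare; avoids building a length-1 mask for an empty array.
        return ar
    # The first element is always kept; the rest are kept only where
    # they differ from the element immediately before them.
    flag = np.concatenate(([True], ar[1:] != ar[:-1]))
    return ar[flag]


print(unique_sorted([3, 1, 3, 2, 1]))  # [1 2 3]
```

The `ar[1:] != ar[:-1]` comparison has to yield one boolean per adjacent pair for the mask to line up with `ar`; the ValueError above is raised at exactly that step.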

fgregg commented 11 years ago

That's very strange. Can you e-mail me your csv_example_training.json file?

nilesh-c commented 11 years ago

OK. I've sent the file to your gmail address.

nilesh-c commented 11 years ago

@fgregg I removed the python-numpy package from Ubuntu, downloaded numpy-1.7.0.tar.gz (source), then built and installed it. Now it works with the expected output. I guess dedupe doesn't like numpy 1.5.1. :smile:
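
A quick way to check for this situation before running the example is to compare the installed numpy version against the one reported working here. This is only a sketch: `parse_version`, `numpy_is_known_good`, and `KNOWN_GOOD` are hypothetical names, and 1.7.0 is simply the version that worked in this thread, not a documented minimum requirement of dedupe.

```python
# Assumption: 1.7.0 worked and 1.5.1 failed, per this thread.
# The true minimum version may lie anywhere in between.
KNOWN_GOOD = (1, 7, 0)


def parse_version(version_string):
    """Turn a string such as '1.7.0' or '1.7.0rc1' into a tuple like (1, 7, 0)."""
    parts = []
    for piece in version_string.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '1.7' so tuple comparison behaves as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def numpy_is_known_good(version_string=None):
    """Return True if the given (or installed) numpy is at least KNOWN_GOOD."""
    if version_string is None:
        import numpy  # imported lazily so the helper can be tested without numpy
        version_string = numpy.__version__
    return parse_version(version_string) >= KNOWN_GOOD


print(numpy_is_known_good("1.5.1"))  # False
print(numpy_is_known_good("1.7.0"))  # True
```

Calling `numpy_is_known_good()` with no argument checks whatever numpy is currently importable.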

fgregg commented 11 years ago

Great, you already found an important bug! Closed by 3f9595e1ec32e1c4f91ccd82d21691c9fcb85c92