dedupeio / dedupe-examples

Examples for using the dedupe library
MIT License
406 stars 214 forks

Update call to dedupe.blocker.index in pgsql_big_dedupe_example for dedupe v0.8.0.1.7. #25

Closed justinmanley closed 9 years ago

justinmanley commented 9 years ago

As it stands, pgsql_big_dedupe_example fails when run with dedupe v0.8.0.1.7 (the latest version currently available on PyPI). Running python pgsql_big_dedupe_example.py produces the following output:

INFO:root:Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
reading from  pgsql_big_dedupe_example_settings
blocking...
creating blocking_map database
creating inverted index
Traceback (most recent call last):
  File "pgsql_big_dedupe_example.py", line 177, in <module>
    deduper.blocker.index(field_data, field)
  File "/home/ec2-user/dedupe-examples/pgsql_big_dedupe_example/local/lib/python2.7/site-packages/dedupe/blocking.py", line 74, in index
    index.index(preprocess(doc))
  File "/home/ec2-user/dedupe-examples/pgsql_big_dedupe_example/local/lib/python2.7/site-packages/dedupe/predicates.py", line 161, in preprocess
    return tuple(ngrams(doc.replace(' ', ''), 2))
AttributeError: 'tuple' object has no attribute 'replace'

This PR fixes this issue by bringing the call to deduper.blocker.index up to date with the current documentation for dedupe v0.8:

blocker.index(field_data, field)
Indexes the data from a field for use in an index predicate.

Parameters:
field_data (set) – The unique field values that appear in your data.
field (string) – The name of the field

for field in deduper.blocker.index_fields:
    # Collect the unique values of this field across all records
    field_data = set(record[field] for record in data)
    deduper.blocker.index(field_data, field)
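
To illustrate the failure mode, here is a minimal sketch that does not use dedupe itself. It assumes the tuples came from single-column database rows (an assumption: the traceback only shows that a tuple reached preprocess). The rows are made-up sample data, and preprocess below is a simplified stand-in for the step shown in the traceback, not the library's code.

```python
# Hypothetical rows, shaped like what a database cursor typically returns
# for a single-column SELECT: each row is a 1-tuple, not a string.
rows = [("123 Main St",), ("456 Oak Ave",), ("123 Main St",)]


def preprocess(doc):
    # Simplified stand-in for the traceback's preprocess step:
    # strip spaces, then build character bigrams.
    text = doc.replace(' ', '')
    return tuple(text[i:i + 2] for i in range(len(text) - 1))


# Passing a row tuple straight through fails the same way the traceback does,
# because tuples have no .replace method.
try:
    preprocess(rows[0])
    failed = False
except AttributeError:
    failed = True

# Unpacking each row into a set of unique strings gives values
# that can be preprocessed without error.
field_data = set(row[0] for row in rows)
bigrams = {value: preprocess(value) for value in field_data}
```

The point of the sketch is only that the indexing step expects plain field values, so the data should be reduced to a set of unique strings before it is indexed.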