dedupeio / dedupe-examples

Examples for using the dedupe library
MIT License

Records do not line up with data model #89

Closed Pixelartist closed 5 years ago

Pixelartist commented 5 years ago

Hello,

I have pretty much the same issue as @tendres, but I cannot see a wrong Postgres URL or any other database misconfiguration. I tried "pgsql_big_dedupe_example_init_db" and it fails. I had actually started switching to my own data, but that fails for the same reason. It seems I do not understand how the data model lines up with the selected data. I checked that the columns are the same (for my own project) and it still failed. Now I am running the example and it fails again. Can someone please explain how the deduper consumes the data and what I might be doing wrong?

Additional information: I am running Python 3 and changed the database connection to a single line by replacing

# Set the database connection from environment variable using
# [dj_database_url](https://github.com/kennethreitz/dj-database-url)
# For example:
#   export DATABASE_URL=postgres://user:password@host/mydatabase
# db_conf = dj_database_url.config()
#
# if not db_conf:
#     raise Exception(
#         'set DATABASE_URL environment variable with your connection, e.g. '
#         'export DATABASE_URL=postgres://user:password@host/mydatabase'
#     )
#
# con = psycopg2.connect(database=db_conf['NAME'],
#                        user=db_conf['USER'],
#                        password=db_conf['PASSWORD'],
#                        host=db_conf['HOST'],
#                        cursor_factory=psycopg2.extras.RealDictCursor)

with

con = psycopg2.connect("dbname='mydb' user='myname' host='myhost' password='mypass' port='myport'")

Where I think the error is: I am using

DONOR_SELECT = "SELECT donor_id, city, name, zip, state, address," \
               "occupation, employer, person from dev_manuel.processed_donors"

and

    fields = [{'field': 'name', 'variable name': 'name', 'type': 'String'},
              {'field': 'address', 'type': 'String', 'variable name': 'address', 'has missing': True},
              {'field': 'city', 'type': 'String', 'has missing': True},
              {'field': 'state', 'type': 'String'},
              {'field': 'zip', 'type': 'String', 'has missing': True},
              {'field': 'person', 'variable name': 'person', 'type': 'Categorical', 'categories': [0, 1]},
              {'type': 'Interaction', 'interaction variables': ['person', 'address']},
              {'type': 'Interaction', 'interaction variables': ['name', 'address']}
              ]

from the example at https://dedupeio.github.io/dedupe-examples/docs/pgsql_big_dedupe_example.html, which differs from the GitHub version, by the way.

Printing the object (one selected row) gives me: 706027: (706028, 'chicago', 'zylpha crowe', '60625', 'il', '4564 n. virginia ave.', 'none', 'none', 0)

But it seems this does not match. Any help, please?

Pixelartist commented 5 years ago

I think I found some nice information - the format for the deduper to digest is:

{
'Id': '3336',
'Source': 'purple_binder_early_childhood.csv',
'Site name': 'haymarket center',
...
}

Given this, I assume I need a different output from the database, more like comma-separated "key: value" pairs. I will try this now.

Pixelartist commented 5 years ago

As mentioned, that was exactly the problem. By using cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor), I got the rows in the dictionary format that the deduper object needs.
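
For anyone hitting the same error, here is a minimal sketch of why a tuple row does not line up with the field definitions while a dictionary row does. It uses no database connection: the row is the one printed in the question, and the column order is taken from the DONOR_SELECT statement. The names COLUMNS, row_to_record, and lookup_fails are hypothetical helpers for illustration only; they are not part of dedupe or psycopg2.

```python
# Column order taken from the DONOR_SELECT statement in the question.
COLUMNS = ["donor_id", "city", "name", "zip", "state",
           "address", "occupation", "employer", "person"]

# The row printed in the question, as a tuple-returning cursor yields it.
tuple_row = (706028, "chicago", "zylpha crowe", "60625", "il",
             "4564 n. virginia ave.", "none", "none", 0)


def row_to_record(columns, row):
    """Pair each column name with its value (hypothetical helper)."""
    return dict(zip(columns, row))


def lookup_fails(row, field):
    """Return True if looking up a field by name on this row raises."""
    try:
        row[field]
    except (TypeError, KeyError):
        return True
    return False


record = row_to_record(COLUMNS, tuple_row)

# The field definitions refer to columns by name ('name', 'address', ...).
# A tuple can only be indexed by position, so a lookup by name fails.
print(lookup_fails(tuple_row, "name"))
# A dictionary keyed by column name supports the same lookup.
print(lookup_fails(record, "name"))
print(record["name"], "|", record["address"])
```

A dictionary-style cursor does this pairing for every row, so no manual conversion is needed once the cursor factory is set.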