dedupeio / dedupe-examples

Examples for using the dedupe library
MIT License

pgsql_big_dedupe_example fails #129

Open wilko77 opened 2 years ago

wilko77 commented 2 years ago

I ran the postgres example as-is against PostgreSQL 14.2 with dedupe 2.0.17. After training and clustering, it eventually fails during 'writing results' with the following error:

writing results
WARNING:dedupe.clustering:A component contained 656982 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.8445158759995937
Traceback (most recent call last):
  File "/Users/******/Code/dedupe-examples/pgsql_big_dedupe_example/pgsql_big_dedupe_example.py", line 304, in <module>
    write_cur.copy_expert('COPY entity_map FROM STDIN WITH CSV',
psycopg2.errors.QueryCanceled: COPY from stdin failed: error in .read() call: ValueError Iteration of zero-sized operands is not enabled
CONTEXT:  COPY entity_map, line 1

evanmuller commented 2 years ago

I'm experiencing a similar issue with the mysql_example:

creating entity_map database
A component contained 56250 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.9027395568206275
Traceback (most recent call last):
  File "mysql_example.py", line 277, in <module>
    write_cur.executemany('INSERT INTO entity_map VALUES (%s, %s, %s)',
  File "/home/ubuntu/.local/lib/python3.8/site-packages/MySQLdb/cursors.py", line 230, in executemany
    return self._do_execute_many(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/MySQLdb/cursors.py", line 258, in _do_execute_many
    for arg in args:
  File "mysql_example.py", line 50, in cluster_ids
    for cluster, scores in clustered_dupes:
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/api.py", line 341, in cluster
    yield from clustering.cluster(scores, threshold)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 238, in cluster
    for sub_graph in dupe_sub_graphs:
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 51, in connected_components
    yield from _connected_components(edgelist, max_components)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 99, in _connected_components
    for sub_graph in _connected_components(filtered_sub_graph, max_components):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 59, in _connected_components
    component_stops = union_find(edgelist)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 114, in union_find
    it = numpy.nditer(edgelist, ["external_loop"])
ValueError: Iteration of zero-sized operands is not enabled

evanmuller commented 2 years ago

I got the mysql example to work by adding the "zerosize_ok" flag to the numpy.nditer call in clustering.py. I imagine this would also resolve the OP's postgres example. I'm not a Python developer, so I don't want to open a PR for this until I have a better understanding of what's going on. In the union_find function in clustering.py, I changed...

it = numpy.nditer(edgelist, ["external_loop"])

to...

it = numpy.nditer(edgelist, ["external_loop", "zerosize_ok"])
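
A minimal sketch of what that flag changes, outside of dedupe. The empty `(0, 2)` integer array here is only a stand-in for an edge list left with no pairs after re-filtering; its shape and dtype are assumptions, not taken from clustering.py. On the numpy versions in this thread, the first call should raise the same ValueError seen in the tracebacks, while the second builds an iterator with nothing to iterate over.

```python
import numpy

# Hypothetical stand-in for an edge list with no remaining pairs.
empty_edges = numpy.empty((0, 2), dtype=numpy.int64)

# Without "zerosize_ok", constructing the iterator is expected to fail
# with "Iteration of zero-sized operands is not enabled".
error_message = None
try:
    numpy.nditer(empty_edges, ["external_loop"])
except ValueError as exc:
    error_message = str(exc)

# With "zerosize_ok", the iterator is created and reports zero items.
it = numpy.nditer(empty_edges, ["external_loop", "zerosize_ok"])

print("error without flag:", error_message)
print("itersize with flag:", it.itersize)
```

This only shows that the flag lets nditer accept an empty array. It does not explain why a re-filtered component ends up empty in the first place, so it may not cover every failure of this kind.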

twright8 commented 1 year ago

This still doesn't work for me, even with the fix above. Are there any new solutions?