Open wilko77 opened 2 years ago
I'm experiencing a similar issue with the mysql_example:
creating entity_map database
A component contained 56250 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.9027395568206275
Traceback (most recent call last):
  File "mysql_example.py", line 277, in <module>
    write_cur.executemany('INSERT INTO entity_map VALUES (%s, %s, %s)',
  File "/home/ubuntu/.local/lib/python3.8/site-packages/MySQLdb/cursors.py", line 230, in executemany
    return self._do_execute_many(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/MySQLdb/cursors.py", line 258, in _do_execute_many
    for arg in args:
  File "mysql_example.py", line 50, in cluster_ids
    for cluster, scores in clustered_dupes:
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/api.py", line 341, in cluster
    yield from clustering.cluster(scores, threshold)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 238, in cluster
    for sub_graph in dupe_sub_graphs:
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 51, in connected_components
    yield from _connected_components(edgelist, max_components)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 99, in _connected_components
    for sub_graph in _connected_components(filtered_sub_graph, max_components):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 59, in _connected_components
    component_stops = union_find(edgelist)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 114, in union_find
    it = numpy.nditer(edgelist, ["external_loop"])
ValueError: Iteration of zero-sized operands is not enabled
I got the mysql example to work by adding the "zerosize_ok" option to numpy.nditer in clustering.py. I imagine that this would also resolve the OP's postgres example. I'm not a Python developer, so I don't want to open a PR for this until I have a better understanding of what's going on. In the union_find function in clustering.py, I changed...
it = numpy.nditer(edgelist, ["external_loop"])
to...
it = numpy.nditer(edgelist, ["external_loop", "zerosize_ok"])
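For anyone who wants to see the numpy behaviour in isolation, here is a minimal sketch outside of dedupe. The helper name iterate_chunks and the plain float arrays are my own assumptions for illustration; the real edgelist in clustering.py may use a different dtype and shape. It only shows that numpy.nditer refuses a zero-sized array unless "zerosize_ok" is passed, and says nothing about why the re-filtered component ends up empty in the first place.

```python
import numpy


def iterate_chunks(array, flags):
    """Collect the chunks numpy.nditer yields for `array` with the given flags.

    Hypothetical helper for illustration only; it is not part of dedupe.
    """
    it = numpy.nditer(array, flags)
    return [chunk.copy() for chunk in it]


# A zero-sized array standing in for an edgelist with no edges left
# (assumption: the real edgelist may have a different dtype).
empty = numpy.empty((0, 2), dtype=float)

# Without "zerosize_ok", constructing the iterator raises ValueError.
try:
    iterate_chunks(empty, ["external_loop"])
    raised = False
except ValueError:
    raised = True

# With "zerosize_ok", the iterator is created and simply yields nothing.
empty_chunks = iterate_chunks(empty, ["external_loop", "zerosize_ok"])

# A non-empty array iterates the same way with or without the extra flag.
full = numpy.arange(6, dtype=float).reshape(3, 2)
full_chunks = iterate_chunks(full, ["external_loop", "zerosize_ok"])

print(raised, len(empty_chunks), sum(c.size for c in full_chunks))
```

Note that the flag only stops the iterator from raising; whatever consumes the result of union_find still has to cope with an empty component.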
This still doesn't work for me, even with the fix above. Are there any new solutions?
I ran the postgres example as-is against a postgres database (version 14.2) with dedupe version 2.0.17. After training and clustering, it eventually fails during 'writing results' with the following error: