dedupeio / dedupe-examples

Examples for using the dedupe library
MIT License

MySQL - Overflow error #91

Closed carterrees closed 5 years ago

carterrees commented 5 years ago

I'm running the MySQL example code, tweaked for my data. The full data set has 1,000,000 rows. The code runs fine on a 500,000-row subset; however, running it on the full 1,000,000-row data set throws the error below. Any help is appreciated.

creating entity_map database
10000 blocks 498.3930940628052 seconds
20000 blocks 642.5384395122528 seconds
30000 blocks 699.6566441059113 seconds
40000 blocks 720.7061486244202 seconds
50000 blocks 759.5518596172333 seconds
60000 blocks 775.3582143783569 seconds
70000 blocks 782.8726105690002 seconds
80000 blocks 789.5328388214111 seconds
90000 blocks 844.1630408763885 seconds
100000 blocks 23769.673866271973 seconds
110000 blocks 23780.990204572678 seconds
120000 blocks 23818.60210084915 seconds
130000 blocks 23834.79001903534 seconds
140000 blocks 24488.016387462616 seconds
150000 blocks 24502.224097251892 seconds
DEBUG:dedupe.api:matching done, begin clustering
Traceback (most recent call last):
  File "customer_master_mysql.py", line 395, in <module>
    for cluster, scores in clustered_dupes:
  File "/home/carterrees/PycharmProjects/data_services_predictopotamus/venv_predictopotamus36/lib64/python3.6/site-packages/dedupe/api.py", line 125, in matchBlocks
    for cluster in self._cluster(matches, threshold, *args, **kwargs):
  File "/home/carterrees/PycharmProjects/data_services_predictopotamus/venv_predictopotamus36/lib64/python3.6/site-packages/dedupe/clustering.py", line 148, in cluster
    for sub_graph in dupe_sub_graphs:
  File "/home/carterrees/PycharmProjects/data_services_predictopotamus/venv_predictopotamus36/lib64/python3.6/site-packages/dedupe/clustering.py", line 22, in connected_components
    components = union_find(edgelist['pairs'])
  File "/home/carterrees/PycharmProjects/data_services_predictopotamus/venv_predictopotamus36/lib64/python3.6/site-packages/dedupe/clustering.py", line 90, in union_find
    components[root_a].append(i)
OverflowError: unsigned int is greater than maximum
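For context on the last frame: the traceback fails on an `.append(i)` call, and the wording of the message resembles what CPython's standard `array` module reports when a value does not fit the array's unsigned-int typecode. The sketch below is not dedupe's code. It assumes, without confirmation from the traceback, that `components[root_a]` behaves like an `array.array` with typecode `"I"`; the helper `try_append` is hypothetical and only illustrates how such an overflow can arise once an index passes the unsigned-int limit, and that a wider typecode such as `"Q"` would accept the same value.

```python
import array


def try_append(typecode, value):
    """Append value to a fresh array of the given typecode.

    Returns (True, "") if the value fits, or (False, error_text) if
    the array module raises OverflowError. Hypothetical helper, not
    part of dedupe.
    """
    arr = array.array(typecode)
    try:
        arr.append(value)
    except OverflowError as exc:
        return False, str(exc)
    return True, ""


# Largest value an "I" (unsigned int) array can hold on this platform.
max_uint = 2 ** (8 * array.array("I").itemsize) - 1

# The maximum itself fits; one past it does not.
max_fits, _ = try_append("I", max_uint)
too_big_fits, message = try_append("I", max_uint + 1)

# "Q" (unsigned long long, at least 64 bits) accepts the larger value.
wide_fits, _ = try_append("Q", max_uint + 1)

print("max fits in 'I':", max_fits)
print("max + 1 fits in 'I':", too_big_fits, "-", message)
print("max + 1 fits in 'Q':", wide_fits)
```

If that assumption holds, the failure would depend on how many scored pairs reach the clustering step rather than on the row count directly, which could explain why the 500,000-row subset completes.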

carterrees commented 5 years ago

Moved to main Dedupe issue board.