dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
https://docs.dedupe.io
MIT License

Dedupe prepare_training() error for more than 5K records #1111

Closed SantyGator closed 1 year ago

SantyGator commented 2 years ago

I am loading a data set of about 5K records, and I can get past the prepare_training step without errors. However, when I increase the number of records from 5K to 5.5K or 6K, I get the error below. Any idea where I am going wrong?

Traceback (most recent call last):
  File "/app/scripts/python_dedupe/./python_dedupe_pg_test.py", line 134, in <module>
    deduper.prepare_training(temp_d, training_file=None, sample_size=5000, blocked_proportion=.5)
  File "/usr/local/lib/python3.11/site-packages/dedupe/api.py", line 1327, in prepare_training
    self.active_learner = labeler.DedupeDisagreementLearner(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dedupe/labeler.py", line 393, in __init__
    self.blocker = DedupeBlockLearner(candidate_predicates, data, index_include)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dedupe/labeler.py", line 219, in __init__
    index_data = sample_records(data, 50000)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dedupe/labeler.py", line 443, in sample_records
    keys = random.sample(keys, sample_size)  # type: ignore[assignment]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/random.py", line 436, in sample
    raise TypeError("Population must be a sequence. "
TypeError: Population must be a sequence. For dicts or sets, use sorted(d).

fgregg commented 1 year ago

This was fixed in https://github.com/dedupeio/dedupe/pull/1115
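
For context, the final frame of the traceback comes from Python itself: starting with Python 3.11, `random.sample` rejects sets and dict key views and accepts only sequences. The sketch below reproduces that behavior outside of dedupe and shows one way to sample from a dict safely. The helper name `sample_dict_records` and the toy `records` dict are hypothetical, made up for illustration; this is not necessarily what the linked pull request changed.

```python
import random


def sample_dict_records(data, sample_size, seed=None):
    """Return a random subset of `data` (a dict) with at most `sample_size` items.

    Hypothetical helper for illustration only; not part of dedupe's API.
    """
    if len(data) <= sample_size:
        return dict(data)
    rng = random.Random(seed)
    # list(data) turns the dict's keys into a sequence, which
    # random.sample accepts on every supported Python version.
    keys = rng.sample(list(data), sample_size)
    return {key: data[key] for key in keys}


# Toy stand-in for a record dictionary keyed by record id.
records = {i: {"name": f"record {i}"} for i in range(6000)}

try:
    # On Python 3.11+ a dict key view is not a sequence, so this raises
    # TypeError. Python 3.9 and 3.10 only emit a DeprecationWarning.
    random.sample(records.keys(), 10)
except TypeError as exc:
    print(f"random.sample rejected the key view: {exc}")

subset = sample_dict_records(records, 5000, seed=0)
print(len(subset))
```

Converting the keys with `list(...)` keeps insertion order as the sampling population; `sorted(...)`, as the error message suggests, also works when the keys are mutually comparable.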