I am loading a data set of about 5K records, and I am able to get past the prepare_training step without errors.
However, when I increase the number of records from 5K to 5.5K or 6K, I get the error below. Any idea where I am going wrong?
Traceback (most recent call last):
  File "/app/scripts/python_dedupe/./python_dedupe_pg_test.py", line 134, in <module>
    deduper.prepare_training(temp_d, training_file=None, sample_size=5000, blocked_proportion=.5)
  File "/usr/local/lib/python3.11/site-packages/dedupe/api.py", line 1327, in prepare_training
    self.active_learner = labeler.DedupeDisagreementLearner(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dedupe/labeler.py", line 393, in __init__
    self.blocker = DedupeBlockLearner(candidate_predicates, data, index_include)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dedupe/labeler.py", line 219, in __init__
    index_data = sample_records(data, 50000)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dedupe/labeler.py", line 443, in sample_records
    keys = random.sample(keys, sample_size) # type: ignore[assignment]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/random.py", line 436, in sample
    raise TypeError("Population must be a sequence. "
TypeError: Population must be a sequence. For dicts or sets, use sorted(d).
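
For reference, the last frame of the traceback can probably be reproduced outside of dedupe. The sketch below is only an illustration under an assumption: that the population reaching random.sample is a dict keys view, which Python 3.11 no longer accepts. This has not been checked against the dedupe source. The names records and sample_keys are hypothetical and are not part of dedupe.

```python
import random

# Hypothetical stand-in for the record dictionary passed to prepare_training.
records = {i: {"name": f"record {i}"} for i in range(10)}

# Python 3.11 and later reject set-like populations such as dict.keys();
# earlier releases accepted them with a DeprecationWarning.
try:
    random.sample(records.keys(), 3)
except TypeError as exc:
    print("TypeError:", exc)


def sample_keys(data, sample_size):
    """Return up to sample_size keys of data, sampled without replacement."""
    # list() turns the keys view into a sequence, which random.sample accepts.
    keys = list(data.keys())
    if len(keys) > sample_size:
        keys = random.sample(keys, sample_size)
    return keys


print(len(sample_keys(records, 3)))
```

If that assumption holds, the sampling branch only runs once the population is larger than the requested sample size, which would fit the error appearing only after the data set grows.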