dedupeio / dedupe-examples

Examples for using the dedupe library
MIT License

Getting Error while running csv example for my file #136

Open purnima1612 opened 4 months ago

purnima1612 commented 4 months ago

Hello all, I am trying to run the csv example on my own file, which has 850 records. I am also trying to find duplicates using a custom comparison function based on Levenshtein distance. The goal is to group all names that share a name match of more than 80% under one entity_num.
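
For reference, the kind of comparison described above could look roughly like the sketch below. It is not the code from the failing script, and it does not use any dedupe API. The names `levenshtein` and `name_similarity` are hypothetical, and treating "80% match" as `1 - distance / length of the longer name` is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical (assumed definition)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty names are treated as identical
    return 1.0 - levenshtein(a, b) / longest


# Names would be grouped together when the similarity passes the threshold.
THRESHOLD = 0.8
same_entity = name_similarity("Jonathan", "Jonathon") >= THRESHOLD
```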

While preparing the data, I changed the sample size to 50:
deduper.prepare_training(data_d, sample_size=50)

After I finish labeling, I get the following error:


Traceback (most recent call last):
  File "C:\Python_Projects\Python_extra_code\csv_example.py", line 132, in <module>
    deduper.train()
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\api.py", line 1215, in train
    self.predicates = self.active_learner.learn_predicates(recall, index_predicates)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\labeler.py", line 397, in learn_predicates
    return self.blocker.learn_predicates(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\labeler.py", line 136, in learn_predicates
    return self.block_learner.learn(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\training.py", line 72, in learn
    candidate_cover = self.random_forest_candidates(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\training.py", line 112, in random_forest_candidates
    sample_predicates = random.sample(predicates, pred_sample_size)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Dev\Python3.11\Lib\random.py", line 453, in sample
    raise ValueError("Sample larger than population or is negative")
ValueError: Sample larger than population or is negative

Process finished with exit code 1
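
The last frame of the traceback is a call to `random.sample(predicates, pred_sample_size)` that asks for more items than `predicates` contains. That behaviour of the standard library can be reproduced in isolation, as in the minimal sketch below. The `predicates` list and the `safe_sample` helper are hypothetical stand-ins and are not part of dedupe. Clamping the sample size is shown only to illustrate the condition, not as the library's fix.

```python
import random

# Hypothetical stand-in for the candidate predicates seen in the traceback.
predicates = ["p1", "p2", "p3"]

# Asking for more items than the population holds raises ValueError.
try:
    random.sample(predicates, 5)
    error_message = ""
except ValueError as exc:
    error_message = str(exc)


def safe_sample(population, k):
    """Sample without replacement, clamping k to the population size."""
    k = max(0, min(k, len(population)))
    return random.sample(population, k)


# With clamping, the call returns every item instead of raising.
clamped = safe_sample(predicates, 5)
```

One plausible reading, stated here as an assumption, is that the very small `sample_size` left too few candidate predicates for the requested sample. Rerunning with the default sample size would be a quick way to check this.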