dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License

csv_example: Negative size passed to PyString_FromStringAndSize error #571

Closed suneetdewan closed 7 years ago

suneetdewan commented 7 years ago

Hi, I am just getting started with dedupe and tried to run the csv_example out of the box and ran into this error:

Traceback (most recent call last):
  File "csv_example.py", line 161, in <module>
    threshold = deduper.threshold(data_d, recall_weight=2)
  File "/Users/suneetdewan/Documents/projects/dedupe/dedupe/api.py", line 237, in threshold
    return self.thresholdBlocks(blocked_pairs, recall_weight)
  File "/Users/suneetdewan/Documents/projects/dedupe/dedupe/api.py", line 68, in thresholdBlocks
    probability = core.scoreDuplicates(self._blockedPairs(blocks),
  File "/Users/suneetdewan/Documents/projects/dedupe/dedupe/api.py", line 248, in _blockedPairs
    block, blocks = core.peek(blocks)
  File "/Users/suneetdewan/Documents/projects/dedupe/dedupe/core.py", line 279, in peek
    record = next(records)
  File "/Users/suneetdewan/Documents/projects/dedupe_env/lib/python2.7/site-packages/future/builtins/newnext.py", line 59, in newnext
    return iterator.next()
  File "/Users/suneetdewan/Documents/projects/dedupe/dedupe/api.py", line 281, in _blockData
    for block in viewvalues(blocks):
  File "/Users/suneetdewan/Documents/projects/dedupe_env/lib/python2.7/site-packages/future/utils/__init__.py", line 297, in viewvalues
    return func(**kwargs)
  File "/Users/suneetdewan/Documents/projects/dedupe_env/lib/python2.7/UserDict.py", line 120, in values
    return [v for _, v in self.iteritems()]
  File "/Users/suneetdewan/Documents/projects/dedupe_env/lib/python2.7/UserDict.py", line 110, in iteritems
    for k in self:
  File "/Users/suneetdewan/Documents/projects/dedupe_env/lib/python2.7/UserDict.py", line 97, in __iter__
    for k in self.keys():
  File "/Users/suneetdewan/anaconda/lib/python2.7/shelve.py", line 101, in keys
    return self.dict.keys()
SystemError: Negative size passed to PyString_FromStringAndSize

Running on Python 2.7 on OS X.

fgregg commented 7 years ago

@suneetdewan could you please run

otool -L $(python3.6 -c 'import _dbm;print(_dbm.__file__)')

on the command line and paste what you see.

cipriantarta commented 7 years ago

otool -L $(python3.6 -c 'import _dbm;print(_dbm.__file__)')
/usr/local/var/pyenv/versions/3.6.1/lib/python3.6/lib-dynload/_dbm.cpython-36m-darwin.so:
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1238.50.2)

fgregg commented 7 years ago

@liufuyang did you have this problem, and did you find a solution?

liufuyang commented 7 years ago

@fgregg no, I never encountered this exact problem, but I had another dbm issue:

HASH: Out of overflow pages.  Increase page size
Traceback (most recent call last):
  File "/Users/tendres/PycharmProjects/dedupe/tests/test_shelve.py", line 25, in <module>
    shelf[k] += [(i, record, ids)]
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/shelve.py", line 125, in __setitem__
    self.dict[key.encode(self.keyencoding)] = f.getvalue()
_dbm.error: cannot add item to database

Process finished with exit code 1

I assume this is because Python 3 on macOS is not linked against gdbm. (I tried brew install for both python3 and gdbm; both are installed, but they are not linked together.) So when shelve imports dbm, it cannot use gdbm (which seems to have no space limit) and uses ndbm instead, which has a page limitation on macOS. I have debugged and confirmed this behavior.
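For anyone who wants to check the same thing on their own machine, here is a minimal sketch, assuming Python 3 (on Python 2.7 the equivalent modules are gdbm, dbm, dumbdbm and whichdb). It lists which dbm backends the interpreter can import and asks dbm.whichdb which backend shelve.open actually used for a throwaway file. The helper names are made up for illustration and are not part of dedupe.

```python
import dbm
import importlib
import os
import shelve
import tempfile


def available_dbm_backends():
    """Return the names of the dbm backends this interpreter can import."""
    found = []
    for name in ("dbm.gnu", "dbm.ndbm", "dbm.dumb"):
        try:
            importlib.import_module(name)
        except ImportError:
            # The backend was not compiled in (e.g. gdbm missing at build time).
            continue
        found.append(name)
    return found


def backend_used_by_shelve():
    """Create a throwaway shelf and ask dbm.whichdb which backend wrote it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "probe")
        with shelve.open(path) as shelf:
            shelf["key"] = "value"
        return dbm.whichdb(path)


if __name__ == "__main__":
    print("importable backends:", available_dbm_backends())
    print("shelve.open picked:", backend_used_by_shelve())
```

If dbm.gnu is missing from the first list and the second line reports dbm.ndbm, that would be consistent with the situation described above.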

I could not solve this issue, so for now I fell back to dedupe version 1.6.5 as my solution ...

BTW I think this issue is related to this as well: https://github.com/dedupeio/csvdedupe/issues/67
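As a possible workaround for code you control, one option is to back the shelf with dbm.dumb, the pure-Python backend that ships with every Python 3 build, instead of whatever dbm picks by default. This is only a sketch under that assumption: it is not necessarily what dedupe or the temp_shelve branch does, open_dumb_shelf is a hypothetical helper, and dbm.dumb is documented as slow, so it may not suit large datasets.

```python
import dbm.dumb
import os
import shelve
import tempfile


def open_dumb_shelf(path):
    """Open a shelf stored with dbm.dumb instead of the platform default."""
    # shelve.Shelf wraps any dict-like object that maps bytes keys to bytes values.
    return shelve.Shelf(dbm.dumb.open(path, "c"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        with open_dumb_shelf(os.path.join(tmp, "blocks")) as shelf:
            # Hypothetical block keys and record ids, for illustration only.
            for i in range(1000):
                shelf["block-%d" % i] = list(range(50))
            print(len(shelf), "blocks stored")
```

Because the backend is chosen explicitly, the result does not depend on whether the interpreter was built against gdbm or ndbm.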

fgregg commented 7 years ago

Would someone who has had the Negative size passed to PyString_FromStringAndSize error try the temp_shelve branch and let me know if they still see the error? @brandonlipman @motrippy

fgregg commented 7 years ago

I believe I may have addressed this. Please reopen if not.

akkaneko commented 6 years ago

I'm having the exact same problem. Here's the error:

Traceback (most recent call last):
  File "csv_example.py", line 151, in <module>
    threshold = deduper.threshold(data_d, recall_weight=1)
  File "/Users/ar-apollo.kaneko/Library/Python/2.7/lib/python/site-packages/dedupe/api.py", line 244, in threshold
    return self.thresholdBlocks(blocked_pairs, recall_weight)
  File "/Users/ar-apollo.kaneko/Library/Python/2.7/lib/python/site-packages/dedupe/api.py", line 69, in thresholdBlocks
    candidate_records = itertools.chain.from_iterable(self._blockedPairs(blocks))
  File "/Users/ar-apollo.kaneko/Library/Python/2.7/lib/python/site-packages/dedupe/api.py", line 255, in _blockedPairs
    block, blocks = core.peek(blocks)
  File "/Users/ar-apollo.kaneko/Library/Python/2.7/lib/python/site-packages/dedupe/core.py", line 361, in peek
    record = next(records)
  File "/Users/ar-apollo.kaneko/Library/Python/2.7/lib/python/site-packages/future/builtins/newnext.py", line 59, in newnext
    return iterator.next()
  File "/Users/ar-apollo.kaneko/Library/Python/2.7/lib/python/site-packages/dedupe/api.py", line 288, in _blockData
    for block in viewvalues(blocks):
  File "/Users/ar-apollo.kaneko/Library/Python/2.7/lib/python/site-packages/future/utils/__init__.py", line 297, in viewvalues
    return func(**kwargs)
  File "/Users/ar-apollo.kaneko/Library/Python/2.7/lib/python/site-packages/dedupe/core.py", line 434, in values
    return viewvalues(self.shelve)
  File "/Users/ar-apollo.kaneko/Library/Python/2.7/lib/python/site-packages/future/utils/__init__.py", line 297, in viewvalues
    return func(**kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/UserDict.py", line 120, in values
    return [v for _, v in self.iteritems()]
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/UserDict.py", line 110, in iteritems
    for k in self:
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/UserDict.py", line 97, in __iter__
    for k in self.keys():
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shelve.py", line 101, in keys
    return self.dict.keys()
SystemError: Negative size passed to PyString_FromStringAndSize

Running Python 2.7 on OS X.