WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License

_pickle.load fails for more than 64 words #68

Closed (slavaGanzin closed 5 years ago)

slavaGanzin commented 7 years ago

Hello, I don't know whether this is connected to #50, so I created a new issue:

This works ok:

import ahocorasick
import _pickle

A = ahocorasick.Automaton()
for i in range(0, 64):
    A.add_word(str(i), (i, i))
_pickle.dump(A, open('aho', 'wb'))
_pickle.load(open('aho', 'rb'))
#<ahocorasick.Automaton object at 0x7ff51acc4a58>

And this fails consistently:

import ahocorasick
import _pickle

A = ahocorasick.Automaton()
for i in range(0, 65):
    A.add_word(str(i), (i, i))
_pickle.dump(A, open('aho', 'wb'))
_pickle.load(open('aho', 'rb'))
#---------------------------------------------------------------------------
#ValueError                                Traceback (most recent call last)
#<ipython-input-129-f886db783629> in <module>()
#      3     A.add_word(str(i), (i, i))
#      4 _pickle.dump(A, open('aho', 'wb'))
#----> 5 _pickle.load(open('aho', 'rb'))

#ValueError: binary data truncated (2)

python --version
# Python 3.6.2

pip list | grep aho
# pyahocorasick (1.1.4)

WojciechMula commented 7 years ago

Thanks for the report.

serhiy-storchaka commented 6 years ago

Does it still fail if you replace

_pickle.dump(A, open('aho', 'wb'))

with

with open('aho', 'wb') as f:
    _pickle.dump(A, f)

?

WojciechMula commented 6 years ago

@serhiy-storchaka Why might it make any difference?

serhiy-storchaka commented 6 years ago

If the file was not closed, not all of the data might have been written.
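A minimal sketch of that point, using a plain dict as a stand-in payload instead of an Automaton (the names `data`, `path`, and the size variables are only illustrative). While the handle stays open, the pickled bytes may still sit in the write buffer; closing the file, or using a `with` block, flushes them to disk:

```python
import os
import pickle
import tempfile

# Stand-in payload (an assumption for illustration, not an Automaton).
data = {str(i): (i, i) for i in range(65)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "aho")

    # Handle left open: the bytes may still be in the write buffer.
    f = open(path, "wb")
    pickle.dump(data, f)
    size_before_close = os.path.getsize(path)
    f.close()  # closing flushes the buffer to disk
    size_after_close = os.path.getsize(path)

    # A with block closes (and therefore flushes) the file on exit.
    with open(path, "wb") as out:
        pickle.dump(data, out)
    with open(path, "rb") as src:
        restored = pickle.load(src)

print(size_before_close, size_after_close, restored == data)
```

For a small payload like this one, `size_before_close` will typically be smaller than `size_after_close`, but the exact numbers depend on the interpreter's buffering.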

WojciechMula commented 6 years ago

Thanks, I missed that.

findmyway commented 6 years ago

I've tried that, but I still get the error.

WojciechMula commented 6 years ago

@findmyway Could you please check the latest version of the module?

findmyway commented 6 years ago

@WojciechMula

Excellent! It works now!

Thank you

scottwthompson commented 6 years ago

Hey, I'm still getting this bug.

pip freeze | grep pyahocorasick
> pyahocorasick==1.1.7.dev1

(same problem for 1.1.6 which is on pip)

import ahocorasick
import _pickle

# final_dic and mypath are defined elsewhere (not shown)
AUTO = ahocorasick.Automaton()
for key, value in list(final_dic.items())[0:65]:
    AUTO.add_word(key, value)

AUTO.make_automaton()

with open(mypath, 'wb') as f:
    _pickle.dump(AUTO, f)
with open(mypath, 'rb') as f:
    s = _pickle.load(f)

> ValueError: binary data truncated (2)

import ahocorasick
import _pickle

# Same as above, but with only 63 words
AUTO = ahocorasick.Automaton()
for key, value in list(final_dic.items())[0:63]:
    AUTO.add_word(key, value)

AUTO.make_automaton()

with open(mypath, 'wb') as f:
    _pickle.dump(AUTO, f)
with open(mypath, 'rb') as f:
    s = _pickle.load(f)

> Parsed count 559

Is there anything I can try as a temporary fix? I saw your post saying that you're stretched thin solving pickle issues; I'm just hoping for a workaround for now. Thanks for this lib, awesome work.

Python 3.6.3
Linux 4.9.65-1-MANJARO #1 SMP PREEMPT Fri Nov 24 10:42:19 UTC 2017 x86_64 GNU/Linux

WojciechMula commented 6 years ago

@scottwthompson I have no idea how to fix it right now. :(

scottwthompson commented 6 years ago

@WojciechMula No problem, great work. My workaround is to just recreate the automaton from the data each time instead; for my purposes it's not too slow.
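A rough sketch of that workaround, assuming the words are available as (key, value) pairs. The helper names `save_words`, `load_words`, and `rebuild_automaton` are made up for illustration and are not part of pyahocorasick; only `Automaton()`, `add_word`, and `make_automaton` come from the library:

```python
import pickle


def save_words(words, path):
    """Pickle the plain (key, value) pairs rather than the automaton itself."""
    with open(path, "wb") as f:
        pickle.dump(list(words), f)


def load_words(path):
    """Read back the (key, value) pairs written by save_words."""
    with open(path, "rb") as f:
        return pickle.load(f)


def rebuild_automaton(path):
    """Recreate the automaton from the stored pairs instead of unpickling it."""
    # Imported lazily so the helpers above work without pyahocorasick installed.
    import ahocorasick

    automaton = ahocorasick.Automaton()
    for key, value in load_words(path):
        automaton.add_word(key, value)
    automaton.make_automaton()
    return automaton
```

With something like `save_words(final_dic.items(), mypath)` run once, `rebuild_automaton(mypath)` could then be called at startup. This trades load time for not depending on the automaton's own pickle support.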

pombredanne commented 6 years ago

@scottwthompson Why do you import _pickle? At least on Python 2, this is how it works nicely: https://github.com/nexB/scancode-toolkit/blob/5dcb56815f0fba1e74d7a2314a0c98d0100eb295/src/licensedcode/index.py#L637

Here multiple automatons that are instance attributes of my LicenseIndex class are pickled without any problem: https://github.com/nexB/scancode-toolkit/blob/5dcb56815f0fba1e74d7a2314a0c98d0100eb295/src/licensedcode/index.py#L209
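On the `_pickle` question, a small sketch for Python 3 (not taken from the linked scancode-toolkit code): the public `pickle` module is the documented entry point, and on CPython it re-exports the C implementation from `_pickle` when that is available. Importing `pickle` is therefore the more portable spelling, although on CPython it would not be expected to change the behaviour reported above. The `payload` and `same_function` names are illustrative only:

```python
import pickle

try:
    import _pickle  # private C accelerator; may be absent on other interpreters
except ImportError:
    _pickle = None

# On CPython this is usually True: pickle.dump is the C function from _pickle.
same_function = _pickle is not None and pickle.dump is _pickle.dump

# Round trip through the public API with a stand-in payload (not an Automaton).
payload = [(str(i), (i, i)) for i in range(65)]
restored = pickle.loads(pickle.dumps(payload))

print(same_function, restored == payload)
```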

pombredanne commented 6 years ago

FWIW the automatons are created and used there: https://github.com/nexB/scancode-toolkit/blob/a9083191a04f62c05d588a22fa8f4839eeffc79d/src/licensedcode/match_aho.py

WojciechMula commented 5 years ago

Version 1.1.11 fixes the problem