Closed: EmilStenstrom closed this issue 5 years ago
I understand that this is something that is tricky to reproduce. Therefore I've created a new repository with my code, and invited you to that repository. I've added documentation on how to run the code there.
Beware that running through the full wikidata dump with 24 million entries takes several hours. After all the building is done you can quickly run the example and see it fail.
Somehow the check UNLIKELY(size < count*(sizeof(TrieNode) - sizeof(TrieNode*)))
returns true, preventing the pickle file from being read.
Let me know if there is anything I can do to help troubleshoot this.
@EmilStenstrom Thanks a lot for your effort! I'll try to reproduce the bug.
@EmilStenstrom I was able to build wikidata-reduced.json; it was really time-consuming. :) Now I can debug, thank you.
@WojciechMula Phew. Now you have some more waiting to do as you build the automaton, and then try to search it. Building the automaton works. The crash occurs when you try to load it using the last command.
Sorry about the long waits :)
@EmilStenstrom I'm working on this issue now, and managed to fix an ugly memory leak. It's not a fix yet. :)
@WojciechMula That sounds fantastic! :) I'm happy all that processing power didn't go to waste.
@EmilStenstrom I'm still trying to reproduce the bug. Unfortunately, my laptop has too little memory, and your app is killed after eating all 4 GB. I tried to split the input and then build/pickle/unpickle smaller chunks, but nothing has gone wrong so far. I supposed there were some Unicode-related problems (like #53), but it seems that's not the case. Just writing to give you feedback.
@WojciechMula I'm thinking of different ways of helping out. Would it help if I sent you the pickle file? It is 88 Mb if I zip it, so I think I can give you a dropbox link? What e-mail should I send the link to?
I had an idea that maybe the whole file was truncated? So I found that you can inspect a pickle file with pickletools from the Python standard library. But it seems it ends in the expected way:
$ python -m pickletools wikidata_automation.pickle | tail
268149006: r LONG_BINPUT 13858041
268149011: X BINUNICODE 'Q27876039'
268149025: r LONG_BINPUT 13858042
268149030: e APPENDS (MARK at 268141046)
268149031: t TUPLE (MARK at 27)
268149032: r LONG_BINPUT 13858043
268149037: R REDUCE
268149038: r LONG_BINPUT 13858044
268149043: . STOP
highest protocol among opcodes = 3
@EmilStenstrom If you can, please send me the pickle file directly. My e-mail: wojciech_mula@poczta.onet.pl
Sent the link to your e-mail!
I also pushed some updates to the script that creates the wikidata-reduced.json file (I sent you the old file, not the updated one, to make sure you can reproduce). The updated file now excludes lots of entities I'm not interested in anyway. It should be about half the size. Maybe that makes it possible to create the automaton on 4 GB? I'm on a MacBook Pro from work with 16 GB RAM, so I can deal with huge files.
@EmilStenstrom It just clicked what's wrong. If your automaton takes several gigabytes, it's almost impossible for the pickle file to be several times smaller.
@WojciechMula: So something is wrong with how I create the pickle file?
@EmilStenstrom You're doing everything perfectly right; there's some bug in the pickling code. I just created an automaton with 1,000,000 words and the pickled file is 350 MB.
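The size argument can be sanity-checked with plain pickle (a rough illustration, not the library's actual node layout): pickle stores a bytes payload verbatim, without compression, so an automaton whose node table occupies N bytes in memory cannot pickle to a file much smaller than N.

```python
import io
import pickle

# Stand-in for a 10 MB node table; pickle must carry roughly those
# 10 MB, plus only a few bytes of opcode overhead.
blob = bytes(10_000_000)
buf = io.BytesIO()
pickle.dump(blob, buf, protocol=3)
print(len(buf.getvalue()))  # slightly more than 10_000_000
```

So a several-hundred-MB pickle of a several-GB automaton suggests the node blob was dropped somewhere along the way.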
@EmilStenstrom You've shown the tail of the pickled file, but could you please show the beginning of the file? On my system I have:
$ python3 -m pickletools ref.pickle
0: \x80 PROTO 3
2: c GLOBAL 'ahocorasick Automaton'
25: q BINPUT 0
27: ( MARK
28: J BININT 168760016
33: C SHORT_BINBYTES b''
35: q BINPUT 1
37: K BININT1 2
39: K BININT1 2
41: J BININT 16182875
46: J BININT 16182874
51: M BININT2 310
54: ] EMPTY_LIST
55: q BINPUT 2
57: ( MARK
For sure the file is corrupted. At offset 33 there is an empty bytes object, while it should be a large blob of data, and the field at offset 37 should be 20. For now I have no idea what's wrong; of course I will continue working on this.
Is everything OK when you build smaller automatons? Did you compile Python yourself, or does it come from a precompiled package?
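The by-eye inspection above can also be done programmatically: pickletools.genops walks the opcode stream, so a helper can return the first bytes payload and make an unexpectedly empty SHORT_BINBYTES (as at offset 33) easy to spot. This is a sketch on a synthetic pickle, not the library's actual file:

```python
import pickle
import pickletools

def first_bytes_payload(data):
    """Return (offset, payload) of the first bytes opcode, or None."""
    for opcode, arg, pos in pickletools.genops(data):
        if opcode.name in ("SHORT_BINBYTES", "BINBYTES", "BINBYTES8"):
            return pos, arg
    return None

# Synthetic example mimicking the corrupted header: a count followed by
# an empty bytes object where a large blob was expected.
data = pickle.dumps((168760016, b""), protocol=3)
print(first_bytes_payload(data))  # payload is b'' -> corrupted
```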
Here's the first 20 lines of my file:
$ python -m pickletools wikidata_automation.pickle | head -n20
0: \x80 PROTO 3
2: c GLOBAL 'ahocorasick Automaton'
25: q BINPUT 0
27: ( MARK
28: J BININT 168760016
33: C SHORT_BINBYTES b''
35: q BINPUT 1
37: K BININT1 2
39: K BININT1 2
41: J BININT 16182875
46: J BININT 16182874
51: M BININT2 310
54: ] EMPTY_LIST
55: q BINPUT 2
57: ( MARK
58: X BINUNICODE 'Q23600353'
72: q BINPUT 3
74: X BINUNICODE 'Q14877373'
88: q BINPUT 4
90: X BINUNICODE 'Q26446664'
Looks very similar to yours.
I'm using the latest stable version of python (3.5.2) that is distributed with Homebrew (the most popular package manager for Mac).
$ brew info python3
python3: stable 3.5.2 (bottled), devel 3.6.0rc1, HEAD
Interpreted, interactive, object-oriented programming language
https://www.python.org/
/usr/local/Cellar/python3/3.5.2 (3,664 files, 55.0M) *
Poured from bottle on 2016-07-07 at 20:30:19
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/python3.rb
$ python3.5
Python 3.5.2 (default, Jun 29 2016, 13:43:58)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
I've now tried with a couple of different files. First a new one generated with the updated script. It removes all empty labels:
$ python run_wikidata_search.py wikidata_automation_noempty.pickle "Belgium, Sweden and Poland are three fine countries"
Traceback (most recent call last):
  File "run_wikidata_search.py", line 17, in <module>
    main(filename_in, text)
  File "run_wikidata_search.py", line 11, in main
    automation = pickle.load(f)
ValueError: binary data truncated (3)
Same error, but with a (3) at the end instead of a (1) as before. When I try to run pickletools on this file I get:
$ python -m pickletools wikidata_automation_noempty.pickle
0: \x80 PROTO 3
2: c GLOBAL 'ahocorasick Automaton'
25: q BINPUT 0
27: ( MARK
28: J BININT 144852559
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/pickletools.py", line 2833, in <module>
    args.indentlevel, annotate)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/pickletools.py", line 2475, in dis
    print(line, file=out)
OSError: [Errno 22] Invalid argument
And it hangs a LONG time before outputting the OSError, which I think confirms that this file contains the large blob of data that should be there.
I've also tried with a much smaller wikidata-reduced file (only 10 lines), and everything works fine there. Inspecting that file with pickletools yields the correct results:
$ python -m pickletools wikidata_automation_mini.pickle
0: \x80 PROTO 3
2: c GLOBAL 'ahocorasick Automaton'
25: q BINPUT 0
27: ( MARK
28: M BININT2 302
31: B BINBYTES b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x00\x00\x00\x00"\x00\x00\x00\x00\x00\x00\x00$\x00\x00\x00\x00\x00\x00\x005\x00\x00\x00\x00\x00\x00\x00?\x00\x00\x00\x00\x00\x00\x00E\x00\x00\x00\x00\x00\x00\x00P\x00\x00\x00\x00\x00\x00\x00\\\x00\x00\x00\x00\x00\x00\x00a\x00\x00\x00\x00\x00\x00\x00j\x00\x00\x00\x00...
@EmilStenstrom Thank you very much for checking this. I have some vague ideas about the source of errors, but need to verify it. I haven't replicated your problems yet.
@EmilStenstrom Sorry for a stupid question, but: is your MacOS 64-bit?
@WojciechMula Yes. The processor is an "Intel Core i7" which is 64 bit, and the macOS version is Sierra which runs in 64 bit mode. Also, my python returns 64 bit:
$ python -c "import platform; print(platform.architecture())"
('64bit', '')
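As a side note, platform.architecture() can be misleading on some systems; two more direct probes of interpreter bitness (a small sketch, using only the standard library) are:

```python
import struct
import sys

print(sys.maxsize > 2**32)       # True only on a 64-bit interpreter
print(struct.calcsize("P") * 8)  # pointer width in bits: 32 or 64
```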
@EmilStenstrom Thank you, I suspected it might be somehow related to integer overflows. ATM I have no idea how to reproduce the error or what its cause might be.
Could you recompile the module with -fsanitize=address and -fsanitize=undefined? I think setting CFLAGS is sufficient, i.e.:
export CFLAGS="-fsanitize=address -fsanitize=undefined"
@EmilStenstrom I didn't forget about the problem, I just ran out of ideas.
Hi! I'm still planning to try the compile flags you suggested above, didn't have time! Maybe next week!
Here's the output after running with the CFLAGS you suggested:
$ python run_wikidata_build_automation.py wikidata-reduced.json wikidata_automation.pickle
==57697==ERROR: Interceptors are not working. This may be because AddressSanitizer is loaded too late (e.g. via dlopen). Please launch the executable with:
DYLD_INSERT_LIBRARIES=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/8.0.0/lib/darwin/libclang_rt.asan_osx_dynamic.dylib
==57697==AddressSanitizer CHECK failed: /Library/Caches/com.apple.xbs/Sources/clang_compiler_rt/clang-800.0.42.1/src/projects/compiler-rt/lib/sanitizer_common/sanitizer_mac.cc:690 "(("interceptors not installed" && 0)) != (0)" (0x0, 0x0)
<empty stack>
Abort trap: 6
$ export DYLD_INSERT_LIBRARIES=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/8.0.0/lib/darwin/libclang_rt.asan_osx_dynamic.dylib
$ python run_wikidata_build_automation.py wikidata-reduced.json wikidata_automation.pickle
Building automaton...
Building automaton, step 0...
Building automaton, step 100000...
Building automaton, step 200000...
...
Building automaton, step 16800000...
Time to make it...
Killed: 9
It takes all my RAM (16 Gb) for about an hour, and then gets killed.
I guess we won't get any further from here. I think I should try to solve my problem in another way. Instead of trying to build the trie in memory, I should persist it to disk in some sort of database optimized for this use case. Thank you for all your hard work!
@EmilStenstrom Thank you very much for your time and effort. I really want to fix that bug, but so far I couldn't. :(
As far as I understand your problem, you could try n-gram indexes. They let you narrow the searched space significantly, and they are not too complicated. I did some experiments with full-text search and the results were impressive.
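The n-gram suggestion can be sketched in a few lines (a minimal character-trigram index, not tied to any particular library): map each trigram to the set of titles containing it, then intersect the candidate sets at query time.

```python
from collections import defaultdict

def trigrams(s):
    """All 3-character substrings of s, lowercased."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(titles):
    index = defaultdict(set)
    for i, title in enumerate(titles):
        for g in trigrams(title):
            index[g].add(i)
    return index

def candidates(index, query):
    """Indices of titles containing every trigram of the query."""
    grams = trigrams(query)
    if not grams:
        return set()
    return set.intersection(*(index.get(g, set()) for g in grams))

titles = ["Belgium", "Sweden", "Poland"]
idx = build_index(titles)
print(candidates(idx, "Sweden"))  # {1}
```

Unlike the in-memory automaton, an index like this maps naturally onto an on-disk key-value store, since each trigram lookup is independent.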
Hi
I'm hitting something like this with Python 3.5 on Windows. It seems specific to Python 3.5, as I can read the same pickle in 3.4 without error and use the automaton.
I'll email you details of the files and load them into Dropbox for you to download. The dataset is much smaller than the one mentioned here (the pickle is only 17 MB).
It doesn't seem to matter whether the pickle is created in 3.4 or 3.5. The read issue only happens on Windows though.
Reading it on Linux returns an automaton with no words! Guess that is too much to hope for!
David, thank you very much, I will look closer at this. I've already downloaded the file.
Tested on Windows with Python 3.6 and got no error, so it looks like it's Windows Python 3.5 only.
That's great news. Thank you for checking this.
And thank you for the regression test.
@woakesd I installed all official versions: 3.5.0, 3.5.1, 3.5.2 and 3.5.3 -- and I was able to load the pickle file you shared with me. I tested also 3.4.4 and 3.6.0. The regression test also passes. Strange.
Which specific version of Python do you use?
I removed 3.5.3 and reinstalled everything, including version 1.1.5.dev1 built locally against 3.5.2.
The regression test failed still.
I uninstalled pyahocorasick and rebuilt against the installed version of python 3.5.3 and ran the test again and it works.
I still can't load the automaton.pickle file!
I've split the test into two files. Could you see if this still works for you? It doesn't here.
It works for me. I use MSVC 2015 to compile the extension and I'm on the repo's head (1.1.5.dev1 doesn't compile on Windows).
@woakesd I just committed some debug code in c79bd66246b07c6d120cce5f7817af8eb1f3817c, could you please check it out? On my machine I get the following output:
unpickle: 7 nodes
unpickle: node #1 at offset 0
unpickle: node #1.fail = 0
unpickle: node #1.letter = 0
unpickle: node #1.eow = 0
unpickle: node #1.n = 2
unpickle: node #1.next[0] = 2
unpickle: node #1.next[1] = 5
unpickle: node #2 at offset 40
unpickle: node #2.fail = 0
unpickle: node #2.letter = 97
unpickle: node #2.eow = 0
unpickle: node #2.n = 1
unpickle: node #2.next[0] = 3
unpickle: node #3 at offset 72
unpickle: node #3.fail = 0
unpickle: node #3.letter = 98
unpickle: node #3.eow = 0
unpickle: node #3.n = 1
unpickle: node #3.next[0] = 4
unpickle: node #4 at offset 104
unpickle: node #4.fail = 0
unpickle: node #4.letter = 99
unpickle: node #4.eow = 1
unpickle: node #4.n = 0
unpickle: node #5 at offset 128
unpickle: node #5.fail = 0
unpickle: node #5.letter = 100
unpickle: node #5.eow = 0
unpickle: node #5.n = 1
unpickle: node #5.next[0] = 6
unpickle: node #6 at offset 160
unpickle: node #6.fail = 0
unpickle: node #6.letter = 101
unpickle: node #6.eow = 0
unpickle: node #6.n = 1
unpickle: node #6.next[0] = 7
unpickle: node #7 at offset 192
unpickle: node #7.fail = 0
unpickle: node #7.letter = 102
unpickle: node #7.eow = 1
unpickle: node #7.n = 0
With the repo head in 3.5.3 I get no output apart from the error. I added one extra trace just to be sure, but it doesn't look like it gets as far as automaton_unpickle.
What is the output from this script on your machine?
from ahocorasick import Automaton
auto = Automaton()
auto.add_word('abc', 'abc')
auto.add_word('def', 'def')
x = auto.__reduce__()
print(x)
I got this:
(<class 'ahocorasick.Automaton'>, (7, b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00
\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00a\x00\x03\x00\x00\x00\x00\x00\x0
0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x0
0\x00\x00\x00b\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x
00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00c\x00\x00\x00\x00\x00\
x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00d\x00\x06
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x01\x00\x00\x00\x00\x00e\x00\x07\x00\x00\x00\x00\x00\x00\x00\x00\x0
0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x0
0f\x00', 1, 30, 100, 2, 2, 3, ['abc', 'def']))
I get the following:
(<class 'ahocorasick.Automaton'>, (7, b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x82\x01\x00\x00\x02\x00\x00\x00\x00\x00
\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x82\x01a\x00\x03\x00\x00\x00\x00\x00\x0
0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x0
0\x00\x82\x01b\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x
00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x83\x01c\x00\x00\x00\x00\x00\
x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x82\x01d\x00\x06
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x01\x00\x00\x00\x82\x01e\x00\x07\x00\x00\x00\x00\x00\x00\x00\x00\x0
0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x83\x0
1f\x00', 1, 30, 100, 2, 2, 3, ['abc', 'def']))
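The comparison being done by eye on the two dumps above can be automated with a small helper (a generic sketch, not part of pyahocorasick) that finds the first offset where two byte payloads diverge:

```python
def first_diff(a, b):
    """Index of the first differing byte, or None if a == b."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

# e.g. the working dump has b"\x00\x00" where the broken one has b"\x82\x01"
print(first_diff(b"\x02\x00\x00\x00", b"\x02\x00\x82\x01"))  # 2
```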
I deleted the comment where I thought it was working, because I had tested with 3.6, not 3.5.3. Sigh.
I added another trace which produces the following output:
size 216, count 7, sizeof(TrieNode) 32, sizeof(TrieNode*) 8
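Plugging the traced values into the bounds check from Automaton_pickle.c (the UNLIKELY condition quoted at the top of the thread) shows that this particular check should pass:

```python
# Values from the trace above.
size, count = 216, 7
sizeof_TrieNode, sizeof_TrieNode_ptr = 32, 8

# The check fails the load when size < count * (sizeof(TrieNode) - sizeof(TrieNode*)).
minimum = count * (sizeof_TrieNode - sizeof_TrieNode_ptr)
print(minimum)         # 168
print(size < minimum)  # False -> this check alone would not fire here
```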
Could you send me a wheel for 64-bit 3.5.3?
I'm exploring the idea that there is a build-configuration issue with my laptop.
I'm installing Visual Studio 2015 on another laptop just now to try it out on another machine.
I found out how to break it like this, and why I think it works for you!
Using 32 bit python 3.5.3 in Windows I can create and load the pickle no problem.
The 64 bit version is where the issue lies.
The pickle for 64-bit Windows is larger: 294 bytes instead of 214 bytes.
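One plausible explanation for the size difference (a hypothesis, not a confirmed diagnosis of this bug): 64-bit Windows uses the LLP64 model, where a C `long` stays 4 bytes while pointers are 8, whereas 64-bit Linux/macOS use LP64, where both are 8. A node structure whose serialized size depends on such platform-dependent field widths would then differ between platforms, which fits the 294- vs 214-byte pickles. The widths can be checked from Python:

```python
import ctypes

print(ctypes.sizeof(ctypes.c_long))    # 4 on 64-bit Windows, 8 on 64-bit Unix
print(ctypes.sizeof(ctypes.c_void_p))  # pointer size: 8 on any 64-bit build
```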
@woakesd I'm not on Windows right now, but I'm pretty sure that I have 64-bit versions of Python and that compilation also produces 64-bit binaries. But it might be a proper hint, thank you for checking.
I will send you my compiled modules tomorrow.
@EmilStenstrom do you mind trying with the latest release?
@pombredanne Sorry, I don't have any of the code left that I used for this. Since my use case was too big for RAM, I just decided to go another route...
I have the same problem. I create a large automaton (several gigabytes in memory), pickle it, and loading it from disk fails with: ValueError: binary data truncated (1)
Python 3.6.6 x64, Windows 10, installed with pip install pyahocorasick.
I cannot test whether 32-bit Python works, because the dataset is too large and I get a memory error (SystemError: <built-in method add_word of ahocorasick.Automaton object at 0x080305E0> returned NULL without setting an error).
@Dobatymo is it possible to somehow get the dataset you use? I'd love to finally fix the bug, but I'm not able to reproduce it on my own.
I use this one: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz. Tomorrow I can check exactly which version/date of the dump and give you the code to reproduce it.
Great! Thank you
Hah, nice! That’s the original dataset I used too. But I used the Swedish version of Wikipedia, not the English one. The idea was to quickly find all Wikipedia articles from a span of text.
Some thoughts:
1) Does the automaton get bigger than RAM? 2) Are there very long strings in Wikipedia that somehow throw this off? 3) Are there Unicode codepoints that mess things up?
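Thoughts 2) and 3) can be partially ruled out for pickle itself: plain pickle round-trips very long strings and astral-plane codepoints without trouble, so any breakage on those inputs would have to be in the extension's own serialization. A quick probe:

```python
import pickle

long_title = "x" * 1_000_000            # far longer than any real title
astral_title = "\U0001F600" * 100       # emoji outside the BMP

for value in (long_title, astral_title):
    assert pickle.loads(pickle.dumps(value, protocol=3)) == value
print("pickle round-trips OK")
```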
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)] on win32
import gzip, pickle
import ahocorasick

def read(wiki_titles):
    with gzip.open(wiki_titles, "rt", encoding="utf-8") as fr:
        for line in fr:
            yield line.strip()

def create_automaton(wiki_titles):
    a = ahocorasick.Automaton()
    for i, line in enumerate(read(wiki_titles)):
        a.add_word(line.lower(), i)
    a.make_automaton()
    return a

if __name__ == "__main__":
    # https://dumps.wikimedia.org/enwiki/20180701/enwiki-20180701-all-titles-in-ns0.gz
    wiki_path = "enwiki-20180701-all-titles-in-ns0.gz"
    pickle_path = "enwiki.p"

    with open(pickle_path, "wb") as fw:
        a = create_automaton(wiki_path)
        pickle.dump(a, fw)

    del a

    with open(pickle_path, "rb") as fr:
        a = pickle.load(fr)
Traceback (most recent call last):
  File "...\test.py", line 30, in <module>
    a = pickle.load(fr)
ValueError: binary data truncated (1)
Memory usage maxes out at 10.75 GB (just small enough to work on my 16 GB machine).
@WojciechMula Maybe it's time to close this bug, until someone sees this problem with the latest version of the code and Python 3.6+?
I've managed to create an automaton, and then pickle that automaton to a 286 MB pickle file. Problem is, when I try to unpickle it, I get this error:
The source of that error is here: https://github.com/WojciechMula/pyahocorasick/blob/master/Automaton_pickle.c#L309
Would you mind helping me troubleshoot this? Any ideas? I don't think I can send files this big to you.
Update: This is how I build the pickle file:
Where the generator just runs yield ("Belgium", "Q31").
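The original snippet did not survive in this excerpt; a hypothetical reconstruction of such a build step, assuming the pyahocorasick API used elsewhere in the thread (add each (key, value) pair from a generator, finalize, then pickle), might look like:

```python
import pickle

def generator():
    # Illustrative stand-in for the real data source.
    yield ("Belgium", "Q31")

def build_pickle(path):
    import ahocorasick  # requires the pyahocorasick package
    automaton = ahocorasick.Automaton()
    for key, value in generator():
        automaton.add_word(key, value)
    automaton.make_automaton()
    with open(path, "wb") as fw:
        pickle.dump(automaton, fw)
```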