Closed — EmilStenstrom closed this issue 7 years ago
Could you locate where your script is killed? I mean, is it killed while you call add_word, or during make_automaton, after the loop finishes?
Thanks for the quick reply (I really appreciate all your hard work in this library!).
Unfortunately a rerun of my script takes over 4 hours and then gets killed, so it will take a while to find out whether this happens in the add_word or the make_automaton phase. I'll see if I can create a test case that reproduces this with dummy data.
Here's a script that reproduces this for me:
import ahocorasick

print("Generating dictionary...")
dictionary = (("longstring" * 10 + str(a), "longstring") for a in range(100000000))

print("Adding words to automaton...")
automaton = ahocorasick.Automaton()
for i, (word, matches) in enumerate(dictionary):
    if i != 0 and i % 100000 == 0:
        print("Processed {} items...".format(i))
    automaton.add_word(word, (word, matches))

print("Creating automaton...")
automaton.make_automaton()
For me, this script gets killed like this, at the "add_word" phase:
Processed 83600000 items...
Processed 83700000 items...
Processed 83800000 items...
Processed 83900000 items...
Processed 84000000 items...
Killed: 9
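[Editor's note: a rough back-of-envelope, not from the thread, suggests why this can exhaust 16 GB of RAM: the Python payload objects alone approach that figure at 84 million entries, before counting any trie nodes. A sketch using sys.getsizeof (shallow sizes; exact per-object overheads vary across CPython versions):]

```python
import sys

# A representative key from the reproduction script: 100 chars plus a counter.
word = "longstring" * 10 + "84000000"
payload = (word, "longstring")  # the value passed to add_word

# Shallow size of the key string plus the payload tuple. The second
# string is a shared literal, so it is not counted once per item.
per_item = sys.getsizeof(word) + sys.getsizeof(payload)
total_gb = 84_000_000 * per_item / 1024**3
print("~{} bytes per item, ~{:.1f} GB total".format(per_item, total_gb))
```

On a 64-bit CPython this comes out to well over 10 GB for the payloads alone, so hitting the 16 GB ceiling during add_word is plausible.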
Is this even something that is solvable?
Emil, sorry for all the inconvenience. And thank you very much for the script, it's extremely helpful. I will try to debug the problem in the next few days.
@EmilStenstrom I quickly checked the script and my first conclusion is that it simply runs out of memory. The application is not killed due to an access violation (in that case SIGSEGV would be sent). There are two possibilities: either there is a nasty memory leak, or we have reached the limits of the data structure.
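[Editor's note: to confirm the out-of-memory theory on a long run without rerunning it blind, one option is to log the process's peak resident set size alongside the progress counter. A sketch using only the standard library (`resource` is Unix-only; note that ru_maxrss is reported in bytes on macOS but in kilobytes on Linux):]

```python
import resource
import sys

def peak_rss_mb():
    """Return this process's peak resident set size in MiB.

    ru_maxrss is in bytes on macOS ("darwin") and in kilobytes on Linux.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024

# Inside the add_word loop one could then log, for example:
#   if i % 100000 == 0:
#       print("Processed {} items, peak RSS {:.0f} MiB".format(i, peak_rss_mb()))
```

If the logged figure climbs steadily toward the machine's RAM before the kill, that points at memory exhaustion rather than a crash.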
I think you are right. I guess my best bet is to see if I can minimize the amount of data stored in the automaton. I'll try that, and maybe also think of ways to split the data into smaller chunks and run several searches over them instead.
(Don't be sorry, you are a hero for doing all this work in your free time and releasing it to the public!)
[This post was moved to a separate issue: #50]
@EmilStenstrom Thank you, I think I can reproduce the bug on artificially generated data.
That's great to hear! Let me know if you need anything else from me. I'm afraid I can't be of much use when it comes to C code :/ Anything else and I'll be happy to help!
@EmilStenstrom I tried different data patterns, but couldn't cause a segfault. Could you please share with me the input data you use? Or better, a procedure to recreate your data set.
Sorry for confusing two different issues. I've opened a new issue for the pickle problem, and will be closing this one.
Here's the new issue: https://github.com/WojciechMula/pyahocorasick/issues/50
I'm trying to create a huge automaton based on a data dump from Wikidata. I have a dictionary with 14234049 keys, which correspond to all English and Swedish labels of all entities there.
Here's my code:
The output I get is simply "Killed: 9".
I'm running Python 3.5 on macOS 10.12.1, on a MacBook Pro with 16 GB of RAM.