WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License

Killed while creating huge automaton #49

Closed by EmilStenstrom 7 years ago

EmilStenstrom commented 7 years ago

I'm trying to create a huge automaton based on a data dump from Wikidata. I have a dictionary with 14,234,049 keys, which correspond to all English and Swedish labels of all entities there.

Here's my code:

dictionary = {"Belgium": [{"en": "Q31"}], ... }  # HUGE dictionary with over 14 million items
automation = ahocorasick.Automaton()
for word, matches in dictionary.items():
    automation.add_word(word, (word, matches))
automation.make_automaton()

The output I get is simply "Killed: 9".

I'm running Python 3.5 on macOS 10.12.1, on a MacBook Pro with 16 GB of RAM.

WojciechMula commented 7 years ago

Could you locate when your script is killed? I mean, is it killed while you call add_word, or during make_automaton, after the loop has finished?
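
For example, you could log peak memory at each phase with the standard-library resource module. This is a minimal, Unix-only sketch; the two-entry placeholder dictionary stands in for your real data:

import resource

import ahocorasick

def log_peak_rss(phase):
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("{}: peak RSS = {}".format(phase, peak))

# Placeholder standing in for the real 14-million-key Wikidata dictionary
dictionary = {"Belgium": [{"en": "Q31"}], "Sweden": [{"en": "Q34"}]}

automaton = ahocorasick.Automaton()
log_peak_rss("before add_word loop")
for word, matches in dictionary.items():
    automaton.add_word(word, (word, matches))
log_peak_rss("after add_word loop")

automaton.make_automaton()
log_peak_rss("after make_automaton")

Whichever phase is the last one to print before the kill is the one exhausting memory.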

EmilStenstrom commented 7 years ago

Thanks for the quick reply (I really appreciate all your hard work on this library!).

Unfortunately, a rerun of my script takes over 4 hours before it gets killed, so it will take a while to find out whether this happens in the add_word or the make_automaton phase. I'll see if I can create a test case that reproduces this with dummy data.

EmilStenstrom commented 7 years ago

Here's a script that reproduces this for me:

import ahocorasick

print("Generating dictionary...")
# A generator of 100 million (word, value) pairs sharing a long prefix
dictionary = (("longstring" * 10 + str(a), "longstring") for a in range(100000000))

print("Adding words to automaton...")
automaton = ahocorasick.Automaton()
for i, (word, matches) in enumerate(dictionary):
    if i != 0 and i % 100000 == 0:
        print("Processed {} items...".format(i))
    automaton.add_word(word, (word, matches))

print("Creating automaton...")
automaton.make_automaton()

For me, this script gets killed like this, during the add_word phase:

Processed 83600000 items...
Processed 83700000 items...
Processed 83800000 items...
Processed 83900000 items...
Processed 84000000 items...
Killed: 9

Is this even something that is solvable?

WojciechMula commented 7 years ago

Emil, sorry for the inconvenience. And thank you very much for the script; it's extremely helpful. I will try to debug the problem in the next few days.

WojciechMula commented 7 years ago

@EmilStenstrom I quickly checked the script, and my first conclusion is that it is simply running out of memory. The application is not killed due to an access violation (SIGSEGV would be sent in that case). There are two possibilities: either there is a nasty memory leak, or we have reached the limitations of the data structure.
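
One quick way to check the memory theory is to build a fraction of the data and extrapolate from the trie statistics. A rough sketch, assuming the Automaton.get_stats() method documented in the README (the exact keys may differ between versions):

import ahocorasick

# Build 1% of the failing 100-million-word run and extrapolate
automaton = ahocorasick.Automaton()
for a in range(1000000):
    word = "longstring" * 10 + str(a)
    automaton.add_word(word, (word, "longstring"))

stats = automaton.get_stats()
print(stats)
# Shared prefixes mean growth is not perfectly linear, so this is only
# a ballpark figure
print("extrapolated trie size: ~{:.1f} GB".format(stats["total_size"] * 100 / 1e9))

If the extrapolated figure exceeds your 16 GB of RAM, that would explain the kill without any leak being involved.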

EmilStenstrom commented 7 years ago

I think you are right. I guess my best bet is to see if I can minimize the amount of data that is stored in the automaton. I'll try that, and maybe think of ways to split the data into smaller chunks and run several searches over them instead (see the sketch below).

(Don't be sorry, you are a hero for doing all this work in your free time and releasing it to the public!)
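
Something like this is what I have in mind; a rough sketch with hypothetical helper names, using the iter() search API from the README:

import ahocorasick

def build_automata(items, chunks=4):
    # Spread the (word, value) pairs across several smaller automata so
    # that no single trie has to hold all 14 million keys at once
    automata = [ahocorasick.Automaton() for _ in range(chunks)]
    for i, (word, value) in enumerate(items):
        automata[i % chunks].add_word(word, (word, value))
    for automaton in automata:
        automaton.make_automaton()
    return automata

def search_all(automata, text):
    # One pass over the text per chunk; matches are yielded as found
    for automaton in automata:
        for end_index, (word, value) in automaton.iter(text):
            yield end_index, word, value

# Toy usage:
automata = build_automata([("Belgium", [{"en": "Q31"}]), ("Sweden", [{"en": "Q34"}])], chunks=2)
for end_index, word, value in search_all(automata, "Belgium borders the Netherlands"):
    print(end_index, word, value)

The trade-off is that searching takes one pass per chunk. Storing a small integer index into an external list, instead of the full (word, matches) tuple, should also cut the per-key payload.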

EmilStenstrom commented 7 years ago

[This post was moved to a separate issue: #50]

WojciechMula commented 7 years ago

@EmilStenstrom Thank you, I think I can reproduce the bug on artificially generated data.

EmilStenstrom commented 7 years ago

That's great to hear! Let me know if you need anything else from me. I'm afraid I can't be of much use when it comes to C code :/ Anything else, and I'll be happy to help!

WojciechMula commented 7 years ago

@EmilStenstrom I tried different data patterns but couldn't cause a segfault. Could you please share the input data you use? Or, better, a procedure to recreate your data set.

EmilStenstrom commented 7 years ago

Sorry for confusing two different issues. I've opened a new issue for the pickle problem, and will be closing this one.

Here's the new issue: https://github.com/WojciechMula/pyahocorasick/issues/50