WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License

memory leak in 1.4.0 #135

Open WojciechMula opened 3 years ago

WojciechMula commented 3 years ago

Hi, I'm sorry for opening such an old issue, but I'm currently experiencing the same problem. I'm using version 1.4.0 now and getting small, steady memory leaks (found by debugging with tracemalloc) on:

A = ahocorasick.Automaton()
MyList = [...]
for x in MyList:
    A.add_word(y, (y, z))

Is there a chance this bug has returned?

Thanks, Eden.

Originally posted by @EdenAzulay in https://github.com/WojciechMula/pyahocorasick/issues/81#issuecomment-705534956
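For readers who want to check a suspected leak the same way, here is a minimal sketch of a tracemalloc comparison. The helper name `measure_growth` and its parameters are hypothetical and not part of pyahocorasick. Note that tracemalloc only sees memory requested through Python's own allocators, so memory a C extension obtains with plain `malloc` would not show up here.

```python
import gc
import tracemalloc


def measure_growth(func, iterations=1000, warmup=100):
    """Call func repeatedly and report how much traced memory was retained.

    Returns (total_bytes, top_stats), where top_stats lists up to five
    source lines with the largest change in allocated size.
    """
    # Warm up first so one-time caches are not counted as growth.
    for _ in range(warmup):
        func()
    gc.collect()

    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        for _ in range(iterations):
            func()
        # Collect garbage so only memory that is still referenced remains.
        gc.collect()
        after = tracemalloc.take_snapshot()
    finally:
        # Assumption: nothing else in the process relies on tracing.
        tracemalloc.stop()

    # compare_to sorts by the absolute size difference, largest first.
    stats = after.compare_to(before, "lineno")
    total = sum(stat.size_diff for stat in stats)
    return total, stats[:5]
```

Passing a function that wraps the `add_word` loop above and comparing the result for different iteration counts gives a rough idea of whether retained memory scales with the number of calls.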

AlonSh commented 3 years ago

I think that I'm also experiencing the same memory leak on add_word. Would love to see any updates :)

Edit - I was experiencing a different memory leak. Mine originated from using a multiprocessing Pool and passing the ahocorasick automaton between workers. I think there's some issue with serialization that prevents old objects from being cleaned up.

WojciechMula commented 3 years ago

@AlonSh could you please provide some minimal example?

AlonSh commented 3 years ago

Yeah: create some automaton, create a multiprocessing Pool, and do:

pool.apply_async(
    run_automaton,
    (automaton, text),
    callback=callback_success,
    error_callback=_my_error_callback,
)

and you'll see your memory exploding after some calls.

WojciechMula commented 3 years ago

Great! Thank you.

pombredanne commented 2 years ago

I am pushing tests to run on the CI on many Linux ... but while I can make them fail locally on Ubuntu 16, the tests seem to pass on more recent Linux. I wonder whether this depends on a certain version of the compiler? Otherwise, this is a head-scratcher.

@AlonSh FWIW, I recycle processes after 1,000 calls in my pools to cope with leaks. Not perfect, but a workaround at least. See for instance https://github.com/nexB/scancode-toolkit/blob/e080f8354bed5813df9b619efe575ce9931a5a5b/src/scancode/cli.py#L1209
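The recycling workaround can be sketched with the standard library's `maxtasksperchild` option on `multiprocessing.Pool`. This is only an illustration of the idea, not the code from the linked scancode-toolkit file; the helper name `run_with_recycling` and its parameters are hypothetical.

```python
import multiprocessing


def run_with_recycling(func, items, processes=2, max_tasks_per_child=1000,
                       context=None):
    """Map func over items in a pool whose workers are replaced periodically.

    func must be picklable, for example a module-level function.
    """
    # Use the caller's context if given, otherwise the platform default.
    ctx = context or multiprocessing.get_context()
    # maxtasksperchild makes each worker exit after that many tasks; the pool
    # then starts a fresh process, so memory leaked by the old one is freed.
    with ctx.Pool(processes=processes,
                  maxtasksperchild=max_tasks_per_child) as pool:
        # chunksize=1 so that every item counts as one task toward the limit.
        return pool.map(func, items, chunksize=1)
```

On platforms that start workers with the spawn method, call the helper from under an `if __name__ == "__main__":` guard. Recycling only bounds leaks that accumulate inside the workers; it does not help if memory grows in the parent process.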

Azzonith commented 1 year ago

Hello guys,

Is there any update on this issue? I tested library version 2.0.0 today, and memory consumption grew every time the automaton was used in ProcessPoolExecutor futures. We had to stop the service after RAM consumption crossed 140 GB. I tried building images FROM ubuntu:20.04, python:3.8, and python:3.10, and the issue reproduced every time. The latest usable library version for us remains 1.1.8. Please let me know if there is any troubleshooting info I could provide to help the investigation.