WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License

Memory leaks when unpickling Automaton #62

Closed (blackelk closed this issue 5 years ago)

blackelk commented 7 years ago

How to reproduce:

import pickle
import time

import ahocorasick
import psutil
import requests

automaton = ahocorasick.Automaton()

r = requests.get('https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm')
assert r.ok

for word in r.text.split():
    if word.isalpha():
        automaton.add_word(word.encode('utf8').lower(), word)

automaton.make_automaton()
pickled = pickle.dumps(automaton)

print('Cycles    Free MiB')
for i in range(10000):
    if i % 1000 == 0:
        free = psutil.virtual_memory().free
        print('{:05d}     {}'.format(i, free/1000000))
    unpickled = pickle.loads(pickled)
    time.sleep(0.001)
# Tested on environment:
#
# pyahocorasick==1.1.4
# Python 2.7.9-1
# Linux 3.16.0-4-amd64 Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux

# Results:
#
# Cycles    Free MiB
# 00000     2068
# 01000     2026
# 02000     1984
# 03000     1942
# 04000     1900
# 05000     1858
# 06000     1816
# 07000     1774
# 08000     1732
# 09000     1690

The same happens on Python 3; just remove .encode('utf8').

pombredanne commented 7 years ago

@blackelk Thanks! Do you think this could be linked to the Python version in any way?

WojciechMula commented 7 years ago

@blackelk Thank you, I will look at this.

WojciechMula commented 7 years ago

@blackelk I've spent two hours trying to figure out the source of the leak, but couldn't find it.

BTW, I think using the virtual memory figures from /proc as a reference is not ideal and might be misleading, because the Python runtime can allocate memory for its own purposes. When gc.collect() is run from time to time, the memory usage reported by the script you wrote is not as large. However, there is still some leak. I'll keep looking for it.
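
One way to act on the measurement concern above is to sample the memory of the process itself, after a forced collection, instead of the system-wide free memory. The following is only a minimal sketch under stated assumptions: the helper names peak_rss() and measure_growth() are hypothetical, it uses the standard-library resource module (Unix only), and the payload in the usage lines is a stand-in list rather than a pickled Automaton.

```python
import gc
import pickle
import resource


def peak_rss():
    """Peak resident set size of this process.

    The unit is platform dependent: KiB on Linux, bytes on macOS.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def measure_growth(payload, cycles=10000, report_every=1000):
    """Unpickle `payload` repeatedly and sample peak RSS.

    gc.collect() runs before every sample so that garbage still
    awaiting collection is not counted as a leak.
    Returns a list of (cycle, peak_rss) tuples.
    """
    samples = []
    for i in range(cycles):
        if i % report_every == 0:
            gc.collect()
            samples.append((i, peak_rss()))
        obj = pickle.loads(payload)
        del obj  # drop the only reference before the next cycle
    return samples


if __name__ == '__main__':
    # Stand-in payload (assumption); in the scenario from this issue
    # it would be pickle.dumps(automaton).
    payload = pickle.dumps(list(range(1000)))
    for cycle, rss in measure_growth(payload, cycles=5000):
        print('{:05d}     {}'.format(cycle, rss))
```

Because ru_maxrss is a peak value it never decreases, so a flat series suggests no growth, while a steady climb across samples points at memory that is not being released.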