SamuraiT / mecab-python3

:snake: mecab-python. You can find the original version here: http://taku910.github.io/mecab/
https://pypi.python.org/pypi/mecab-python3

Memory leak in mecab-python3 0.996.2? #34

Closed. orangain closed this issue 4 years ago

orangain commented 5 years ago

When parsing a large text many times with mecab-python3 0.996.2, the Python process consumes too much memory. Eventually, the process slows down significantly.

For example, after about 200 iterations of the loop in the following code, the Python process consumes 80% of the machine's memory. The consumed memory does not seem to be released, and the process stops making progress entirely.

import MeCab

def main():
    with open('kokoro.utf8.txt') as f:
        text = f.read()

    tagger = MeCab.Tagger()
    tagger.parse('')  # Hack for mecab-python3 0.7
    for i in range(10000):
        if i % 10 == 0:
            print(i)

        parse(tagger, text)

def parse(tagger, text):
    node = tagger.parseToNode(text)
    while node:
        node = node.next

if __name__ == "__main__":
    main()

kokoro.utf8.txt is available at: https://gist.github.com/orangain/d48998bf258d2118d0a6589d06ed7e98

Environment

Note that:


P.S. I really appreciate your effort to provide wheels, including the fixes for #3 and #19.

Mikep86 commented 5 years ago

I can confirm this memory leak. It appears to be related to the parseToNode method.

This code leaks:

import MeCab
tagger = MeCab.Tagger()
node = tagger.parseToNode("text")
while node:
    node = node.next

This code does not leak:

import MeCab
tagger = MeCab.Tagger()
parse_result = tagger.parse("text")
for parse_line in parse_result.split("\n"):
    if parse_line == "EOS":
        break
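For reference, here is a sketch (mine, not from the original comment) of extracting tokens from the parse() output along the same lines, assuming MeCab's default output format of tab-separated surface and feature fields terminated by an EOS line:

import MeCab

tagger = MeCab.Tagger()

def tokenize(text):
    # Parse to a string and split it, avoiding parseToNode entirely.
    # Assumes the default output: one "surface\tfeatures" line per token,
    # terminated by an "EOS" line.
    tokens = []
    for line in tagger.parse(text).splitlines():
        if line == "EOS" or not line:
            break
        surface, _, features = line.partition("\t")
        tokens.append((surface, features.split(",")))
    return tokens

print(tokenize("すもももももももものうち"))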

Environment

polm commented 5 years ago

I can confirm this bug exists. When adding a patch for 0.996.1 we checked for memory leaks (see here), so this was a bit surprising. My linked code for checking memory usage shows no growth with 0.996.1 but does show growth (a leak) with 0.996.2.

I changed the example code above a little: I moved the tagger creation out of the loop and changed it to print memory usage. Tagger creation should also be checked later, but I wanted to narrow down the issue first. Here's my code:

import MeCab
import os
import psutil # note: needs to be installed
process = psutil.Process(os.getpid())

tagger = MeCab.Tagger()
tagger.parse('')  # Hack for mecab-python3 0.7

def main():
    with open('kokoro.utf8.txt') as f:
        text = f.read()

    for i in range(10000):
        if i % 10 == 0:
            print(i, process.memory_info().rss)

        parse(tagger, text)

def parse(tagger, text):
    node = tagger.parseToNode(text)
    while node:
        node = node.next

if __name__ == "__main__":
    main()

Here's the start of output with 0.7:

0 15724544
10 273051648
20 278630400
30 284209152
40 289787904
50 295366656
60 300945408
70 306524160
80 312107008
90 317685760
100 323264512

with 0.996.1:

0 15568896
10 272306176
20 277884928
30 283463680
40 289046528
50 294625280
60 300204032
70 305782784
80 311361536
90 316940288
100 322523136
110 328101888

with 0.996.2:

0 18280448
10 236933120
20 313737216
30 390541312
40 467345408
50 544145408
60 620949504
70 697757696
80 774561792
90 851361792
100 928165888

So memory use increases over time for all three versions, though noticeably faster for 0.996.2. I am not sure what causes the memory increase in 0.996.1...
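To put rough numbers on that (my own back-of-the-envelope calculation from the RSS samples above):

# Per-iteration RSS growth estimated from the samples printed above
# (values in bytes, sampled every 10 iterations).
growth_0996_1 = (328101888 - 272306176) / 100  # iterations 10 -> 110
growth_0996_2 = (928165888 - 236933120) / 90   # iterations 10 -> 100

print(f"0.996.1: ~{growth_0996_1 / 1e6:.1f} MB per parse")  # ~0.6 MB
print(f"0.996.2: ~{growth_0996_2 / 1e6:.1f} MB per parse")  # ~7.7 MB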

I'll look into the cause of this in more detail later, thanks for pointing it out!

polm commented 5 years ago

I am still looking at this but unfortunately don't understand how to fix it. If anyone with good knowledge of SWIG would like to step in, it would be much appreciated.

fflexo commented 5 years ago

Although the SWIG code isn't ideal, I don't think it's the problem here. %newobject adds a delete[] call in _wrap_Node_surface_get, which looks correct. The sample code is also valgrind clean (insofar as nothing is reported as definitely lost).

I spent a while looking at the Node objects, as they don't get deleted after Python is finished with them, but the documentation for parseToNode says that's correct and I have no reason to doubt it.

My best guess right now is that some severe heap fragmentation is happening and hurting us badly, but I have no evidence to support that theory.
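One way to probe that theory (my own sketch, not something reported in the thread) is to compare RSS with the allocations Python itself can see: tracemalloc only tracks allocations made through CPython's allocator, so flat tracemalloc numbers next to a growing RSS would point at the native C++/SWIG side (a leak or fragmentation) rather than at Python objects:

import os
import tracemalloc
import MeCab
import psutil  # needs to be installed, as in polm's script above

process = psutil.Process(os.getpid())
tagger = MeCab.Tagger()
text = "すもももももももものうち" * 1000

tracemalloc.start()
for i in range(200):
    node = tagger.parseToNode(text)
    while node:
        node = node.next
    if i % 50 == 0:
        traced, _peak = tracemalloc.get_traced_memory()
        print(i, "rss:", process.memory_info().rss, "python-traced:", traced)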

akorobko commented 4 years ago

@fflexo @polm I have found the issue; see PR #37. @SamuraiT could you please merge it and update the package? Thanks!

polm commented 4 years ago

@akorobko Thanks so much for the fix!

To be honest, after this bug I got fed up with SWIG and wrote a Cython wrapper for MeCab called fugashi to avoid dealing with it.
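For anyone curious, the equivalent loop with fugashi looks roughly like this (a sketch based on fugashi's documented API; it assumes a dictionary package such as unidic-lite is installed):

from fugashi import Tagger  # pip install fugashi unidic-lite

tagger = Tagger()
for word in tagger("すもももももももものうち"):
    # Each token exposes the surface form; .pos gives the part of speech.
    print(word.surface, word.pos)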

zackw commented 4 years ago

Should be fixed by #37.