I can confirm this memory leak. It appears to be related to the `parseToNode` method.
This code leaks:

```python
import MeCab

tagger = MeCab.Tagger()
node = tagger.parseToNode("text")
while node:
    node = node.next
```
This code does not leak:

```python
import MeCab

tagger = MeCab.Tagger()
parse_result = tagger.parse("text")
for parse_line in parse_result.split("\n"):
    if parse_line == "EOS":
        break
```
I can confirm this bug exists. When adding a patch for 0.996.1 we checked to avoid memory leaks (see here), so this was a bit surprising. My linked code for checking memory usage shows no growth with 0.996.1 but does show growth (a leak) with 0.996.2.
I changed the example code above a little: I moved the tagger creation out of the loop and changed it to print memory usage. We should check the tagger itself later, but I wanted to narrow down the issue first. Here's my code:
```python
import MeCab
import os
import psutil  # note: needs to be installed

process = psutil.Process(os.getpid())

tagger = MeCab.Tagger()
tagger.parse('')  # Hack for mecab-python3 0.7


def main():
    with open('kokoro.utf8.txt') as f:
        text = f.read()

    for i in range(10000):
        if i % 10 == 0:
            print(i, process.memory_info().rss)
        parse(tagger, text)


def parse(tagger, text):
    node = tagger.parseToNode(text)
    while node:
        node = node.next


if __name__ == "__main__":
    main()
```
Here's the start of the output with 0.7:

```
0 15724544
10 273051648
20 278630400
30 284209152
40 289787904
50 295366656
60 300945408
70 306524160
80 312107008
90 317685760
100 323264512
```
With 0.996.1:

```
0 15568896
10 272306176
20 277884928
30 283463680
40 289046528
50 294625280
60 300204032
70 305782784
80 311361536
90 316940288
100 322523136
110 328101888
```
With 0.996.2:

```
0 18280448
10 236933120
20 313737216
30 390541312
40 467345408
50 544145408
60 620949504
70 697757696
80 774561792
90 851361792
100 928165888
```
So for all three versions memory use increases over time, though it's noticeably faster for 0.996.2. I'm not sure what causes the memory increase in 0.996.1...
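One way to narrow down where the remaining growth comes from is to compare the RSS figure with what Python's own allocator reports. Below is a minimal sketch of that comparison, under the same assumptions as the script above (kokoro.utf8.txt present, psutil installed): if the tracemalloc number stays flat while RSS keeps climbing, the extra memory is being allocated on the native C++ side rather than for Python objects.

```python
import os
import tracemalloc

import MeCab
import psutil  # needs to be installed

process = psutil.Process(os.getpid())
tagger = MeCab.Tagger()

with open('kokoro.utf8.txt') as f:
    text = f.read()

tracemalloc.start()
for i in range(200):
    node = tagger.parseToNode(text)
    while node:
        node = node.next
    if i % 10 == 0:
        traced_current, _traced_peak = tracemalloc.get_traced_memory()
        # RSS covers the whole process; tracemalloc only covers memory
        # allocated through Python's allocator. A gap that keeps widening
        # points at allocations made directly by the C++ library.
        print(i, process.memory_info().rss, traced_current)
```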
I'll look into the cause of this in more detail later, thanks for pointing it out!
I am still looking at this but unfortunately don't understand how to fix it. If anyone with a good knowledge of SWIG would like to step in it'd be much appreciated.
Although the SWIG code isn't ideal I don't think it's the problem here. `%newobject` is adding a `delete[]` call in `_wrap_Node_surface_get`, which looks correct. The sample code is valgrind clean (in so far as there is nothing reported as definitely lost) too.
I spent a while looking at the `Node` objects, as they don't get deleted after Python is finished with them, but the documentation for `parseToNode` says that's correct and I have no reason to doubt it.
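For reference, this is the usual way the nodes get consumed, copying what's needed into plain Python objects while iterating; just an illustrative sketch of the documented usage pattern, not a fix for the leak:

```python
import MeCab

tagger = MeCab.Tagger()

def tokenize(text):
    # The underlying node structures are owned by the tagger's lattice,
    # so we only read from them and copy the values we want into
    # ordinary Python objects.
    tokens = []
    node = tagger.parseToNode(text)
    while node:
        if node.surface:  # BOS/EOS nodes have an empty surface
            tokens.append((node.surface, node.feature))
        node = node.next
    return tokens

print(tokenize("すもももももももものうち"))
```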
My best guess right now would be that there's some horrific heap fragmentation happening that's hurting lots, but I have no evidence to support that theory.
@fflexo @polm I have found the issue. See PR #37. @SamuraiT could you please merge it and update the package? Thanks!
@akorobko Thanks so much for the fix!
To be honest after this bug I got fed up with SWIG and wrote a Cython wrapper for MeCab called fugashi to avoid dealing with it.
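For anyone curious, iterating over tokens with fugashi looks roughly like this (a sketch assuming fugashi and a UniDic dictionary such as unidic-lite are installed; the exact feature fields depend on the dictionary):

```python
from fugashi import Tagger

tagger = Tagger()
for word in tagger("麩菓子は、麩を主材料とした日本の菓子。"):
    # Each word wraps a single token; surface and feature are plain
    # Python values, so there is no SWIG node lifetime to worry about.
    print(word.surface, word.feature)
```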
Should be fixed by #37.
When parsing a large text many times with mecab-python3 0.996.2, the Python process consumes too much memory and eventually slows down dramatically.
For example, after about 200 iterations of the loop in the following code, the Python process consumes 80% of the machine's memory. The consumed memory does not seem to be released, and the process stops making progress.
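(The exact script from the report isn't included here; the snippet below is a minimal sketch of the kind of loop being described, assuming kokoro.utf8.txt is in the working directory.)

```python
import MeCab

tagger = MeCab.Tagger()

with open('kokoro.utf8.txt') as f:
    text = f.read()

for i in range(1000):
    # With 0.996.2 the process's resident memory grows on every pass
    # and is never released.
    node = tagger.parseToNode(text)
    while node:
        node = node.next
    print(i)
```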
`kokoro.utf8.txt` is available at: https://gist.github.com/orangain/d48998bf258d2118d0a6589d06ed7e98
P.S. I really appreciate your effort to provide wheels including the fixes for #3 and #19.