SamuraiT / mecab-python3

:snake: mecab-python. you can find original version here:http://taku910.github.io/mecab/
https://pypi.python.org/pypi/mecab-python3
Other
541 stars 51 forks source link

Make node.surface return just the token (fixes #19) #22

Closed polm closed 5 years ago

polm commented 5 years ago

For details on the fix see #19, particularly comments by @tomoyukih-fw.

polm commented 5 years ago

Thank you very much for your quick reply.

Good point about leaking memory, but my understanding is that making sure the new string gets cleaned up is the purpose of the newobject directive:

If you wrap the function blah(), SWIG has no idea that the return value is a newly allocated object. As a result, the resulting extension module may produce a memory leak (SWIG is conservative and will never delete objects unless it knows for certain that the returned object was newly created). ... When %newobject is supplied, many language modules will arrange to take ownership of the return value. This allows the value to be automatically garbage-collected when it is no longer in use. However, this depends entirely on the target language (a language module may also choose to ignore the %newobject directive).

So I think this is OK. I can write a test or check more closely if you like though.

No worries about looking at it this weekend or later; since the old version works this isn't urgent for me by any means. Thanks again for your quick reply.

zackw commented 5 years ago

If you can somehow confirm that %newobject does cause the intermediate C-string to be deallocated, that would satisfy me for now. I would like to ask for a test program but we don't actually have a test suite right now (which is a problem in itself, but not one we need to address at the same time as this regression).

polm commented 5 years ago

Would showing a loop with the same operation has constant memory usage be adequate? This seems to work:

import MeCab
tokenizer = MeCab.Tagger()

import os
import psutil # note: needs to be installed
process = psutil.Process(os.getpid())

for ii in range(100000):
    node = tokenizer.parseToNode('日本語ですよ').next
    a = node.surface

    if (ii + 1) % 10000 == 0:
        # This should print memory usage 
        print(process.memory_info().rss)
        # if there is no memory leak the number should not change

I need to go to bed now, but if this isn't adequate I could look into hunting down the implementation of newobject in swig - otherwise not really sure how to prove the pointer is deallocated.

zackw commented 5 years ago

That will do; I can run that program under valgrind or something for an even more stringent test. Thanks. Again, not going to actually merge the PR until I have time to dig into the changes a bit more myself, but I think we're on the right track.

zackw commented 5 years ago

I have now thoroughly verified that this code does not produce a memory leak. Also, on further investigation, this code (with trivial formatting changes) is in the latest version of the upstream MeCab.i, so clearly it is the intended API.

I've merged your PR and then committed a patch on top of it to make the formatting match the upstream file. There will be a new release on PyPI Real Soon Now (I need to sort out automatic building of binary wheels, first).