SamuraiT / mecab-python3

:snake: mecab-python. you can find original version here:http://taku910.github.io/mecab/
https://pypi.python.org/pypi/mecab-python3
Other
541 stars 51 forks source link

Memory leak: delete PyObjects when those are not needed anymore #37

Closed akorobko closed 4 years ago

akorobko commented 4 years ago

This is to fix issue #34

When SWIG generates wrappers, currently it is instructed to not generate default constructors and destructors for the Node and Path python objects. That causes SwigPyBuiltin_BadDealloc() function to be called when garbage collector tries to remove object. That function does nothing, so, PyObject leaks.

Instead, what we need is to call SwigPyBuiltin_destructor_closure(), which deletes PyObject, but keeps payload (data structure returned by MeCab itself) intact (since SWIG is instructed to not own it).

To facilitate this, we should replace

%nodefault mecab_path_t;
%nodefault mecab_node_t;

with

%nodefaultctor mecab_path_t;
%nodefaultctor mecab_node_t;

Here is the test code to check that it works.

import MeCab
import os
import psutil

process = psutil.Process(os.getpid())

def parse(tagger, text):
    node = tagger.parseToNode(text)
    while node:
        node = node.next

tagger = MeCab.Tagger('')
tagger.parse('')  # Hack for mecab-python3 0.7
for i in range(1000000):
    if i % 10000 == 0:
        print(f'{process.memory_info().data/(1024*1024):.2f}')

    parse(tagger, '日本語ですよ')

Before fix, the memory consumption grows from 7Mb to almost 200Mb. After fix, the consumption is stable.

polm commented 4 years ago

Thanks for the fix! I notice that one of the build checks failed, the log is huge but this is the important part:

gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/tmp/pip-req-build-e93f250n/build/libmecab/mecab/src -I/opt/_internal/cpython-3.8.0/include/python3.8 -c src/MeCab/MeCab_wrap.cpp -o build/temp.linux-x86_64-3.8/src/MeCab/MeCab_wrap.o -Wno-unused-variable
  src/MeCab/MeCab_wrap.cpp: In function ‘void SwigPyStaticVar_dealloc(PyDescrObject*)’:
  src/MeCab/MeCab_wrap.cpp:3120:3: error: ‘_PyObject_GC_UNTRACK’ was not declared in this scope
     _PyObject_GC_UNTRACK(descr);
     ^~~~~~~~~~~~~~~~~~~~
  src/MeCab/MeCab_wrap.cpp:3120:3: note: suggested alternative: ‘PyObject_GC_UnTrack’
     _PyObject_GC_UNTRACK(descr);
     ^~~~~~~~~~~~~~~~~~~~
     PyObject_GC_UnTrack

I don't entirely follow what's going on, but it looks like this might be due to a change in the Python API in 3.8. In that case I guess the cause here is SWIG not supporting Python 3.8. Based on the SWIG changelog it looks like it can be fixed with a more recent version of SWIG (the test seems to be using SWIG 3.0).

akorobko commented 4 years ago

@polm Hi, Paul. I have upgraded SWIG to 4.0.1 and it seems to be fine now. Surprisingly, SWIG authors did not mark it as released, so, you cannot find it prebuilt anywhere. But for the travis build, it works fine. Let me know if you can merge. Thanks!

polm commented 4 years ago

Thanks Alexey, unfortunately I am not actually a maintainer of this project and cannot merge the PR.

akorobko commented 4 years ago

@SamuraiT, @zackw Zack, Tatsuro, could you help with this?

akorobko commented 4 years ago

@polm Thanks. Do you know who would be the right person to reach?

polm commented 4 years ago

Zack and Tatsuro are the only maintainers as far as I know.

akorobko commented 4 years ago

Now build fails due to some travis internal issues. How do I re-trigger it?

KoichiYasuoka commented 4 years ago

Automatic download from sourceforge.net often fails. How about http://prdownloads.sourceforge.net/swig/swig-4.0.1.tar.gz instead?

akorobko commented 4 years ago

@KoichiYasuoka That is a good idea. I am not that familiar with sourceforge. Actually was quite surprised it is still first choice platform for some projects. Making change now. Are you maintainer?

KoichiYasuoka commented 4 years ago

I'm not a maintainer here myself. I'm a user of mecab-python3, especially for my UD-Kanbun. I wish @SamuraiT and @zackw be aware of this PR.

akorobko commented 4 years ago

I wish @SamuraiT and @zackw be aware of this PR.

I have tried to reach them over the email (from their web sites respectively), but no luck. Looks like this project is completely abandoned.

polm commented 4 years ago

Well that's unfortunate but thanks for trying.

I would recommend using an alternative like my fugashi or natto-py, but for now I forked this repo and merged your PR so people can install from source if they want. Not sure if it makes sense to register a new name on pypi.

Based on the standards for abandonment this project can't be claimed on PyPI until a year after the last release, which would be April 2020.

akorobko commented 4 years ago

@polm Thanks! Unfortunately, in my case, I need package which is installable from PyPI without requiring that MeCab itself being previously compiled and installed. That is the reason I do not switch to fugashi myself.

I will give little bit more time for maintainers to notice this PR. They seem to be active on GitHub, but for some reason ignoring this project and all related correspondence. Maybe they are on vacation or something like that...

KoichiYasuoka commented 4 years ago

Thank you @polm for your fork with this PR. However, I think your fugashi is more suitable if fugashi can include build-bundled-libmecab.py for the environments without mecab-config...

polm commented 4 years ago

Good point, I opened an issue for bundling libmecab. Not sure when I'll get a chance to work on it but I'll keep it in mind at least.

zackw commented 4 years ago

I want to apologize for ignoring this PR for so long. I still don't have the time required to work on mecab-python3 properly. I also don't understand SWIG well enough to know whether this is the correct fix for the memory leak. And I'm also really hesitant to merge changes to the SWIG file because the SWIG file was originally copied verbatim from the MeCab library itself and so bugs in the SWIG file should ideally be addressed there ... but that project is pretty much dead too.

And on top of all that, I've been ignoring @polm's request to become a co-maintainer for almost a year now because I can't act on it. Only @SamuraiT has the ability to grant people privileges on either this repository or the PyPI package. I haven't heard from him in years myself. I can forward the request but I doubt it will do any good.

Put it all together and I'm inclined to say that you should wait until April, reclaim the PyPI package, and declare https://github.com/polm/mecab-python3 the new official repo for it. I'll cooperate with that to the extent I can.

akorobko commented 4 years ago

I do not have any words to comment this.

akorobko commented 4 years ago

Travis is failing again. This time it cannot fine flake8...

zackw commented 4 years ago

Travis is failing again. This time it cannot fine flake8...

It seems to be finding flake8 just fine; the problem is that flake8 is objecting to one of your changes:

$ /home/travis/build/SamuraiT/mecab-python3/.tox/py27/bin/flake8 . 
./scripts/build_requirements.py:12:80: E501 line too long (202 > 79 characters)
ERROR: InvocationError: '/home/travis/build/SamuraiT/mecab-python3/.tox/py27/bin/flake8 .'
py27 finish: runtests after 0.41 seconds

The important line here is the second one: scripts/build_requirements.py has a line that's too long.

akorobko commented 4 years ago

Thanks! Trying to fight this now. Update: build is passing now.

polm commented 4 years ago

Hey @zackw, thanks for responding! If you're able to merge things that puts us in a much better situation.

I think this is a good PR and should be merged.

The script in the first post makes it clear this resolves the memory use issues, and explanation of why it works is also good. Your point about this being an upstream issue makes sense but this repo only exists because upstream MeCab is abandoned. (The Sourceforge URL was a good catch though!)

Not sure it'll help but I tried mailing @SamuraiT myself. I'll see if maybe there's another way to get in touch with him. If not we can try asking PyPI about what to do, since the abandonment guidelines don't seem to deal with a half-abandoned state.

Anyway, thanks again @zackw for making time to deal with this. Hopefully things will be cleaned up before too long.

zackw commented 4 years ago

Merged. I can do a new PyPI release later today, but ... should I? Would that reset the clock on the PyPI abandonment process?

polm commented 4 years ago

My guess is it would reset the clock on the abandonment process... :/

I'm inclined to say you should release it anyway, but maybe we can wait for a bit to see if SamuraiT replies and also to create an issue at pypa/pypi-support explaining the situation and asking what we should do.

SamuraiT commented 4 years ago

@polm , @zackw , @akorobko I apologize that I didn't open an email and github for a while. I really didn't noticed the PR and the message. sorry for inconvenience guys, I set @zackw as the owner of mecab-python3 in pypi.

Since I'm inactive now, so it's great idea to transfer the owner. I want transfer to @zackw or create organization and transfer to it.

what would be the best ? I want to ask for the ideas. any ideas welcomed either create & transfer to organization or to @zackw.

zackw commented 4 years ago

@SamuraiT I think it would be better if you transfer ownership of this repo and the pypi package to @polm. I can only rarely find time to work on mecab. @polm is also much more knowledgeable about Japanese tokenization, so he will be a better maintainer than I can be.

SamuraiT commented 4 years ago

@zackw thank you for advice. let me transfer to @polm then. @polm, could you tell me your username of pypi ? I'lll add you as an owner.

I found you on pypi so I added

akorobko commented 4 years ago

@SamuraiT, @polm, @zackw Thank you!

akorobko commented 4 years ago

@polm will you do a new PyPI release soon?

polm commented 4 years ago

@SamuraiT Thank you very much! I'm glad we waited before asking about abandonment.

@akorobko I'll make a new release today or tomorrow with this fix. Thanks again for figuring out the SWIG issue!

akorobko commented 4 years ago

@polm Thank you! Ping me if you ever encounter any other issues with this wrapper.

polm commented 4 years ago

I just released 0.996.3 and confirmed the test script in the first post in this thread runs as expected, showing the memory issue is resolved. Thanks again to @akorobko for the fix and thanks to everyone involved in getting this out!

KoichiYasuoka commented 4 years ago

Thank you @polm and I've tested 0.996.3, however, in Debian without swig the installation fails:

swigging src/MeCab/MeCab.i to src/MeCab/MeCab_wrap.cpp
swig -python -O -builtin -c++ -py3 -I/usr/include -o src/MeCab/MeCab_wrap.cpp src/MeCab/MeCab.i
unable to execute 'swig': No such file or directory
error: command 'swig' failed with exit status 1

Umm... automatic download of swig-4.0.1 does not work well...

KoichiYasuoka commented 4 years ago

Oops, mecab-python3-0.996.3 is now distributed by tar.gz (source distribution) only. Yes, now I understand why I need swig separately installed before, since there is no binary wheel. But I think it's incompatible (in installing) from 0.996.2...

polm commented 4 years ago

@KoichiYasuoka Thanks for the report, I opened a new issue at #39 to avoid cluttering this one, let's move discussion there.