ahmetaa / zemberek-nlp

NLP tools for Turkish.

High memory usage question #236

Open ilkerhk opened 4 years ago

ilkerhk commented 4 years ago

Hello,

I mapped a global shortcut to a small script which gets the X-selection, corrects it, and pastes the corrected text back. The script uses your library which works great. Thanks for this nice API, I am very happy with it.

However, I realized that it uses about 1 GB of memory, and I keep the program running in the background all the time for fast response times. I am only calling the normalizer (details below; maybe I am doing something wrong).

1 GB seems high to me; is that normal? Is there a way to reduce it? My Java is rusty, but maybe there is a way to load only the part that is needed (which is the normalizer).

Thanks.

Running this in the background: `java -classpath turkcelestir/zemberek-full.jar:./turkcelestir trCorrIlk`

Here, trCorrIlk is a small Java program that calls the Zemberek API as below: `strP = normalizer.normalize(str);`
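For context, a minimal sketch of what such a program might look like, assuming the normalization lookups and the bigram language model from zemberek-data are available locally (the paths and the class name below are placeholders, not the actual script):

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import zemberek.morphology.TurkishMorphology;
import zemberek.normalization.TurkishSentenceNormalizer;

public class TrCorrIlk {
  public static void main(String[] args) throws IOException {
    // The normalizer needs a morphology instance.
    TurkishMorphology morphology = TurkishMorphology.createWithDefaults();

    // Placeholder paths: point these at the zemberek-data
    // normalization lookups and the bigram language model.
    Path lookupRoot = Paths.get("zemberek-data/normalization");
    Path lmPath = Paths.get("zemberek-data/lm/lm.2gram.slm");

    TurkishSentenceNormalizer normalizer =
        new TurkishSentenceNormalizer(morphology, lookupRoot, lmPath);

    // Normalize whatever was passed on the command line.
    String str = String.join(" ", args);
    String strP = normalizer.normalize(str);
    System.out.println(strP);
  }
}
```

Everything loaded in that constructor chain (morphology, lookups, language model) stays resident, which is why the process is kept running in the background rather than started per invocation.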

mdakin commented 4 years ago

@ilkerhk Afaik, normalization depends on a language model, which can use quite a bit of memory. @ahmetaa can you confirm? Also, normalization and similar tasks may have higher initialization time and requirements, so they are better suited to long-running processes or servers.

ahmetaa commented 4 years ago

@ilkerhk @mdakin is right. Normalization consumes more memory because it loads a bigram language model and some large lookup tables into memory. Even though the language model uses succinct data structures, it still takes at least 100 MB. Another culprit is the spelling graph: normalization uses a spell checker whose graph takes a lot of memory because it is not really memory-optimized.

I tested the latest grpc server with thousands of sentences, applying morphological analysis and normalization operations. Here is the result:

[Graph: memory usage of the grpc server with normalization enabled (mem-zem)]

I would say this amount of memory is more or less expected. If I do not use normalization at all (running the server without the data root), the graph becomes:

[Graph: memory usage of the grpc server without normalization (mem-zem-no-norm)]

It is clear that without normalization the system uses much less memory, at most around 250 MB.

mdakin commented 4 years ago

@ahmetaa The language model's memory usage is probably larger than 100 MB (the model file is 80 MB, and I assume it does not map 1:1 into memory). From your graphs, normalization alone adds about 1 GB on top of Zemberek; it could be interesting to see a breakdown of the usage.
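One rough way to get such a breakdown (just a sketch using plain JVM heap deltas, not a proper profiler, and with placeholder data paths) is to measure used heap before and after each component is constructed:

```java
import java.nio.file.Paths;

import zemberek.morphology.TurkishMorphology;
import zemberek.normalization.TurkishSentenceNormalizer;

public class MemBreakdown {

  // Approximate used heap in MB after suggesting a GC.
  static long usedMb() {
    System.gc();
    Runtime rt = Runtime.getRuntime();
    return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
  }

  public static void main(String[] args) throws Exception {
    long base = usedMb();

    TurkishMorphology morphology = TurkishMorphology.createWithDefaults();
    long afterMorphology = usedMb();

    // Placeholder paths for the normalization lookups and bigram LM.
    TurkishSentenceNormalizer normalizer = new TurkishSentenceNormalizer(
        morphology,
        Paths.get("zemberek-data/normalization"),
        Paths.get("zemberek-data/lm/lm.2gram.slm"));
    long afterNormalizer = usedMb();

    System.out.println("morphology : ~" + (afterMorphology - base) + " MB");
    System.out.println("normalizer : ~" + (afterNormalizer - afterMorphology) + " MB");

    // Use the normalizer so its data cannot be collected before the measurement.
    System.out.println(normalizer.normalize("ornek cumle"));
  }
}
```

The numbers are only approximate (GC is not guaranteed to run), but they should be enough to see how the footprint splits between morphology and the normalization data.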

ilkerhk commented 3 years ago

I've been using your tool for more than a year now. Here is how I use it on my PC: select text (typed in the English alphabet), then press a keyboard shortcut that calls the Zemberek normalizer, which deasciifies and corrects the text. It works great; in fact, it has become one of the few indispensable tools in my workflow. Thank you for this nice tool.

However, I have a few suggestions:

1. If a word starts with an uppercase letter, correct it but do not change the case. Even better, provide an option for this.
2. If a word has more than two uppercase letters, do not correct it at all; for example, TDED, TUSAS, LaTeX, etc.
3. Provide a workaround for the high memory usage so it can be used as a standalone PC app. (This can be done by setting the memory-heavy variables to null and calling the garbage collector if the class has not been used for some time. On the next call those variables need to be re-initialized, but that is fast. With my workaround, memory drops to a few MB and the next call completes in about 2 seconds.)

I wrote a bash script and a Java wrapper that give me the above three things. I can share them if you want, but they are just a workaround since I don't know the internals of Zemberek. I believe the developers can integrate a better solution that provides the above features.
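A minimal sketch of the kind of wrapper described in suggestion 3 (the class name, paths, and idle limit are made up; the point is only to drop the heavy objects after an idle period and rebuild them on demand):

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import zemberek.morphology.TurkishMorphology;
import zemberek.normalization.TurkishSentenceNormalizer;

public class LazyNormalizer {

  private static final long IDLE_LIMIT_MS = 5 * 60 * 1000; // release after 5 min idle

  // Placeholder paths to the normalization lookups and bigram LM.
  private final Path lookupRoot = Paths.get("zemberek-data/normalization");
  private final Path lmPath = Paths.get("zemberek-data/lm/lm.2gram.slm");

  private TurkishSentenceNormalizer normalizer; // null while released
  private long lastUsed;

  public synchronized String normalize(String input) throws IOException {
    if (normalizer == null) {
      // Slow path: reload morphology, lookups and language model (a few seconds).
      TurkishMorphology morphology = TurkishMorphology.createWithDefaults();
      normalizer = new TurkishSentenceNormalizer(morphology, lookupRoot, lmPath);
    }
    lastUsed = System.currentTimeMillis();
    return normalizer.normalize(input);
  }

  // Call periodically (e.g. from a timer thread): drop the heavy objects
  // so the GC can reclaim the memory once the normalizer has been idle.
  public synchronized void releaseIfIdle() {
    if (normalizer != null && System.currentTimeMillis() - lastUsed > IDLE_LIMIT_MS) {
      normalizer = null;
      System.gc();
    }
  }
}
```

This trades resident memory for occasional slow calls, which matches the behavior described above (a few MB when idle, about 2 seconds for the first call after release).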