Feature: Reduce joex memory usage when idling

svenihoney commented 3 years ago

I am aware of the high memory usage; I have seen #284.

It would be nice if joex's memory usage could be reduced when it is idle. The process uses 1.5G of RAM while idling, which is most of the time. An option to clear out the ML heap, or whatever is using the memory, for a usage pattern like mine (max. 4 documents/week) would ease the memory constraints on my 4G mini server enormously.

eikek commented 3 years ago

Hi @svenihoney, thanks for bringing this up again. Yes, this should be better, and more configurable. I'm currently thinking about reorganizing this so one can opt out of NLP completely, since NLP requires some memory to load the different language models. Currently they are loaded on first use and kept there "forever" (because it takes time).

If you give the jvm process 1.5G memory, it will eventually use it, but it also frees memory inside its vm rather quickly when it is no longer needed, so other threads of the process can use it. It also releases memory back to the OS eventually, but when this happens depends on your architecture, jvm version and settings (GC used etc). In most situations, if the OS needs memory, the vm is informed and will release some if it can.

So if the language model files are garbage-collected at some point, the vm should be able to release more memory back to the OS. However, when and how much is released is not enforced; it is up to the jvm (newer versions should be used here). As part of this issue, I'm going to implement a constraint on holding the memory-intensive language model data. For reorganizing the processing, there is #263.

I'm also using a 4G mini board to run docspell, postgres and solr. It works really well, as long as not much else is running there. A workaround for your usage pattern might be to remove the joex component from the mini server and start it on demand on your laptop or another more powerful computer. The 24/7 low-power machine can run the database, solr and the rest-server, which allows you to access all your files all the time and to submit processing requests. The actual processing could be moved to another machine, or you could set up a timer that starts joex every Sunday or so and shuts it down after some time. I'm aware that these are not solutions, only workarounds :-).

svenihoney commented 3 years ago

Well, it works on my server as well, but it is using more or less the whole swap space ;-)

> Currently they are loaded on first use and kept there "forever" (because it takes time).

It would be perfect for me if joex could optionally unload the NLP models after use. With my low-power user profile I don't care whether the processing takes 1 or 5 minutes because the NLP models have to be loaded each time. Also, perhaps some way to set the nice level of the joex process would be good for those of us in a no-server-on-steroids environment...

eikek commented 3 years ago

Oh you have a 4G RAM board and it has to use a large swap space? I think this is not so for me… I have to check when I'm back home. What architecture is your server?

Setting the nice level of the joex process can be done like with any other process? I'm not sure if I understand :)
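For the non-docker case, raising the niceness of a child process at launch could look roughly like the sketch below. This is only an illustration: `run_with_niceness` is a hypothetical helper, not part of docspell, and the increment of 10 is an arbitrary assumption. It relies on `os.nice` and `preexec_fn`, so it only works on POSIX systems.

```python
import os
import subprocess
import sys


def run_with_niceness(cmd, increment=10, **popen_kwargs):
    """Start cmd with its niceness raised by `increment` (POSIX only).

    Hypothetical helper for illustration; the child calls os.nice()
    right before exec, so only the child's priority is lowered.
    """
    return subprocess.Popen(
        cmd,
        preexec_fn=lambda: os.nice(increment),
        **popen_kwargs,
    )


if __name__ == "__main__":
    # Demo: the child prints its own niceness (os.nice(0) returns it unchanged).
    child = run_with_niceness(
        [sys.executable, "-c", "import os; print(os.nice(0))"],
        increment=10,
    )
    child.wait()
```

The same effect is available from a shell with the standard `nice` command; the sketch just shows that nothing joex-specific is needed.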

svenihoney commented 3 years ago

Since docspell is not the only process running on this device, yes.

As for the nice level: if you run the whole thing in docker (as I do), it is not so easy to set the nice level afterwards. It would need to be some environment variable or similar in the docker container. Not so important, though.

eikek commented 3 years ago

Ah I see, thanks. I sometimes forget about docker here :) I'm all for being able to configure these things. I don't know much about docker, so I guess I need some input here. I would have thought that this could be specified to the docker command … (as it might be "dangerous" to have containers set their own niceness? just a thought, have no clue how this is supposed to be with docker)

eikek commented 3 years ago

There is now a timer that clears all the nlp caches after a while (can be configured). This results in this picture:

[Image: Selection_098]

The jvm can then release memory back to the OS (the orange line shows the part that the jvm process has allocated for its "heap", which can grow). For java8 I'm using the G1GC garbage collector, which can be enabled with the option -XX:+UseG1GC; for later versions this is the default (I think).
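To illustrate the idea of clearing caches after an idle period, here is a minimal sketch in Python. It is not docspell's actual implementation (which runs on the JVM); `IdleCache`, `loader` and `idle_seconds` are hypothetical names, and the injectable `clock` is only there to make the behaviour easy to check.

```python
import threading
import time


class IdleCache:
    """Keeps expensive objects and drops them all after a period without use.

    Illustrative sketch only; names and structure are assumptions.
    """

    def __init__(self, loader, idle_seconds, clock=time.monotonic):
        self._loader = loader            # called to (re)load a missing entry
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}
        self._last_used = clock()

    def get(self, key):
        """Return the cached value, loading it on first use."""
        with self._lock:
            self._last_used = self._clock()
            if key not in self._entries:
                self._entries[key] = self._loader(key)
            return self._entries[key]

    def clear_if_idle(self):
        """Drop all entries if nothing was requested for idle_seconds.

        Returns True if the cache was cleared.
        """
        with self._lock:
            idle_for = self._clock() - self._last_used
            if self._entries and idle_for >= self._idle_seconds:
                self._entries.clear()
                return True
            return False
```

A periodic timer (for example a daemon thread that sleeps and calls `clear_if_idle()`) would drive the clearing. Once the entries are dropped, the garbage collector can reclaim them; whether and when that memory goes back to the OS is still up to the runtime.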

In my tests here (Linux, Java8, G1GC) the RSS memory metric for the jvm process dropped from ~1.8G to ~600M in top.
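For anyone who wants to track the same metric without watching top, the resident set size can be read from `/proc/<pid>/status` on Linux. The sketch below assumes a Linux system; the function names are made up for illustration and the sample text in the usage note is not real measurement data.

```python
import re
from pathlib import Path
from typing import Optional


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB) from /proc/<pid>/status content."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def rss_kib(pid: int) -> Optional[int]:
    """Read the resident set size of a running process (Linux only)."""
    return parse_vm_rss_kib(Path(f"/proc/{pid}/status").read_text())
```

Calling `rss_kib(pid)` with the pid of the joex jvm before and after the caches are cleared should show roughly the same drop that top reports.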

eikek commented 3 years ago

FTR, I also tested with docker and java11. There G1GC seems to be active by default, and docker stats shows that memory is reclaimed by the OS: the joex container went from 1.3G to 450M after clearing the caches.

eikek / docspell

Feature: Reduce joex memory usage when idling #509