adamretter opened this issue 1 year ago
Are the .xconf files available for reference? There are various options in eXist's Lucene-based full text and range index configuration files. It might be helpful to know which aspects were in play in this case.
@adamretter I added the collection.xconf above. For one of my test runs, I reduced it by removing some (more or less random) parts from both the full-text and the range index config, but that did not seem to change anything.
If you need any additional info, just let me know.
Concerned that this issue might affect the release of a forthcoming update to my application, which depends upon the facets and fields feature and fixes included in eXist 6.2.0, I performed some tests to assess whether my application is affected.
Starting from a fresh, stock instance of eXist 6.2.0 (with all defaults, including 2GB Xmx), I carried out these steps:
I stored the index configuration files under /db/system/config on a complete hsg-project system, as shown in the attached zip file, to ensure that all indexes are configured before any data is stored. Once indexing was complete, I captured a heap dump with: jmap -dump:format=b,file=/tmp/heapdump.bin <pid>
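As an aside, the same heap dump can be captured programmatically instead of via jmap, which can be handy for scripting test runs. A sketch using the HotSpot diagnostic MXBean (the output path here is illustrative):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDump {

    // In-process equivalent of: jmap -dump:format=b,file=<path> <pid>
    // Note: dumpHeap fails if a file already exists at the given path.
    public static void dump(String path, boolean liveOnly) throws java.io.IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, liveOnly); // liveOnly=true dumps only reachable objects
    }

    public static void main(String[] args) throws Exception {
        // Use a unique path so repeated runs don't collide with an existing file.
        String path = args.length > 0
                ? args[0]
                : "/tmp/heapdump-" + System.nanoTime() + ".hprof";
        dump(path, true);
        System.out.println("wrote " + path);
    }
}
```

The resulting .hprof file can be opened in YourKit or Eclipse MAT exactly like one produced by jmap.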
In analyzing the heap dump, we determined that after indexing completed, the instance retained about 40 MB of memory for Lucene indexes, out of 392 MB of total retained memory - about 10%.
Based on these results, it appears that Lucene is holding onto memory that it should release after indexing completes. While the retained memory did not cause eXist to crash, the findings appear consistent with Dario's, and may indicate that a leak is present.
I would note that @dariok's dataset is about 8x larger than mine (mine weighs in at 4.37 GB across 37k XML documents, whereas Dario's is 35 GB with 40k documents). His also has a more extensive and granular set of index definitions, down to <tei:w> - i.e., every individual word - whereas mine indexes at the <tei:div> (article) level.
I did some more heap dump digging and saw a lot of objects held in the BTreeCache, and references to it in a ThreadLocal.
Here are some more details:
@adamretter Is there a way to access this cache via JMX and try to flush it?
@reinhapa As per my screenshot at the top, I also saw objects in the BTreeCache with a retained size of ~1 GB; however, I believe that to be correct and in line with the cache sizing configured in the conf.xml files.
I don't know of a way to clear the BTreeCache via JMX (see the org.exist.management.impl package); however, it should not be too hard to expose something if you need it.
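A rough sketch of what such an exposure could look like. The MXBean interface, class, and object names below are hypothetical, not eXist's actual org.exist.management.impl beans; a real implementation would delegate to the cache's clear/sync logic:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical MXBean contract (must be public for JMX to accept it).
public interface CacheFlushMXBean {
    void flush();          // operation invokable from jconsole/VisualVM
    long getCachedBytes(); // read-only attribute for monitoring
}

class BTreeCacheFlush implements CacheFlushMXBean {
    @Override public void flush() { /* delegate to the cache's clear logic here */ }
    @Override public long getCachedBytes() { return 0L; /* placeholder value */ }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Illustrative object name, following the org.exist.management domain.
        ObjectName name = new ObjectName("org.exist.management:type=BTreeCacheFlush");
        server.registerMBean(new BTreeCacheFlush(), name);
    }
}
```

Once registered, the flush operation shows up under the bean's Operations tab in any JMX client attached to the eXist process.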
I was able to reproduce the memory leak using the original dataset provided by @dariok. With enough resources, about 10 GB of RAM, the process did finish successfully. After restoring the entire 35 GB dataset, containing 26874 files (XML plus .html, .xsl, and .svg), almost 5 GB of RAM remained retained.
There are ~59000 instances of TriggerStatePerThread$TriggerState. Since there are two instances for each restored item, that makes ~29500 CREATE_DOCUMENT triggers - one for the BEFORE and one for the AFTER phase. I was then wondering whether I was missing ~3000 documents, until I discovered that CREATE_COLLECTION was also invoked:
The original collection.xconf defines an update trigger as such:
<triggers>
    <trigger event="update" class="org.exist.collections.triggers.XQueryTrigger">
        <parameter name="url" value="xmldb:exist:///db/apps/path/to/module.xql"/>
    </trigger>
</triggers>
With the collection.xconf as it is in the issue description (without any triggers configured), the outcome was very different.
RAM usage stayed within reasonable bounds (<2 GB) and all memory was freed afterwards (see screenshot).
My analysis indicates that TriggerStates are held in memory indefinitely and are the cause of the leak. This is especially interesting because the events that fired are not the ones defined in the configuration, so one would not expect them to fire at all.
Just to be clear: based on my findings, it is unlikely that the changes to fields are related to this issue.
@joewiz does your dataset define triggers?
@line-o Not in production, but yes in the version that is being prepared to deploy to production: https://github.com/HistoryAtState/frus/blame/master/volumes.xconf#L2-L9. This is the version that I used in the results I shared above.
@line-o @adamretter when would/should such a trigger event be processed/completed?
@reinhapa I don't know / have yet to familiarise myself with the Trigger implementation.
@reinhapa @line-o This is very different to the memory profiles that I showed above. The profiles I posted clearly show the memory usage to be within the RangeIndex and LuceneIndex. Perhaps @line-o has found another different memory leak...
TriggerStates are held in memory indefinitely
These are ThreadLocal instances, and the intention is that they are cleared when each Trigger completes - see TriggerStatePerThread#clear().
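For illustration, here is a minimal sketch (class, enum, and method names are illustrative, not eXist's actual implementation) of how a ThreadLocal-held trigger state leaks on pooled threads unless the clearing method is called on every completion path:

```java
public class TriggerStateSketch {

    enum Phase { BEFORE, AFTER }

    // One slot per thread; on long-lived pool threads the value stays
    // strongly reachable until remove() is called.
    private static final ThreadLocal<Phase> STATE = new ThreadLocal<>();

    static void begin(Phase p) {
        STATE.set(p);
    }

    static Phase current() {
        return STATE.get();
    }

    // Analogue of TriggerStatePerThread#clear(): if any code path
    // (e.g. an unexpected event type, or an exception) skips this call,
    // the state accumulates for the lifetime of the worker thread.
    static void clear() {
        STATE.remove();
    }

    public static void main(String[] args) {
        begin(Phase.BEFORE);
        System.out.println(current()); // BEFORE
        clear();
        System.out.println(current()); // null
    }
}
```

Because broker threads in a database are typically pooled and never die, any code path that skips clear() keeps its state reachable forever, which matches the observed accumulation of ~59000 instances.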
Hmm, if I understand @line-o correctly, the import succeeded when given enough memory, and the Range and Lucene indexes free up their memory after the restore, right?
If the memory at the end returns to normal levels, I would not directly call that part a memory leak in the classic sense, but rather an uncontrolled temporary overshoot (which we definitely need to look into).
@adamretter those triggers should be processed and removed afterwards, right? As for the Full-Text and Range indexes: could it be that they are not fast enough in processing, and we are experiencing some sort of memory back pressure there?
@reinhapa Correct: with enough resources, restoring the backup succeeds with or without triggers configured, and only the TriggerStates are not removed from memory afterwards.
Please note that after removing the trigger - while leaving the rest of the configuration untouched - the restore process went smoothly, and memory consumption stayed within lower bounds as far as I could see in @line-o's reports during testing. The Lucene-based indexes thus seem to behave as expected under normal conditions, i.e. when memory is not under pressure from other memory leaks.
It is very strange that we are seeing different things from the same process. I will attempt to reproduce my findings again when I have a moment. If I can reproduce them, perhaps I should post a video, as presumably there must be something different between our approaches to reproducing this.
Maybe you get different results because you used -XX:+HeapDumpOnOutOfMemoryError, which means that the system was already in a state in which it was fighting for its life, thus showing misleading results? @line-o, on the other hand, made sure the process would complete by giving it enough memory, so this could make a difference.
@wolfgangmm That's an interesting idea! I didn't understand from @line-o's report that this was what he had done. I will try with and without -XX:+HeapDumpOnOutOfMemoryError, see if that makes a difference, and report back.
It has been 4 months since I reported my findings. Was anybody able to falsify or reproduce them in the meantime?
ping...
I think there are clearly two different things being reported in this issue:
- The memory leak I initially reported in the Full-Text and Range indexes (implemented atop Lucene)
- The issue reported by @line-o with regard to the TriggerStatePerThread possibly not being cleared correctly
I will try and find the Heap Dump that @dariok sent me previously and take a look into (1) as soon as I have some free time. The first thing I will do is provide a checksum of the heap dump file, and clearer steps to reproduce the issue.
For (1) if we assume that @line-o used the same Heap Dump from @dariok as I did, then I think he must have taken different steps to me as we got quite different results. If @line-o could provide a checksum of his heap dump file (so we can ensure we are using the same heap dumps), and exact steps to reproduce what he is seeing, I could then look into that also...
Hi Adam, would you be available to discuss this subject in an upcoming community call?
I am not using a heap dump, but reproduced the issue by restoring the backup into an instance with 16 GB of RAM.
I think there are clearly two different things being reported in this issue:
- The memory leak I initially reported in the Full-Text and Range indexes (implemented atop Lucene)
- The issue reported by @line-o with regard to the TriggerStatePerThread possibly not being cleared correctly
@adamretter In my tests using the backup file given to me by @line-o and @dariok, I could observe memory issues only when triggers were activated in the backup. I would very much like to discuss potential further tests and possible solutions in an upcoming community call, as suggested by @dizzzz; I find it hard to discuss this here.
The first thing I will do is provide a checksum of the heap dump file
I have added the exact details now at the top of this thread in the description of the issue.
Hi Adam, would you be available to discuss this subject in an upcoming community call?
@dizzzz No, I will not attend the community calls. I stopped attending them as it became a toxic environment where I no longer felt welcome; I raised that issue with several people, and nothing has been done about it. If you want to discuss this issue with me, you are welcome to contact me directly.
@reinhapa You are also welcome to contact me directly.
I think as I stated above, there are clearly two different technical issues. The issue that I opened is about the memory leak with Lucene in eXist-db versions after 6.0.1. The two heap dump files that I was provided, which clearly show this, are detailed at the top of this issue.
I think the potential other issue that @line-o identified should be moved into its own new issue as it is different to this one, and conflating them in this single issue thread just leads to confusion.
@line-o @reinhapa Would you be able to open a new issue for the other memory leak you believe you are seeing please, and copy all details there and add anything else that is needed to reproduce it step-by-step. I would like to see if I can also reproduce that one...
I am all for it @adamretter, will do. For the record: I can not reproduce the issue you describe.
I can not reproduce the issue you describe.
Yes, I was already clear on that. I am not sure why, as I and others can reproduce it. Anyway, I will work on some way of presenting further evidence... perhaps a screen recording showing the issue.
I haven't seen anyone else commenting here that they were able to reproduce the issue.
I haven't seen anyone else commenting here that they were able to reproduce the issue.
Well, there are clearly two people reporting this here: me and Dario. In addition, there are some customers of ours (Evolved Binary) who won't comment on GitHub issues...
The issue as described here did not lead to a memory leak in my testing. The trigger definition needs to be added, which is the case in @dariok's original backup.
It does lead to a memory leak in my and others' testing, even without the triggers being defined.
I don't see how repeating that you can't see the issue helps anyone. Let's move forward, please - as requested, feel free to open a separate issue for the separate thing you are seeing, @line-o.
@dariok since you were the first person to notice a problem when you restored your backup: did you retest without triggers?
@reinhapa @dizzzz I haven't heard back from @line-o about this, do you know if he is on holiday? We would like to get this fix sent in ASAP!
@adamretter It is correct that I am on holidays and I won't be able to open a separate issue this week. I believe I will find time to do this next week, however. If you want to publish your PR in the meantime @dizzzz and @reinhapa may already be able to review the fix and the issue number can then be added afterwards.
@line-o we will await your issue next week
When restoring a clean instance of eXist-db 6.2.0 from a backup we are able to observe what we believe to be a memory leak in the Full-Text and Range indexes of eXist-db; to be clear I am referring to the implementations of both of those indexes that are backed by Lucene.
The database backup is approximately 40,000 XML documents (35GB).
Having configured the JVM for eXist-db with -XX:+HeapDumpOnOutOfMemoryError and -XX:+ExitOnOutOfMemoryError, we captured the heap dumps from the JVM and analyzed them with YourKit Java Profiler.

Heap Dump 1
@dariok provided me with the file java_pid236689.hprof (md5sum: 54c669ddba7532bee2449922f2fba1d5). It shows that of the 2.9 GB available to eXist-db, 1.1 GB was retained (at the time that eXist-db ran out of memory) by the RangeIndex.

Heap Dump 2
@dariok provided me with the file java_pid318495.hprof (md5sum: 9a99f30986fe2579b8c54efe6547e05b). Re-running with a clean eXist-db instance and restoring the same backup again, but with 6 GB of memory available to eXist-db, we see that 2.5 GB and 1.1 GB were retained (at the time that eXist-db ran out of memory) by the LuceneIndex and RangeIndex respectively.

Observations
For reference, the collection.xconf in use looks like: