eXist-db / exist

eXist Native XML Database and Application Platform
https://exist-db.org
GNU Lesser General Public License v2.1

Memory Leak when restoring a Backup #4890

Open adamretter opened 1 year ago

adamretter commented 1 year ago

When restoring a clean instance of eXist-db 6.2.0 from a backup, we are able to observe what we believe to be a memory leak in the Full-Text and Range indexes of eXist-db; to be clear, I am referring to the implementations of both of those indexes that are backed by Lucene.

The database backup is approximately 40,000 XML documents (35 GB).

Having configured the JVM for eXist-db with -XX:+HeapDumpOnOutOfMemoryError and -XX:+ExitOnOutOfMemoryError, we captured the heap dumps from the JVM and analyzed them with YourKit Java Profiler.

Heap Dump 1

@dariok provided me with the file: java_pid236689.hprof (md5sum: 54c669ddba7532bee2449922f2fba1d5)

[screenshots: dump1-part1, dump1-part2]

These show that, of the 2.9 GB available to eXist-db, 1.1 GB was retained (at the time that eXist-db ran out of memory) by the RangeIndex.

Heap Dump 2

@dariok provided me with the file: java_pid318495.hprof (md5sum: 9a99f30986fe2579b8c54efe6547e05b)

[screenshot: dump2]

Re-running with a clean eXist-db instance and restoring the same backup again, but with 6 GB of memory available to eXist-db, we see that 2.5 GB and 1.1 GB were retained (at the time that eXist-db ran out of memory) by the LuceneIndex and RangeIndex respectively.

Observations

  1. The problem is not reproducible with eXist-db 6.0.1.
  2. The git history shows that the only changes to those indexes were in PR https://github.com/eXist-db/exist/pull/4541, which first shipped in eXist-db 6.1.0.

For reference, the collection.xconf in use looks like:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index xmlns:mets="http://www.loc.gov/METS/" xmlns:meta="https://github.com/dariok/wdbplus/wdbmeta" xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <lucene>
      <analyzer id="std" class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>

      <text qname="tei:persName"/>
      <text qname="tei:placeName"/>
      <text qname="tei:orgName"/>
      <text qname="tei:title"/>
      <text qname="tei:text"/>
      <text qname="tei:note"/>
      <text qname="tei:p"/>
      <text qname="tei:ab"/>
      <text qname="tei:opener"/>
      <text qname="tei:closer"/>
      <text qname="tei:postscript"/>
      <text qname="tei:cell"/>
      <text qname="tei:label"/>
      <text qname="tei:item"/>
      <text qname="tei:quote"/>
      <text qname="tei:head"/>
      <text qname="tei:l"/>
      <text qname="tei:w"/>

      <inline qname="tei:unclear"/>
      <inline qname="tei:supplied"/>
      <inline qname="tei:gap"/>
      <inline qname="tei:ex"/>
      <inline qname="tei:hi"/>

      <ignore qname="tei:listPerson"/>
      <ignore qname="tei:listPlace"/>
      <ignore qname="tei:listOrg"/>
      <ignore qname="tei:listBibl"/>
      <ignore qname="tei:note"/>

      <ignore qname="tei:pb"/>
      <ignore qname="tei:fw"/>
      <ignore qname="tei:lb"/>
      <ignore qname="tei:pb" /> 
      <ignore qname="tei:pc"/>
    </lucene>
    <range>
      <create qname="mets:file">
        <field name="mets-file-id" match="@ID" type="xs:string"/>
      </create>
      <create qname="meta:file">
        <field name="file-path" match="@path" type="xs:string"/>
      </create>
      <create qname="@ref" type="xs:string"/>
      <create qname="@cRef" type="xs:string"/>
      <create qname="@type" type="xs:string"/>
      <create qname="@place" type="xs:string"/>
      <create qname="tei:rs">
        <field name="rs-type" match="@type" type="xs:string"/>
        <field name="rs-ref" match="@ref" type="xs:string"/>
      </create>
      <create qname="@when" type="xs:date"/>
      <create qname="@notBefore" type="xs:date"/>
      <create qname="@notAfter" type="xs:date"/>
      <create qname="tei:w" type="xs:string"/>
    </range>
  </index>
</collection>
joewiz commented 1 year ago

Are the .xconf files available for reference? There are various options in eXist's Lucene-based full text and range index configuration files. It might be helpful to know which aspects were in play in this case.

dariok commented 1 year ago

@adamretter added the collection.xconf above. For one of my test runs, I reduced that one by removing some (more or less random) parts from both the full-text and the range index configs, but that did not seem to change anything.

If you need any additional info, just let me know.

joewiz commented 1 year ago

Concerned that this issue might affect the release of a forthcoming update to my application, which depends upon the facets and fields feature and fixes included in eXist 6.2.0, I performed some tests to try to assess whether my application is affected.

Starting from a fresh, stock instance of eXist 6.2.0 (with all defaults, including 2GB Xmx), I carried out these steps:

  1. Stored all history.state.gov collection configuration files (captured from /db/system/config on a complete hsg-project system, as shown in the attached zip file) to ensure that all indexes are configured before any data is stored.
  2. Uploaded all hsg-project repositories via WebDAV to store and index the data. (I then confirmed via monex's indexing pane that a sample facet index was successfully created.)
  3. Triggered a java heap dump via jmap -dump:format=b,file=/tmp/heapdump.bin <pid>.

In analyzing the heap dump, we determined that after indexing completed, the instance retained about 40 MB of memory for Lucene indexes, out of the 392 MB of total retained memory - about 10%.

Based on these results, it appears that Lucene is holding onto memory that it should release after indexing is complete. While the retained memory did not cause eXist to crash, the findings appear consistent with Dario's, and may demonstrate that a leak is present.

I would note that @dariok's dataset is about 8x larger than mine (mine weighs in at 4.37 GB in 37k XML documents, whereas Dario's is 35 GB with 40k documents). His also has a more extensive and granular set of index definitions, including <tei:w> - i.e., every individual word - whereas mine indexes at the <tei:div> (article) level.

reinhapa commented 1 year ago

I did some more heap dump digging and saw a lot of objects held in the BTreeCache, with references to them in a ThreadLocal: [screenshot]

Here are some more details: [screenshot]

@adamretter Is there a way to access this cache using JMX and try to flush it?

adamretter commented 1 year ago

@reinhapa As per my screenshots at the top, I also saw objects in the BTreeCache of about ~1 GB retained size; however, I believe that to be correct and in line with the sizing of the caches in the conf.xml files.

I don't know of a way to clear the BTreeCache via JMX (see the org.exist.management.impl package); however, it should not be too hard to expose something if you need it.
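
For illustration, exposing a flush operation over JMX could look like the following sketch; the CacheFlusher MBean and its flushAll() hook are hypothetical, not part of the existing org.exist.management.impl package:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical sketch only: neither this MBean nor a public flush hook on
// the BTreeCache exists in org.exist.management.impl today.
public class CacheFlusher implements CacheFlusherMBean {

    @Override
    public void flushAll() {
        // Here one would call into the BTreeCache to evict its entries;
        // eXist-db's internal cache API is deliberately not reproduced.
    }

    public static void register() throws Exception {
        final MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(new CacheFlusher(),
                new ObjectName("org.exist.management:type=CacheFlusher"));
    }
}

// Standard MBean convention: the interface name is the class name + "MBean".
interface CacheFlusherMBean {
    void flushAll();
}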

line-o commented 1 year ago

I was able to reproduce the memory leak using the original dataset provided by @dariok. With enough resources (about 10 GB of RAM), the process did finish successfully. After restoring the entire 35 GB dataset, containing 26,874 files (including .html, .xsl, and .svg files), almost 5 GB of RAM remained retained.

[Screenshot 2023-07-04 at 14 24 36]

There are ~59,000 instances of TriggerStatePerThread$TriggerState. Since there are two instances for each restored item, that amounts to ~29,500 CREATE_DOCUMENT triggers - one for the BEFORE and one for the AFTER phase.

[Screenshot 2023-07-05 at 11 32 00]

I was then wondering whether I was missing some 3,000 documents, until I discovered that CREATE_COLLECTION was also invoked:

[Screenshot 2023-07-05 at 12 08 21]

The original collection.xconf defines an update trigger as follows:

<triggers>
    <trigger event="update" class="org.exist.collections.triggers.XQueryTrigger">
        <parameter name="url" value="xmldb:exist:///db/apps/path/to/module.xql"/>
    </trigger>
</triggers>

With the collection.xconf as given in the issue description (i.e. without any triggers configured), the outcome was very different: RAM usage stayed within reasonable bounds (<2 GB) and all memory was freed afterwards (see screenshot).

[Screenshot 2023-07-05 at 09 26 17]

My analysis indicates that TriggerStates are held in memory indefinitely and are the cause of the leak. This is especially interesting, as the events that fired are not the ones that are defined in the configuration, and thus one would not expect them to be fired at all.
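
To illustrate the suspected mechanism (a generic sketch, not eXist-db's actual code): a value stored in a ThreadLocal remains strongly reachable from its thread until remove() is called, so on long-lived pooled threads, per-task state that is accumulated but never cleared is retained indefinitely:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadLocalRetentionDemo {

    // Hypothetical stand-in for a per-thread trigger-state holder;
    // this is not eXist-db code.
    private static final ThreadLocal<List<Object>> STATES =
            ThreadLocal.withInitial(ArrayList::new);

    public static void main(final String[] args) throws InterruptedException {
        // A single long-lived worker thread, analogous to a pooled broker thread.
        final ExecutorService pool = Executors.newFixedThreadPool(1);

        for (int i = 0; i < 29_500; i++) {
            pool.execute(() -> {
                // Each task records state for a BEFORE and an AFTER phase but
                // never clears it, so every object stays strongly reachable
                // from the worker thread's ThreadLocalMap.
                STATES.get().add(new Object());
                STATES.get().add(new Object());
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // A heap dump taken here would show ~59,000 retained objects held via
        // the worker thread, mirroring the retained TriggerState count above.
    }
}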

line-o commented 1 year ago

Just to be clear: based on my findings, it is unlikely that the changes to fields are related to this issue.

line-o commented 1 year ago

@joewiz does your dataset define triggers?

joewiz commented 1 year ago

@line-o Not in production, but yes in the version that is being prepared to deploy to production: https://github.com/HistoryAtState/frus/blame/master/volumes.xconf#L2-L9. This is the version that I used in the results I shared above.

reinhapa commented 1 year ago

@line-o @adamretter when would/should such a trigger event be processed/completed?

line-o commented 1 year ago

@reinhapa I don't know; I have yet to familiarise myself with the Trigger implementation.

adamretter commented 1 year ago

@reinhapa @line-o This is very different to the memory profiles that I showed above. The profiles I posted clearly show the memory usage to be within the RangeIndex and LuceneIndex. Perhaps @line-o has found another, different memory leak...

adamretter commented 1 year ago

TriggerStates are held in memory indefinitely

These are ThreadLocal instances, and the intention is that they should be cleared when each Trigger completes - see TriggerStatePerThread#clear().
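
For context, the usual lifecycle pattern for such per-thread state is to set it before the trigger phase runs and remove it in a finally block; a minimal sketch with hypothetical names (not the actual eXist-db implementation) follows:

// Illustrative sketch of the intended lifecycle; the names below are
// hypothetical and do not reproduce the actual TriggerStatePerThread API.
public final class TriggerStateHolder {

    private static final ThreadLocal<Object> STATE = new ThreadLocal<>();

    private TriggerStateHolder() {
    }

    public static void runWithState(final Object state, final Runnable phase) {
        STATE.set(state);
        try {
            phase.run(); // the BEFORE or AFTER trigger phase
        } finally {
            STATE.remove(); // the equivalent of TriggerStatePerThread#clear();
                            // skipping this on any code path leaks the state
        }
    }
}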

reinhapa commented 1 year ago

@reinhapa @line-o This is very different to the memory profiles that I showed above. The profiles I posted clearly show the memory usage to be within the RangeIndex and LuceneIndex. Perhaps @line-o has found another, different memory leak...

Hmm, if I understand @line-o correctly, the import succeeded when given enough memory, and the Range and Lucene indexes free up their memory after the restore, right?

If the memory at the end is at "normal" levels again, I would not directly call that part a memory leak in the classic sense, but rather an uncontrolled temporary overshoot (which we definitely need to look into).

@adamretter those triggers should be processed and removed afterwards, right? As for the Full-Text and Range indexes: could it be that those are not "fast" enough in processing, and we are experiencing some sort of memory back pressure there?

line-o commented 1 year ago

@reinhapa Correct: with enough resources, restoring the backup succeeds with or without triggers configured; only the TriggerStates are not removed from memory afterwards.

wolfgangmm commented 1 year ago

Please note that after removing the trigger (while leaving the rest of the configuration untouched) the restore process went smoothly, and memory consumption stayed within lower bounds as far as I could see in @line-o's reports during testing. The Lucene-based indexes thus seem to behave as expected under normal conditions, i.e. if memory is not under pressure due to other memory leaks.

adamretter commented 1 year ago

It is very strange that we are seeing different things from the same process. I will attempt to reproduce my findings again when I have a moment. If I can reproduce them, perhaps I should post a video, as presumably there must be something different between our approaches to reproducing this.

wolfgangmm commented 1 year ago

Maybe you get different results because you used -XX:+HeapDumpOnOutOfMemoryError, which means that the system was already in a state where it was fighting for its life, thus showing misleading results? @line-o, on the other hand, made sure the process would complete by giving it enough memory, so that could make a difference.

adamretter commented 1 year ago

@wolfgangmm That's an interesting idea! I didn't understand from @line-o's report that that was what he had done. I will try with and without -XX:+HeapDumpOnOutOfMemoryError, see whether that makes a difference, and report back.

line-o commented 9 months ago

It has been 4 months since I reported my findings. Was anybody able to falsify or reproduce them in the meantime?

dizzzz commented 4 months ago

ping...

adamretter commented 3 months ago

I think there are clearly two different things being reported in this issue:

  1. The memory leak I initially reported in the Full-Text and Range indexes (implemented atop Lucene).
  2. The issue reported by @line-o with regard to TriggerStatePerThread possibly not being cleared correctly.

I will try to find the heap dump that @dariok sent me previously and take a look into (1) as soon as I have some free time. The first thing I will do is provide a checksum of the heap dump file, and clearer steps to reproduce the issue.

For (1), if we assume that @line-o used the same heap dump from @dariok as I did, then I think he must have taken different steps from me, as we got quite different results. If @line-o could provide a checksum of his heap dump file (so we can ensure we are using the same heap dumps) and exact steps to reproduce what he is seeing, I could then look into that also...

dizzzz commented 2 months ago

Hi Adam, would you be available to discuss this subject in an upcoming community call?

line-o commented 2 months ago

I am not using a heap dump, but reproduced the issue by restoring the backup into an instance with 16 GB of RAM.

reinhapa commented 2 months ago

I think there are clearly two different things being reported in this issue:

  1. The memory leak I initially reported in the Full-Text and Range indexes (implemented atop Lucene).
  2. The issue reported by @line-o with regard to TriggerStatePerThread possibly not being cleared correctly.

@adamretter In my tests using the backup file given to me by @line-o and @dariok, I could only observe memory issues when triggers were activated with the backup file. I would very much like to discuss potential further tests and possible solutions in an upcoming community call, as suggested by @dizzzz. I find it hard to discuss it here.

adamretter commented 2 months ago

The first thing I will do is provide a checksum of the heap dump file

I have added the exact details now at the top of this thread in the description of the issue.

Hi Adam, would you be available to discuss this subject in an upcoming community call?

@dizzzz No, I will not attend the community calls. I stopped attending them as it became a toxic environment where I no longer felt welcome; I raised that issue with several people, and nothing has been done about it. If you want to discuss this issue with me, you are welcome to contact me directly.

@reinhapa You are also welcome to contact me directly.

I think as I stated above, there are clearly two different technical issues. The issue that I opened is about the memory leak with Lucene in eXist-db versions after 6.0.1. The two heap dump files that I was provided that clearly show this are detailed at the top of this issue.

I think the potential other issue that @line-o identified should be moved into its own new issue as it is different to this one, and conflating them in this single issue thread just leads to confusion.

adamretter commented 1 month ago

@line-o @reinhapa Would you be able to open a new issue for the other memory leak you believe you are seeing, please? Copy all details there and add anything else that is needed to reproduce it step by step. I would like to see if I can also reproduce that one...

line-o commented 1 month ago

I am all for it, @adamretter - will do. For the record: I cannot reproduce the issue you describe.

adamretter commented 1 month ago

I cannot reproduce the issue you describe.

Yes, I was already clear on that. I am not sure why, as I and others can reproduce it. Anyway, I will work on some way of presenting further evidence... perhaps a screen recording showing the issue.

line-o commented 1 month ago

I haven't seen anyone else commenting here that they were able to reproduce the issue.

adamretter commented 1 month ago

I haven't seen anyone else commenting here that they were able to reproduce the issue.

Well, there are clearly two people reporting this here - me and Dario. In addition, there are some customers of ours (Evolved Binary) who won't comment on GitHub issues...

line-o commented 1 month ago

The issue as described here did not lead to a memory leak in my testing. The trigger definition needs to be added, which is the case in @dariok's original backup.

adamretter commented 1 month ago

It does lead to a memory leak in my testing, and in others' testing, even without the triggers being defined.

I don't see how repeating that you can't see the issue helps anyone. Let's move forward, please - as requested, please feel free to open a separate issue for the separate thing you are seeing, @line-o.

line-o commented 1 month ago

@dariok since you were the first person to notice a problem when you restored your backup: did you retest without triggers?

adamretter commented 1 week ago

@line-o Separately to this issue, we have diagnosed your Trigger issue and have a bugfix ready to send in. As I previously requested ([1], [2], and [3]), please could you open a new issue for your Trigger issue; we can then send a PR for you with the fix marked as closing that issue.

adamretter commented 6 days ago

@reinhapa @dizzzz I haven't heard back from @line-o about this - do you know if he is on holiday? We would like to get this fix sent in ASAP!

line-o commented 6 days ago

@adamretter It is correct that I am on holiday, and I won't be able to open a separate issue this week. I believe I will find time to do this next week, however. If you want to publish your PR in the meantime, @dizzzz and @reinhapa may already be able to review the fix, and the issue number can then be added afterwards.

adamretter commented 5 days ago

@line-o we will await your issue next week