OxalisCommunity / Oxalis-AS4

PEPPOL AS4 pMode plugin for Oxalis

Memory leak in OxalisAlgorithmSuiteLoader #120

Closed jonassss closed 3 years ago

jonassss commented 4 years ago

Hi,

We pulled the newest commit (cf55f82) to potentially upgrade our solution. However, during stress testing the solution eventually crashed. The test consists of sending a large number of identical requests.

We also tried stopping the test before the solution crashed and observing the memory usage of the application over time. Expected behavior would be that the consumed memory of the application would drop after a period of inactivity, which we are not observing.

[Screenshot 2020-04-23 at 10 59 21]

Based on the two tests, we believe that there is a memory leak in the objects contained in the OxalisAlgorithmSuiteLoader.BUS_MAP. We took a heap dump of the application after 12 hours of inactivity. The retained sizes for this object are as follows:

no.difi.oxalis.as4.util.OxalisAlgorithmSuiteLoader.BUS_MAP - 592,130K
    org.apache.cxf.bus.extension.ExtensionManagerBus.extensions - 499,161K
        java.util.concurrent.ConcurrentHashMap.values - 489,030K
            org.apache.cxf.ws.policy.PolicyInterceptorProviderRegistryImpl.entries - 241,036K
            org.apache.cxf.ws.policy.AssertionBuilderRegistryImpl.registeredBuilders - 137,869K
            org.apache.cxf.wsdl11.WSDLManagerImpl.registry - 62,368K
            (100 more references totalling 47,750K)
    org.apache.cxf.bus.extension.ExtensionManagerBus.extensionManager - 86,956K
    (9 more references totalling 6,011K)

Note that we could be wrong in our assumption of the memory leak. In any case, the application does not seem to recover from a large number of requests, meaning we cannot introduce the newest version to production.

FrodeBjerkholt commented 4 years ago

Thanks for the thorough bug report - I will look into it. A possible solution is probably to replace the hashmap with a cache.
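For illustration, here is a minimal sketch of what replacing an unbounded map with a size-bounded cache could look like, using only the JDK. The class name `BoundedBusCache` and the size limit are hypothetical and are not taken from Oxalis-AS4; the actual hotfix may use a different cache implementation entirely.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded LRU cache; an illustration, not the Oxalis-AS4 fix. */
class BoundedBusCache<K, V> {
    private final Map<K, V> entries;

    BoundedBusCache(final int maxEntries) {
        // accessOrder = true keeps the least recently used entry first,
        // so removeEldestEntry evicts it once the limit is exceeded.
        this.entries = Collections.synchronizedMap(
            new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > maxEntries;
                }
            });
    }

    V get(K key) {
        return entries.get(key);
    }

    void put(K key, V value) {
        entries.put(key, value);
    }

    int size() {
        return entries.size();
    }
}
```

With a bound in place, evicted values become eligible for garbage collection as long as nothing else references them, so the map can no longer grow with every new key.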

jonassss commented 4 years ago

Forgot to mention that during the tests, we saw steadily increasing response times. The response times were still the same even hours after we stopped the tests.

To me, it looks like the assertion data from "old" transactions is not being garbage collected. My knowledge is limited here, but this does look like an underlying issue in the Apache CXF library.

Edit: On closer inspection, the map looks unnecessary. It is only ever used to check whether the code that sets the value should be executed. It therefore holds a reference to an object that is never actually retrieved from the map, which means you could achieve the same result by using a set and checking whether it contains the key.

However, when inspecting the logs we see that the key is actually never matched. This means that caching IDs to avoid duplicate registrations is unnecessary (at least in our use case).
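The set-based check described above can be sketched as follows. This is an assumption-laden illustration: `RegistrationGuard`, `registerOnce`, and the use of a string ID as the key are hypothetical names chosen for the example and do not reflect the actual Oxalis-AS4 code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the duplicate-registration check. */
class RegistrationGuard {
    // Stores only the IDs, so no reference to the registered object is retained.
    private final Set<String> registeredIds = ConcurrentHashMap.newKeySet();

    /** Runs the registration only the first time an ID is seen; returns whether it ran. */
    boolean registerOnce(String id, Runnable registration) {
        if (!registeredIds.add(id)) {
            return false; // ID already seen, skip the registration
        }
        registration.run();
        return true;
    }

    int trackedIds() {
        return registeredIds.size();
    }
}
```

Note that if every request produces a new ID, as the logs suggest, the set itself still grows by one small entry per request, so dropping the check altogether or bounding the set may be the better choice in that case.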

FrodeBjerkholt commented 4 years ago

I have now made a hotfix-4.1.10 branch with a fix for this issue. I have also bumped CXF to 3.3.6 and WSS4J to 2.2.5 in case they have fixed some problems related to the assertion data. Can you try it out before I make a release?

FrodeBjerkholt commented 4 years ago

Strange that you always get a new bus. When I am testing, the same bus is reused.

jonassss commented 4 years ago

Hi, this issue was fixed internally some time ago. Looks like I forgot I had posted it here.

I don't know how to link an issue and a PR on GitHub, but I created a pull request here: https://github.com/difi/Oxalis-AS4/pull/129

aaron-kumar commented 3 years ago

The pull request has been included in the release candidate: https://search.maven.org/search?q=g:network.oxalis%20AND%20a:oxalis-as4. I am closing this ticket now.