RWS / dxa-web-application-java

SDL Digital Experience Accelerator Java Spring MVC web application

Java DXA 2.2 OutOfMemory Error #134

Closed NicholasW-cb closed 3 years ago

NicholasW-cb commented 3 years ago

We have encountered an issue in our production DXA 2.2 web applications: they slowly consume more memory over time until they eventually run into an OutOfMemoryError. At this time we don’t see anything in the webapp’s logs, since everything functions normally right up until the outage, but when the application runs out of memory, our web server outputs the following:

[Servlet Error]-[com.nrg.dxawebapp.corefeatures.service.MBDispatcherServlet]: java.lang.OutOfMemoryError: Java heap space

When checking our heap dumps, we found that the issue is occurring in an instance of ObjectMapper that contains over 400,000 entries (~130MB).

While debugging this issue, we found that the ObjectMapper grows slowly over the course of a few weeks until the application runs out of memory and becomes unresponsive, which results in a production outage.

We dug deeper and found that the issue is occurring on a static member of the com.sdl.web.pca.client.DefaultApiClient class named MAPPER. This static variable is the one that grows. It appears that every time the DefaultApiClient constructor is called, the size of the MAPPER object increases. Our DXA webapps call this constructor many times through the com.sdl.dxa.tridion.pcaclient.DefaultApiClientProvider class, which is used by many other DXA classes to perform their API requests (such as com.nrg.dxawebapp.common.impl.DefaultNrgLocalizationResolver).

A full write-up of the details we found is below, including a quick walkthrough of the code that we traced and debugged in order to identify this issue. An example of a class in the DXA webapp that makes requests to the Services is:

com.nrg.dxawebapp.common.impl.DefaultNrgLocalizationResolver

On the DXA side, classes like this make requests through the com.sdl.dxa.tridion.pcaclient.DefaultApiClientProvider class. In that class, the getClient() method creates a new DefaultApiClient object on each call:

ApiClient client = new DefaultApiClient(graphQLClient, requestTimeout);

If you decompile the com.sdl.web.pca.client.DefaultApiClient class, you will see that it has a static member named MAPPER, and that the constructor creates a SimpleModule and then registers it on MAPPER using the following code:

MAPPER.registerModule(module);

If you check the com.fasterxml.jackson.databind.ObjectMapper class, you will see that the registerModule method should, in theory, not allow duplicate modules to be registered. It has the following code to check for this:

if (isEnabled(MapperFeature.IGNORE_DUPLICATE_MODULE_REGISTRATIONS)) {

If you check the MapperFeature class, you will see that IGNORE_DUPLICATE_MODULE_REGISTRATIONS(true) is enabled by default, so the ObjectMapper class should be ignoring this duplicate registration. However, if you look closer at the ObjectMapper class, you can see that it only ignores duplicate registrations if the type ID of the module is not null:

if (isEnabled(MapperFeature.IGNORE_DUPLICATE_MODULE_REGISTRATIONS)) {
    Object typeId = module.getTypeId();
    if (typeId != null) {

However, the module that is created in the DefaultApiClient constructor is a com.fasterxml.jackson.databind.module.SimpleModule. If you look into the code of this class, you can see that the getTypeId() method always returns null for SimpleModule objects. This means that duplicates are not ignored and are instead registered again on every call. Therefore, the static MAPPER object (which is a member of the DefaultApiClient class) will continue to grow each time DXA's com.sdl.dxa.tridion.pcaclient.DefaultApiClientProvider class creates a new DefaultApiClient object.
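To make the mechanism above easier to follow, here is a minimal, self-contained sketch of the duplicate check as we understand it. It is a simplified model, not Jackson's actual source: ModelModule, PlainModule, TypedModule and ModelMapper are hypothetical stand-ins for Module, a plain SimpleModule, a module with a stable type ID, and ObjectMapper's module bookkeeping.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a Jackson module; NOT the real Module class.
interface ModelModule {
    Object getTypeId();
}

// Models a plain SimpleModule as described above: the type ID is always null.
class PlainModule implements ModelModule {
    @Override
    public Object getTypeId() {
        return null;
    }
}

// Models a module that reports a stable type ID (here, its class name).
class TypedModule implements ModelModule {
    @Override
    public Object getTypeId() {
        return getClass().getName();
    }
}

// Hypothetical stand-in for ObjectMapper's module bookkeeping.
class ModelMapper {
    private final Set<Object> registeredTypeIds = new LinkedHashSet<>();
    private final List<ModelModule> registered = new ArrayList<>();

    void registerModule(ModelModule module) {
        Object typeId = module.getTypeId();
        if (typeId != null) {
            // Duplicate detection only happens on this branch.
            if (!registeredTypeIds.add(typeId)) {
                return; // same type ID seen before, so ignore the module
            }
        }
        // A null type ID skips the check, so every call adds state.
        registered.add(module);
    }

    int size() {
        return registered.size();
    }
}

class DuplicateRegistrationDemo {
    // Registers `count` fresh modules on one mapper and returns how many were kept.
    static int registerRepeatedly(boolean typed, int count) {
        ModelMapper mapper = new ModelMapper();
        for (int i = 0; i < count; i++) {
            mapper.registerModule(typed ? new TypedModule() : new PlainModule());
        }
        return mapper.size();
    }

    public static void main(String[] args) {
        System.out.println("typed modules kept: " + registerRepeatedly(true, 1000));
        System.out.println("plain modules kept: " + registerRepeatedly(false, 1000));
    }
}
```

In this model, modules with a type ID are kept only once, while modules with a null type ID accumulate one entry per call, which matches the growth we see on the static MAPPER.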

The MAPPER object grows each time DXA reaches out to the CD services via the GraphQL API to get things like localizations, pages, components, etc. (We confirmed that the size of this object grows while debugging a running application.) This means that after a DXA webapp has been running for a while and has served a large number of requests, it will eventually run out of memory, throw an exception, and stop responding to requests, causing an outage.
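One possible direction for a fix, offered only as a sketch and not as a description of how the PCA client is or will be implemented, is to perform the registration once when the class is loaded instead of in the constructor. The example below models that idea with plain Java collections; ModelApiClient, MAPPER_MODULES and the module name are hypothetical and do not come from the real DefaultApiClient.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for DefaultApiClient; the real class and its fields may differ.
class ModelApiClient {
    // Stand-in for the shared static MAPPER: each entry models one registered module.
    private static final List<String> MAPPER_MODULES = new ArrayList<>();

    static {
        // Runs once per class load, no matter how many clients are constructed.
        MAPPER_MODULES.add("pca-client-module");
    }

    private final int requestTimeout;

    ModelApiClient(int requestTimeout) {
        // The constructor only stores per-instance state and leaves the shared mapper alone.
        this.requestTimeout = requestTimeout;
    }

    int getRequestTimeout() {
        return requestTimeout;
    }

    static int registeredModuleCount() {
        return MAPPER_MODULES.size();
    }
}

class OneTimeRegistrationDemo {
    // Constructs `clients` instances and returns the shared module count afterwards.
    static int countAfterConstructing(int clients) {
        for (int i = 0; i < clients; i++) {
            new ModelApiClient(5000);
        }
        return ModelApiClient.registeredModuleCount();
    }

    public static void main(String[] args) {
        System.out.println("modules after 10000 clients: " + countAfterConstructing(10000));
    }
}
```

With the registration moved out of the constructor, the shared state in this model stays the same size regardless of how many clients the provider creates.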

We reached out to SDL support for a fix to the underlying PCA API and they requested that we also open an issue here in case it can expedite a resolution.

alebastrov commented 3 years ago

fixed