enonic / lib-cache

Cache Library for Enonic XP.
Apache License 2.0

lib-cache uses up all physical memory on server. #3

Closed drerik closed 6 years ago

drerik commented 6 years ago

We have a lot of installations that "eat" up native memory while heap memory usage is not increasing. As a result, the OS either kills the JVM on memory allocation or refuses to start other processes that need to allocate memory.

For all of customer1's installations, the problem seems to be related to traffic, as their test server does not show the same issue. From what I know, they are using lib-cache to speed up page generation/viewing.

But on the customer2-prod and customer2-test installations we see that memory usage increases every hour on the hour, even on the test environment, which has no visitor traffic. According to the partner that wrote the code for customer2, they import data every hour on the hour and store it in a cache object for 24 hours.

One thing I have learned is that java.nio.ByteBuffer.allocateDirect() allocates buffers outside of the normal heap space. From https://docs.oracle.com/javase/8/docs/api/java/nio/ByteBuffer.html :

... The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance...
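To make the quoted behaviour concrete, here is a minimal sketch (not taken from lib-cache or the customer code) that allocates one direct buffer and compares heap usage with the JVM's "direct" buffer pool as reported by the standard `BufferPoolMXBean`. The class name `DirectBufferProbe` and the 64 MB size are illustrative assumptions only.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not part of lib-cache or Enonic XP.
public class DirectBufferProbe {

    // Bytes currently reported for the JVM's "direct" buffer pool,
    // or -1 if the pool is missing or does not report its usage.
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    // Approximate heap usage as seen by the running JVM.
    static long heapUsed() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long heapBefore = heapUsed();
        long directBefore = directMemoryUsed();

        // Assumed size for the demo: 64 MB allocated outside the garbage-collected heap.
        ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024 * 1024);

        long heapAfter = heapUsed();
        long directAfter = directMemoryUsed();

        System.out.println("buffer.isDirect():       " + buffer.isDirect());
        System.out.println("heap delta (bytes):      " + (heapAfter - heapBefore));
        System.out.println("direct pool delta (bytes): " + (directAfter - directBefore));
    }
}
```

On a typical JVM the direct pool grows by roughly the allocated size while the heap barely moves, which would match the pattern of rising OS memory with flat heap usage. Polling the same MXBean on a running server is one way to check whether direct buffers account for the growth.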

From a Java heap dump I also found references to ByteBuffer[]. (Screenshots: mat_bytebuffer_1, mat_bytebuffer_2)

Enonic XP log from when the cron job ran:

12:00:00.002 INFO  c.e.app.cronjob.runner.JobRunnerImpl - Executing job [no.customer.project:customersync]
12:00:00.003 INFO  no.customer.project - (/jobs/customersync.js) Starting sync from [system]
12:00:02.994 INFO  no.customer.project - (/lib/console.js) Done syncing projects:  {
  "status": 200,
  "body": {
    "syncedNorwegianProjects": {
      "changes": [],
      "hiddenProjects": [],
      "updatedProjects": [],
      "newProjects": []
    },
    "syncedEnglishProjects": {
      "changes": [],
      "hiddenProjects": [],
      "updatedProjects": [],
      "newProjects": []
    }
  }
}

12:00:10.915 INFO  no.customer.project - (/lib/console.js) Done syncing profiles:  {
  "status": 200,
  "body": {
    "syncedNorwegianProfiles": {
      "changes": [],
      "retiredProfiles": [],
      "updatedProfiles": [],
      "newProfiles": []
    },
    "syncedEnglishProfiles": {
      "changes": [],
      "retiredProfiles": [],
      "updatedProfiles": [],
      "newProfiles": []
    }
  }
}

12:00:10.915 INFO  c.e.app.cronjob.runner.JobRunnerImpl - Executed job [no.customer.project:customersync] in 10913 ms

The code that runs this task can be provided on request.

Memory usage on the server and in the JVM: (graphs: os_mem_usage, xp_heap_mem)

Snapshot of grafana data: https://metrics.enonic.io/dashboard/snapshot/BVkDQgqBsotba6TU5VdTpv1F1VP8YmIE

sigdestad commented 6 years ago

So, does this mean that lib-cache is used incorrectly (never cleaned)? Or is this just natural behaviour, and the memory will be reclaimed by the OS?

runarmyklebust commented 6 years ago

Let's verify this by creating a test setup. It should be quite easy to check.

alansemenov commented 6 years ago

@drerik says it's critical, so I'm adding it to the sprint and assigning to @runarmyklebust

runarmyklebust commented 6 years ago

I cannot find any indication that this is related to lib-cache, either by code review or by testing with lots of data. There may of course be something I haven't managed to replicate, but I think the problem lies elsewhere.

sigdestad commented 6 years ago

Makes sense, we'll have to profile this to get better insight then!

alansemenov commented 6 years ago

@runarmyklebust should we close this if it's not lib-cache that is the issue?

runarmyklebust commented 6 years ago

Yes