danfruehauf closed this issue 10 years ago
I would also tag this as critical, but up to you, @pblain.
@dnahodil @jonescc @danfruehauf @pblain @pmbohm @anguss00
I've investigated, but would be interested in others' opinions on how to proceed next.
The log shows java.lang.OutOfMemoryError: Java heap space, beginning at 2014-06-11 20:59:01,625. This caused multiple subsequent exceptions and other issues, and eventually led to a hung app.
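For anyone wanting to confirm the onset time in catalina.out themselves, here is a minimal sketch of how the first OutOfMemoryError could be located. It assumes each log entry starts with a timestamp in the `2014-06-11 20:59:01,625` form quoted above; the function name and the sample lines are hypothetical, not taken from the real log.

```python
import re
from typing import Iterable, Optional

# Assumption: log entries begin with a timestamp such as "2014-06-11 20:59:01,625".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")
OOM_MARKER = "java.lang.OutOfMemoryError"


def first_oom_timestamp(lines: Iterable[str]) -> Optional[str]:
    """Return the timestamp of the first entry that mentions an OutOfMemoryError.

    Stack-trace lines often carry no timestamp of their own, so the most
    recently seen timestamp is remembered and reported instead.
    """
    last_seen = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            last_seen = match.group(1)
        if OOM_MARKER in line:
            return last_seen
    return None


# Hypothetical sample lines, for illustration only.
sample = [
    "2014-06-11 20:58:59,100 INFO  [scanner] starting scan",
    "2014-06-11 20:59:01,625 ERROR [scanner] scan failed",
    "java.lang.OutOfMemoryError: Java heap space",
    "2014-06-11 20:59:05,002 ERROR [portal] request failed",
]
print(first_oom_timestamp(sample))
```

Because the function accepts any iterable of lines, an open file handle for catalina.out can be passed in directly, so the whole log never has to be loaded into memory.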
The hprof, which is a snapshot of memory use at the time of the crash, was opened with Eclipse Memory Analyzer (MAT). Examining the 'dominator tree' and other views shows several problems related to scanner actions:
I believe it's a combination of these memory cycle issues that leads to the hung portal. No single leak is sufficient to trigger it on its own. It also appears to be unrelated to any aggregation operation.
Possible solutions:
The most efficient use of our time would be to use option 1 (if it works) as an interim measure until we implement option 2.
@pblain This just means the portal will crash less frequently, say every 4 weeks instead of 2.
And this may still depend on usage.
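To make the "4 weeks instead of 2" estimate concrete, here is a rough back-of-the-envelope sketch. It assumes the leaked memory grows at a constant rate, which may not hold since it depends on usage. Every number below is hypothetical, not measured from the portal.

```python
def days_until_exhaustion(heap_mb: float, baseline_mb: float, leak_mb_per_day: float) -> float:
    """Days until leaked memory fills the heap headroom, assuming a constant leak rate."""
    if leak_mb_per_day <= 0:
        raise ValueError("leak rate must be positive")
    headroom_mb = max(heap_mb - baseline_mb, 0.0)
    return headroom_mb / leak_mb_per_day


# Hypothetical figures: 500 MB steady-state usage, 100 MB leaked per day.
current = days_until_exhaustion(heap_mb=1900, baseline_mb=500, leak_mb_per_day=100)
larger = days_until_exhaustion(heap_mb=3300, baseline_mb=500, leak_mb_per_day=100)
print(current, larger)
```

Under this model it is the headroom above steady-state usage that sets the interval between crashes, not the total heap size. Doubling the headroom doubles the interval, but the leak is still there.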
I'd agree that, given we are actively working to get rid of the WMS and WFS scanners, we should just throw some more memory at it. I don't think it's worth spending much time trying to fix the memory usage if that code is going to get pulled out soon anyway.
Our current perceived uptime is only about 75%. It would be a different story if the NSP were more reliable, etc., but it's not going to make much of a difference to our uptime if we get a couple more portal freezes before we eliminate the scanners.
Pull request to increase memory: https://github.com/aodn/chef/pull/834
catalina.out and the hprof are in /home/dfruehauf/crash-20140612 on 1-nsp-mel.