aodn / aodn-portal

AODN Open Geospatial Portal
https://portal.aodn.org.au/
GNU General Public License v3.0

Portal crash (server side) #1166

Closed. danfruehauf closed this issue 10 years ago.

danfruehauf commented 10 years ago

catalina.out and hprof are in /home/dfruehauf/crash-20140612 on 1-nsp-mel.
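For anyone reproducing this kind of capture: an hprof taken at the moment of the crash is typically produced by running the JVM with the heap-dump-on-OOM flags, for example in Tomcat's bin/setenv.sh. A minimal sketch follows; the dump path is a placeholder, not the actual 1-nsp-mel layout.

```sh
# Write an .hprof automatically whenever the JVM throws OutOfMemoryError.
# The dump path below is illustrative only.
CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError"
CATALINA_OPTS="$CATALINA_OPTS -XX:HeapDumpPath=/var/tmp/portal-dumps"
```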

danfruehauf commented 10 years ago

I would also tag this as critical, but up to you, @pblain.

julian1 commented 10 years ago

@dnahodil @jonescc @danfruehauf @pblain @pmbohm @anguss00

I've investigated, but would be interested in others' opinions on how to proceed.

The log shows java.lang.OutOfMemoryError: Java heap space beginning at 2014-06-11 20:59:01,625, causing multiple subsequent exceptions and other issues, and eventually leading to a hung app.

The hprof, which is a snapshot of memory use at the time of the crash, was opened with Eclipse Memory Analyzer (MAT). Examining the 'dominator tree' and other views shows several problems related to scanner actions:

  1. 3x 105 MB org.apache.coyote.Request objects, plus other 10 MB objects. All relate to ncWMS and appear to be responses from the IMOS and CSIRO ncWMS servers - the results of WMS GetCapabilities requests posted to the portal by the WMS scanner.
  2. 1x 356 MB raw/undecoded JSON Java string, reachable from GC roots via multiple objects. It appears to be the JSON result of the WFS scanner posting data values from the anmn_burst_avg_timeseries_data layer (a sketch of this failure mode follows below).
  3. 1x 842 MB JSON hash map object. It's a tree of Groovy hash map objects, which means we lose normal introspection ability; however, they appear to be data values for different layer columns, e.g. 975,000 values for TEMP_burst_sd. These are probably all GORM/ORM objects abstracting database access.

I believe it's the combination of these memory cycle issues that leads to the hung portal; no single leak by itself is sufficient to trigger it. The problem also appears to be unrelated to any aggregation operation.
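To make item 2 concrete, here is a minimal Java sketch of the failure mode and of a bounded-memory alternative. This is not the portal's actual code (the scanner is Grails/Groovy); the use of Jackson's streaming parser, the method names, and the URL handling are all illustrative assumptions.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.stream.Collectors;

public class ScannerMemorySketch {

    // Failure mode (items 2 and 3 above): the whole WFS response is held
    // as one heap object, and decoding it afterwards builds a second, even
    // larger tree of map objects. For anmn_burst_avg_timeseries_data that
    // is hundreds of MB retained until the scanner finishes.
    static String slurpWholeResponse(URL wfsUrl) throws Exception {
        try (InputStream in = wfsUrl.openStream();
             BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in))) {
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }

    // Bounded-memory alternative: walk the JSON token stream and aggregate
    // as we go, so heap use stays flat regardless of response size.
    static long countField(URL wfsUrl, String fieldName) throws Exception {
        long count = 0;
        try (InputStream in = wfsUrl.openStream();
             JsonParser parser = new JsonFactory().createParser(in)) {
            while (parser.nextToken() != null) {
                if (parser.getCurrentToken() == JsonToken.FIELD_NAME
                        && fieldName.equals(parser.getCurrentName())) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Whether a streaming rewrite like this is worth the effort, given the scanners are being retired, is exactly the trade-off discussed below.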

Possible solutions:

  1. Increase the JVM heap size, as an interim measure.
  2. Eliminate the WMS and WFS scanners entirely (work already underway).

pblain commented 10 years ago

The most efficient use of our time would be to use option 1 (if it works) as an interim measure until we implement option 2.

danfruehauf commented 10 years ago

@pblain This just means the portal will crash less frequently, say... every 4 weeks instead of every 2.

And even that may depend on usage.

dnahodil commented 10 years ago

I'd agree that, given we are actively working to get rid of the WMS and WFS scanners, we should just throw some more memory at it. I don't think it's worth spending much time fixing the memory usage if that code is going to be pulled out soon anyway.

pblain commented 10 years ago

Our current perceived uptime is only about 75%. It would be a different story if the NSP were more reliable, but it's not going to make much of a difference to our overall uptime if we get a couple more portal freezes before we eliminate the scanners.

julian1 commented 10 years ago

Pull request to increase memory here: https://github.com/aodn/chef/pull/834
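For context, the interim fix amounts to raising Tomcat's heap ceiling. The real values are in the linked chef PR; a generic setenv.sh equivalent, with illustrative numbers only, would look like this:

```sh
# Illustrative values only; the actual change is in the aodn/chef PR above.
CATALINA_OPTS="$CATALINA_OPTS -Xms512m -Xmx2048m"
```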