ghost opened this issue 9 years ago
It happened again today
I'm looking at the catalina log. This morning's problem may have been due to last night's OutOfMemoryError. I thought the GC just runs in a single thread, so I'm not sure why I saw CPU usage in the 1000%s, which indicates multiple CPUs being utilized.
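A side note on the GC question: if we want to confirm whether the collector is what's burning CPU next time, the JVM exposes this through its GarbageCollectorMXBeans. A minimal sketch (the `GcStats` class name is just for illustration; it would need to run inside the Tomcat JVM or be adapted to a remote JMX connection):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // Print each collector's name, how many collections it has run, and the
        // total time spent collecting. Note that the default collectors on
        // recent JVMs use several GC worker threads, so heavy GC alone can
        // account for multi-core CPU usage.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```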
I don't see a similar memory error from yesterday. However, I suppose it's possible that the exception may not get logged since that's a critical issue.
This issue does seem related to memory (even though at first it looked like runaway CPU utilization). Java's default max heap size on that machine is over 20G, and when I saw the problem, 26.5% of system memory was being used (around the full size of the heap). So the heap probably filled up, and the GC was probably taking over trying to free memory (again, I'm not sure why CPU usage was in the 1000%s).
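Along the same lines, a quick way to confirm the heap is actually full (rather than inferring it from top's memory percentage) is to compare used heap against the configured maximum. A rough sketch, assuming we can run it in the same JVM:

```java
public class HeapUsage {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);
        // If used stays close to max, the heap is effectively exhausted and the
        // collector ends up spinning trying to reclaim space.
        System.out.printf("heap used: %d MB of %d MB max%n", usedMb, maxMb);
    }
}
```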
Next time this happens, let's do a heap dump (jmap) to see what's leaking memory. It takes a while to dump the full heap, but it should help us debug this problem.
Usually when I see an OOM error, I fix it by increasing max heap size (assuming no leak). In this case there seems to be a leak.
Here's the OutOfMemoryError from the catalina logs:
23:05:09.855 [4010] INFO edu.pitt.sis.exp.colfusion.dal.databaseHandlers.DatabaseHandlerBase - Query 'SELECT Country
FROM Mortality_Bombay,India_1916- 21785378181
' resulted in 8478 rows
Exception in thread "4009" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.HashMap.resize(HashMap.java:703)
at java.util.HashMap.putVal(HashMap.java:628)
at java.util.HashMap.put(HashMap.java:611)
at edu.pitt.sis.exp.colfusion.dal.dataModels.tableDataModel.Row.getValuesFromCids(Row.java:57)
at edu.pitt.sis.exp.colfusion.dal.dataModels.tableDataModel.Row.getByTransformation(Row.java:43)
at edu.pitt.sis.exp.colfusion.bll.similarityJoins.NestedLoopSimilarityJoin.join(NestedLoopSimilarityJoin.java:58)
at edu.pitt.sis.exp.colfusion.bll.relationships.ColumnToColumnDataMatchingProcess.execute(ColumnToColumnDataMatchingProcess.java:153)
at edu.pitt.sis.exp.colfusion.bll.process.ProcessBase.run(ProcessBase.java:174)
at java.lang.Thread.run(Thread.java:745)
03:51:27.899 [3985] INFO edu.pitt.sis.exp.colfusion.bll.process.ProcessManager - About to the delete process 3985 from the list of running processes because the process is failed. Actual status of the process is false
"I suppose it's possible that the exception may not get logged since that's a critical issue"
By that I mean that it's critical enough that everything might break, including logging.
"It takes awhile to dump the full heap"
Generally it doesn't, but I'm assuming it will be over 20G, in which case I imagine it will take more than a few minutes.
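One more thought on the dump: if we add -XX:+HeapDumpOnOutOfMemoryError to the Tomcat JVM options, the dump would be written automatically the next time it dies, so we don't have to catch the problem live with jmap. For completeness, here's a sketch of triggering a dump from inside the JVM (e.g. from a debug endpoint); the `HeapDumper` class name is made up, and this assumes a HotSpot JVM since it uses the com.sun diagnostic bean:

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {
    public static void dump(String path) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // live=true dumps only reachable objects (forces a full GC first);
        // the resulting .hprof can be opened in Eclipse MAT or VisualVM.
        bean.dumpHeap(path, true);
    }
}
```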
Meanwhile I will try to look at the code and see if I can spot a problem.
Last time I encountered something similar, it was due to a library retaining a reference to some object that the caller did not think it would retain a reference to. More specifically, we were using the library to operate on many strings, and every string we gave it was being added to some global hash table, so eventually the heap would fill up.
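Roughly the pattern, as a made-up illustration (none of these class or method names come from our code or that library):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical library class that "helpfully" caches every input it sees.
class StringProcessor {
    private static final Map<String, String> CACHE = new HashMap<>();

    static String normalize(String s) {
        // The caller assumes s is used and discarded, but the static map keeps
        // a reference, so none of these strings can ever be garbage collected.
        return CACHE.computeIfAbsent(s, String::toLowerCase);
    }
}

class Caller {
    public static void main(String[] args) {
        // Feeding it many distinct strings grows the map until the heap fills
        // up and the JVM eventually throws "GC overhead limit exceeded".
        for (long i = 0; ; i++) {
            StringProcessor.normalize("row-" + i);
        }
    }
}
```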
I went to the Colfusion site and it was very slow. I couldn't get to the "Published Data" page.
top reported that the Tomcat Java process was using 2044% CPU and 26.5% memory.
I tried debugging while the problem existed, but couldn't figure it out. Using strace didn't reveal an excess of system calls (just a single futex call), and jstack didn't show anything unusual. It looked like most threads were blocked, so I'm not sure what was consuming the CPU. The stack dump is attached.
I restarted Tomcat to fix the problem.
Attachment: tomcat-stack-trace-2015-10-27.txt
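Since jstack alone didn't show what was eating the CPU, one more thing we could try next time is to ask the JVM directly for per-thread CPU time and match the busiest threads against the jstack output. A sketch (assumes thread CPU time measurement is supported and enabled on that JVM, and that this runs inside the same process or over JMX):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BusyThreads {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (long id : threads.getAllThreadIds()) {
            long cpuNanos = threads.getThreadCpuTime(id); // -1 if unsupported
            ThreadInfo info = threads.getThreadInfo(id);
            if (info != null && cpuNanos > 0) {
                // Threads with large (and growing) CPU time are the ones to
                // look up by name/id in the jstack dump.
                System.out.printf("%s (id=%d): %d ms CPU%n",
                        info.getThreadName(), id, cpuNanos / 1_000_000);
            }
        }
    }
}
```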