Long-duration server metrics fail to work as expected

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Run DT with increased-duration server metrics. I used
java -jar rbnb.jar -m ,,1000000  
2. Let it run for a day or so
3. Attempt to view 1 day's worth of metrics in RDV.

What is the expected output? What do you see instead?
I expect a plot of metrics. It fails (screenshot attached) and the server log 
shows this:

<23-May-2008 PDT 16:21:14.020> <RDVMetadata@dyn137-110-115-182.ucsd.edu>
   Started for sink running V3.0 build 2774 from Fri Jun 29 06:55:09 PDT 2007.
OutOfMemoryError servicing request.  Recovering...
Exception in thread "Timer-0" java.lang.OutOfMemoryError: Java heap space
        at java.util.Vector.<init>(Vector.java:111)
        at java.util.Vector.<init>(Vector.java:124)
        at java.util.Vector.<init>(Vector.java:133)
        at com.rbnb.api.ThreadWithLocks.<init>(ThreadWithLocks.java:55)
        at com.rbnb.api.TimerTask$TimerDaemon.<init>(TimerTask.java:850)
        at com.rbnb.api.TimerTask.run(TimerTask.java:439)
        at com.rbnb.api.IndirectTimerTask.run(IndirectTimerTask.java:135)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
Exception in thread "_SRH.RDV@dyn137-110-115-182.ucsd.edu" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2734)
        at java.util.Vector.clone(Vector.java:627)
        at com.rbnb.api.StreamListener.isAlive(StreamListener.java:1493)
        at com.rbnb.api.StreamRequestHandler.run(StreamRequestHandler.java:591)
        at java.lang.Thread.run(Thread.java:637)
<23-May-2008 PDT 16:21:43.306> <RDV@dyn137-110-115-182.ucsd.edu>
   java.lang.IllegalStateException: Insufficient memory to service request.
        at com.rbnb.api.StreamRBOListener.processWorking(StreamRBOListener.java:1278)
        at com.rbnb.api.StreamRBOListener.run(StreamRBOListener.java:1571)
        at java.lang.Thread.run(Thread.java:637)
<23-May-2008 PDT 16:21:43.312> <RCO RDV@dyn137-110-115-182.ucsd.edu (com.rbnb.api.TCPRCO@3a580bb3)>
   java.lang.OutOfMemoryError: Java heap space
   java.lang.Exception: Traceback (may not reflect location of error)
        at com.rbnb.api.Log.addError(Log.java:463)
        at com.rbnb.api.RCO.run(RCO.java:2293)
        at java.lang.Thread.run(Thread.java:637)

Original issue reported on code.google.com by hubb...@sdsc.edu on 23 May 2008 at 11:29

Attachments:

[Picture 5.png](https://storage.googleapis.com/google-code-attachments/dataturbine/issue-5/comment-0/Picture 5.png)

GoogleCodeExporter commented 9 years ago

Looks like requesting a day's worth of metrics ran the server out of memory.
Did you specify the Java heap size at startup? I suggest you run the test again
with more memory specified at startup; for instance:

java -Xmx1024m -jar rbnb.jar

Original comment by john.wil...@erigo.com on 27 May 2008 at 12:34

GoogleCodeExporter commented 9 years ago

Trying this now with 3.1b4a and 512M of memory.

Original comment by phubb...@gmail.com on 30 May 2008 at 5:13

GoogleCodeExporter commented 9 years ago

Paul,

Has this been resolved to your satisfaction?

Bill

Original comment by enfield....@gmail.com on 22 Aug 2008 at 8:16

GoogleCodeExporter commented 9 years ago

I am still looking into this. Sorry for the delay; I will make time before the
end of this month (8/2008).

Original comment by millermj1@comcast.net on 22 Aug 2008 at 9:01

GoogleCodeExporter commented 9 years ago

I have researched and characterized this problem, and implemented a partial
solution.

The cause of this problem is that the built-in RBNB metrics are (very) memory
inefficient. A request for a day's worth (86400 points per channel at 1 Hz) of
the 6 metrics channels takes an (amazing) ~600MB of internal memory to process.
Hence the out-of-memory exception as reported.
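
As a rough check on those figures, the per-point cost can be estimated with a short sketch. This is only back-of-the-envelope arithmetic from the numbers quoted above; it assumes "~600MB" means decimal megabytes and that the memory is spread evenly across points, neither of which has been measured.

```python
# Back-of-the-envelope estimate from the figures quoted in this comment.
# Assumptions (not measured): 600MB is decimal megabytes, cost is uniform per point.
SECONDS_PER_DAY = 24 * 60 * 60   # 86400 points per channel at 1 Hz
CHANNELS = 6                     # metrics channels in the request
TOTAL_BYTES = 600 * 10**6        # ~600MB reported for the request


def bytes_per_point(total_bytes, points_per_channel, channels):
    """Average memory cost of a single metrics data point."""
    return total_bytes / (points_per_channel * channels)


points = SECONDS_PER_DAY * CHANNELS
per_point = bytes_per_point(TOTAL_BYTES, SECONDS_PER_DAY, CHANNELS)
print(f"{points} points, roughly {per_point:.0f} bytes per point")
```

That works out to over a kilobyte per metrics point, which is why the storage format looked like the place to start.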

There seem to be several reasons for this, which I now understand to varying
extents. One reason, which I fixed for about a factor of 4 reduction in memory
use, is that every metrics data point was being stored with both a timestamp
and a "duration" to specify the interval between it and the prior point. I
changed this to use zero-durations (for an implied interval), which saves not
only in-memory space but archive space as well. This seems to work fine, but
please test it.

There are other things that could be done to further reduce memory storage. One
would be to look at storing multiple metrics points per data frame. Another
would be to get better timestamp compression, i.e. store time intervals over
many points. There are probably other inefficiencies to be discovered as well.
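
To illustrate the timestamp-compression idea, here is a minimal sketch that collapses evenly spaced timestamps into (start, interval, count) runs. It is not how RBNB stores time internally; the function names and the run tuple layout are hypothetical and only show the general technique.

```python
def compress_timestamps(times, tol=1e-9):
    """Collapse runs of evenly spaced timestamps into (start, interval, count)."""
    runs = []
    i = 0
    n = len(times)
    while i < n:
        start = times[i]
        if i + 1 == n:
            # A single trailing point has no interval to infer.
            runs.append((start, 0.0, 1))
            break
        interval = times[i + 1] - times[i]
        count = 2
        # Extend the run while the spacing stays equal to the first interval.
        while (i + count < n
               and abs((times[i + count] - times[i + count - 1]) - interval) <= tol):
            count += 1
        runs.append((start, interval, count))
        i += count
    return runs


def expand_timestamps(runs):
    """Rebuild the full timestamp list from (start, interval, count) runs."""
    out = []
    for start, interval, count in runs:
        out.extend(start + k * interval for k in range(count))
    return out


# One day of 1 Hz metrics timestamps.
day = [float(t) for t in range(86400)]
runs = compress_timestamps(day)
print(f"{len(day)} timestamps -> {len(runs)} run(s)")
```

With regularly sampled metrics, a whole day of timestamps reduces to a single run, so the time information costs almost nothing compared with storing one timestamp per point.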

Operationally, another way to reduce memory requirements would be to use a
metrics sampling interval greater than the default 1 second, e.g. store metrics
once per 10 seconds, or once per 60 seconds. This would give a 10x or 60x memory
savings, respectively. To do this, add an interval spec to the metrics
arguments, e.g. java -jar rbnb.jar -m10,,1000000

Thus, with the code change plus a metrics interval increase from 1 to 10
seconds, a 40x reduction in memory would be realized. The reported failure case
would drop from about 600MB to 15MB of memory used, and thus most likely work
fine.
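
The arithmetic behind that estimate can be sketched as follows. It assumes memory scales linearly with the number of stored points and that the factor-of-4 saving from the code change applies uniformly; both are approximations taken from the figures above, not measurements.

```python
def estimated_memory_mb(baseline_mb, code_fix_factor, interval_s, baseline_interval_s=1):
    """Estimate memory for a one-day metrics request.

    Assumes memory scales linearly with the number of points, so sampling
    every `interval_s` seconds divides memory by interval_s / baseline_interval_s.
    """
    interval_factor = interval_s / baseline_interval_s
    return baseline_mb / (code_fix_factor * interval_factor)


BASELINE_MB = 600      # reported cost of the failing request
CODE_FIX_FACTOR = 4    # approximate saving from zero-duration timestamps

for interval in (1, 10, 60):
    mb = estimated_memory_mb(BASELINE_MB, CODE_FIX_FACTOR, interval)
    print(f"interval {interval:>2}s: about {mb:g} MB")
```

Under these assumptions the code change alone brings the request to about 150MB, a 10-second interval to about 15MB, and a 60-second interval to about 2.5MB.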

Original comment by millermj1@comcast.net on 27 Aug 2008 at 2:53

GoogleCodeExporter commented 9 years ago

Fix seems to be working and solving the problem.

Original comment by phubb...@gmail.com on 4 Oct 2008 at 8:38

Changed state: Verified
