RamUsageEstimator.NUM_BYTES_ARRAY_HEADER and other constants are incorrect [LUCENE-3867]

asfimport commented 12 years ago

RamUsageEstimator.NUM_BYTES_ARRAY_HEADER is computed like that: NUM_BYTES_OBJECT_HEADER + NUM_BYTES_INT + NUM_BYTES_OBJECT_REF. The NUM_BYTES_OBJECT_REF part should not be included, at least not according to this page: http://www.javamex.com/tutorials/memory/array_memory_usage.shtml

A single-dimension array is a single object. As expected, the array has the usual object header. However, this object head is 12 bytes to accommodate a four-byte array length. Then comes the actual array data which, as you might expect, consists of the number of elements multiplied by the number of bytes required for one element, depending on its type. The memory usage for one element is 4 bytes for an object reference ...

While on it, I wrote a sizeOf(String) impl, and I wonder how do people feel about including such helper methods in RUE, as static, stateless, methods? It's not perfect, there's some room for improvement I'm sure, here it is:

    /**
     * Computes the approximate size of a String object. Note that if this object
     * is also referenced by another object, you should add
     * {`@link` RamUsageEstimator#NUM_BYTES_OBJECT_REF} to the result of this
     * method.
     */
    public static int sizeOf(String str) {
        return 2 * str.length() + 6 // chars + additional safeness for arrays alignment
                + 3 * RamUsageEstimator.NUM_BYTES_INT // String maintains 3 integers
                + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER // char[] array
                + RamUsageEstimator.NUM_BYTES_OBJECT_HEADER; // String object
    }

If people are not against it, I'd like to also add sizeOf(int[] / byte[] / long[] / double[] ... and String[]).

Migrated from LUCENE-3867 by Shai Erera (@shaie), resolved Mar 23 2012 Attachments: LUCENE-3867.patch (versions: 19), LUCENE-3867-3.x.patch, LUCENE-3867-compressedOops.patch

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Reflection don't need to cost you much if you make a cache along the way. Retrieving an object's class is virtually zero cost so this would make it very efficient and the number of classes in the system is much smaller than the number of objects so it shouldn't be a problem.

to find the maximum field offset we don't need to go to superclasses.

I can't imagine a situation where this wouldn't be the case although an assertion here would be nice just to make sure we're not assuming something that isn't true.

I will take a closer look at the patch again this evening and do some testing/ API flexibility based on what I have in my project. Will report on the results.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Reflection don't need to cost you much if you make a cache along the way. Retrieving an object's class is virtually zero cost so this would make it very efficient and the number of classes in the system is much smaller than the number of objects so it shouldn't be a problem.

Would be like the reflection cache in AttributeSource :-) But yes I was also thinking about a second IdentityHashMap<Class<?>,Long> along the way.

I can't imagine a situation where this wouldn't be the case although an assertion here would be nice just to make sure we're not assuming something that isn't true.

Thats already checked in the test, who has 2 subclasses, one with no additional fields (size must be identical) and one with 2 more fields (should be >=).

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

Wow, what awesome improvements you guys have added !

Uwe, +1 to commit. I unassigned myself - you and Dawid definitely deserve the credit!

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Modified method naming convention: any sizeOf is "deep", shallowSizeOf* is "shallow". Methods in RUE are now static; didn't hide the constructor though (maybe we should?).

More comments in a minute.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Thanks for cleanup!

didn't hide the constructor though (maybe we should?).

We must. Class is final and has no instance methods -> useless to have ctor. Also as previous versions in 3.x allowed instances, we should prevent this to fix incorrect usage.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I've played with the code a bit and I've been trying to figure out a way to determine empirically "how far off" is the estimation from real life usage. It's not easy because RUE itself allocates memory (and not small quantities in case of complex object graphs!). I left these experiments in StressRamUsageEstimator; it is a test case – maybe we should add @Ignore and rename it to Test*, don't know.

Anyway, the allocation seems to be measured pretty accurately. When tlabs are disabled this is a result of allocating small byte arrays for example:

 committed           max        estimated(allocation)
      2 MB     48.4 MB    16 bytes
    1.7 MB     48.4 MB    262.4 KB
      2 MB     48.4 MB    524.6 KB
    2.2 MB     48.4 MB      787 KB
    2.5 MB     48.4 MB        1 MB
    2.7 MB     48.4 MB      1.3 MB
      3 MB     48.4 MB      1.5 MB
    3.3 MB     48.4 MB      1.8 MB
....
   46.9 MB     48.4 MB     45.6 MB
   47.1 MB     48.4 MB     45.9 MB
   47.4 MB     48.4 MB     46.1 MB
   47.6 MB     48.4 MB     46.4 MB
   47.9 MB     48.4 MB     46.6 MB
   48.1 MB     48.4 MB     46.9 MB

So it's fairly ideal (committed memory is all committed memory so I assume additional data structures, classes, etc. also count in).

Unfortunately it's not always so smooth, for example jrockit's mx beans seem not to return the actual memory allocation state (and if they do, I don't understand it):

 committed           max        estimated(allocation)
   29.4 MB       50 MB    16 bytes
   29.8 MB       50 MB    262.5 KB
   30.2 MB       50 MB    524.9 KB
   30.4 MB       50 MB    787.3 KB
   30.8 MB       50 MB        1 MB
   31.1 MB       50 MB      1.3 MB
   31.4 MB       50 MB      1.5 MB
   31.7 MB       50 MB      1.8 MB
     32 MB       50 MB        2 MB
   32.4 MB       50 MB      2.3 MB
   32.7 MB       50 MB      2.6 MB
   33.1 MB       50 MB      2.8 MB
   33.5 MB       50 MB      3.1 MB
   33.8 MB       50 MB      3.3 MB
   34.2 MB       50 MB      3.6 MB
   34.5 MB       50 MB      3.8 MB
   34.8 MB       50 MB      4.1 MB
   35.2 MB       50 MB      4.4 MB
   35.5 MB       50 MB      4.6 MB
   35.7 MB       50 MB      4.9 MB
   36.2 MB       50 MB      5.1 MB
   36.4 MB       50 MB      5.4 MB
...
   49.6 MB       50 MB     47.6 MB
     50 MB       50 MB     47.9 MB
   49.6 MB       50 MB     48.2 MB
   49.9 MB       50 MB     48.4 MB

A snapshot from 32 bit HotSpot:

...
   25.5 MB     48.4 MB     24.7 MB
   25.7 MB     48.4 MB     24.9 MB
   25.9 MB     48.4 MB     25.1 MB
   26.1 MB     48.4 MB     25.3 MB
   26.3 MB     48.4 MB     25.5 MB
   26.5 MB     48.4 MB     25.7 MB
   26.7 MB     48.4 MB     25.9 MB
   26.8 MB     48.4 MB     26.1 MB
     27 MB     48.4 MB     26.4 MB
   27.2 MB     48.4 MB     26.6 MB
   27.4 MB     48.4 MB     26.8 MB
   27.7 MB     48.4 MB       27 MB
...

I see two problems that remain, but I don't think they're urgent enough to be addressed now:

the stack easily overflows if the graph of objects has long chains. This is demonstrated in the test case (uncomment ignore annotation).
there is a fair amount of memory allocation going on in the RUE itself. If one knows the graph of an object's dependencies is a tree then the memory cost could be decreased to zero (because we wouldn't need to remember which objects we've seen so far).
we could make RUE an object again (resign from static methods) and have a cache of classes and class-fields to avoid reflective accesses over and over. If one performed estimations over and over then such a RUE instance would have an initial cost, but then would be running smoother.

Having said that, I'm +1 for committing this in if you agree with the changes I've made (I will be a pain in the arse about that naming convention discriminating between shallow vs. deep sizeOf though :).

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Oh, one thing that springs to my mind is that we could have an automatically generated class with nested static classes with a random arrangement of all sorts of fields (in all sorts of configurations) and use a similar empirical benchmark to the one I did on small byte arrays but on these objects. This would show if we're estimating object field offsets and sizes correctly.

I wouldn't go into deep object structures though – I've tried this and it's hard to tell what the allocation is and what the overhead/ noise of measurement is.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi,

I am fine with the patch for now, changed in this patch:

Hidden ctor
Cleaned up test to use static import consequently

The stress test is an testcase, but not automatically executed (you have to explicitely do that with -Dtestcase=...). I think thats wanted, right? Otherwise we should rename, but its also noisy and slow.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Final patch: I removed some code duplication and improved exception handling for the reflection while iterating class tree. Simply suppressing is a bad idea, as the resulting size would be underdetermined.

I will commit this later this evening and then backport to 3.x.

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

Thanks Uwe !

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Javadocs fixes.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Looks good to me, thanks Uwe.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed trunk revision: 1302133

I will now backport with deprecations and add CHANGES.txt later!

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Patch for 3.x including backwards layer (deprecated instances of RUE + string interning support). MemoryModels were nuked completely (will add comment to backwards changes).

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed 3.x revision: 1302152

CHANGES.txt committed in revisions: 1302155 (3.x), 1302156 (trunk)

Thanks Dawid and Shai!

Dawid and I will now look into donating this masterpiece to maybe Apache Commons Lang or similar, as it's of general use.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I've been experimenting a bit with the new code. Field offsets for three classes in a hierarchy with unalignable fields (byte, long combinations at all levels). Note unaligned reordering of byte field in JRockit - nice.

JVM: [JVM: HotSpot, Sun Microsystems Inc., 1.6.0_31] (compressed OOPs)
`@12`  4 Super.superByte
`@16`  8 Super.subLong
`@24`  8 Sub.subLong
`@32`  4 Sub.subByte
`@36`  4 SubSub.subSubByte
`@40`  8 SubSub.subSubLong
`@48`    sizeOf(SubSub.class instance)

JVM: [JVM: HotSpot, Sun Microsystems Inc., 1.6.0_31] (normal OOPs)
`@16`  8 Super.subLong
`@24`  8 Super.superByte
`@32`  8 Sub.subLong
`@40`  8 Sub.subByte
`@48`  8 SubSub.subSubLong
`@56`  8 SubSub.subSubByte
`@64`    sizeOf(SubSub.class instance)

JVM: [JVM: J9, IBM Corporation, 1.6.0]
`@24`  8 Super.subLong
`@32`  4 Super.superByte
`@36`  4 Sub.subByte
`@40`  8 Sub.subLong
`@48`  8 SubSub.subSubLong
`@56`  8 SubSub.subSubByte
`@64`    sizeOf(SubSub.class instance)

JVM: [JVM: JRockit, Oracle Corporation, 1.6.0_26] (64-bit JVM!)
`@ 8`  8 Super.subLong
`@16`  1 Super.superByte
`@17`  7 Sub.subByte
`@24`  8 Sub.subLong
`@32`  8 SubSub.subSubLong
`@40`  8 SubSub.subSubByte
`@48`    sizeOf(SubSub.class instance)

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Thanks for the insight.

When thinking about the reordering, I am a littel bit afraid about the "optimization" in the shallow sizeOf(Class<?>). This optimiaztion does not recurse to superclasses, as it assumes, that all field offsets are greater than those of the superclass, so finding the maximum does not need to recurse up (so it early exits).

This is generally true (also in the above printout), but not guaranteed. E.g. JRockit does it partly (it reuses space inside the superclass area to locate the byte from the subclass). In the above example still the order of fields is always Super-Sub-SubSub, but if the ordeing in the JRockit example would be like:

`@ 8`  1 Super.superByte
`@ 9`  7 Sub.subByte
`@16`  8 Super.subLong
`@24`  8 Sub.subLong
`@32`  8 SubSub.subSubLong
`@40`  8 SubSub.subSubByte
`@48`    sizeOf(SubSub.class instance)

The only thing the JVM cannot change is field offsets between sub classes (so the field offset of the superclass is inherited), but it could happen that new fields are located between super's fields (see above - it's unused space). This would also allow casting and so on (it's unused space in superclass). Unfortunately with that reordering the maximum field offset in the subclass is no longer guaranteed to be greater.

I would suggest that we remove the "optimization" in the shallow class size method. It's too risky in my opinion to underdetermine the size, because the maximum offset in the subclass is < the maximum offset in the superclass.

I hope my explanation was understandable... :-)

Dawid, what do you thing, should we remove the "optimization"? Patch is easy.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I hope my explanation was understandable...

Perfectly well. Yes, I agree, it's possible to fill in the "holes" packing them with fields from subclasses. It would be a nice vm-level optimization in fact!

I'm still experimenting on this code and cleaning/ adding javadocs – I'll patch this and provide a complete patch once I'm done, ok?

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

OK. All you have to remove is the if (fieldFound && useUnsafe) check and always recurse. fieldFound itsself can also be removed.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

JRockit could even compress like this, it would still allow casting as all holes are solely used by one sub-class:

`@ 8`  1 Super.superByte
`@ 9`  1 Sub.subByte
`@10`  6 SubSub.subSubByte
`@16`  8 Super.subLong
`@24`  8 Sub.subLong
`@32`  8 SubSub.subSubLong
`@40`    sizeOf(SubSub.class instance)

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Maybe it does such things already. I didn't check extensively.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We have to remove the shallow size optimization in 3.x and trunk.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I confirmed that this packing indeed takes place. Wrote a pseudo-random test with lots of classes and fields. Here's an offender on J9 for example (Wild{inheritance-level}{field-number}):

`@24`  4 Wild_0_92.fld_0_0_92
`@28`  4 Wild_0_92.fld_1_0_92
`@32`  4 Wild_0_92.fld_2_0_92
`@36`  4 Wild_0_92.fld_3_0_92
`@40`  4 Wild_0_92.fld_4_0_92
`@44`  4 Wild_0_92.fld_5_0_92
`@48`  4 Wild_0_92.fld_6_0_92
`@52`  4 Wild_2_5.fld_0_2_5
`@56`  8 Wild_1_85.fld_0_1_85
`@64`  8 Wild_1_85.fld_1_1_85
`@72`    sizeOf(Wild_2_5 instance)

HotSpot and JRockit don't seem to do this (at least it didn't fail on the example).

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Thanks, in that case shallowSizeOf(Wild_2_5.class) would incorrectly return 56 because of the short-circuit - so let's fix this.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Yep, that assumption was wrong – indeed:

WildClasses.Wild_2_5 wc = new WildClasses.Wild_2_5();
wc.fld_6_0_92 = 0x1122;
wc.fld_0_2_5 = Float.intBitsToFloat(0xa1a2a3a4);
wc.fld_0_1_85 = Double.longBitsToDouble(0xb1b2b3b4b5b6b7L);
System.out.println(ExpMemoryDumper.dumpObjectMem(wc));

results in:

0x0000 b0 3d 6f 01 00 00 00 00 0e 80 79 01 00 00 00 00
0x0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0030 22 11 00 00 a4 a3 a2 a1 b7 b6 b5 b4 b3 b2 b1 00
0x0040 00 00 00 00 00 00 00 00

And you can see they are reordered and longs are aligned.

I'll provide a cumulative patch of changes in the evening, there's one more thing I wanted to add (cache of fields) because this affects processing speed.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Ok, I admit J9 is fascinating... ;) How much memory does this take?

class X {
  byte a = 0x11;
  byte b = 0x22;
}

Here is the memory layout:

[JVM: IBM J9 VM, 2.6, IBM Corporation, IBM Corporation, 1.7.0]
0x0000 00 b8 21 c4 5f 7f 00 00 00 00 00 00 00 00 00 00
0x0010 11 00 00 00 22 00 00 00
`@16`  4 Super.b1
`@20`  4 Super.b2
`@24`    sizeOf(Super instance)

I don't think I screwed up anything. It really is 4 byte alignment on all fields.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Don't be scared by the size of this patch – it contains a lot of generated code in WildClasses.

Improvements:

size estimation is not recursive (which led to stack overflows quite easily on more complex object graphs),
decreased memory consumption by using a custom impl. of an identity object set.
added a cache of resolved class information (ref. fields, shallow size).
removed the optimization of counting only subclass field offsets because fields can be packed (J9).
added more verbose information about unsupported JVM features. J9 doesn't have the MX bean for example (and does dump this).

The above changes also speed up the entire processing.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Added a test case for identity has set, removed constants, removed wild classes.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I think the patch is now fine! I will commit it later and backport to 3.x.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Thanks Uwe. I'll be working in the evening again but if you're faster go ahead and commit it in.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed trunk revision: 1304485, 1304513, 1304564 Committed 3.x revision: 1304565

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I've been thinking how one can assess the estimation quality of the new code. I came up with this:

I allocate an Object[] half the size of estimated maximum available RAM (just to make sure all objects will fit without the need to reallocate),
I precompute shallow sizes for instances of all "wild classes" (classes with random fields, including arrays).
I then fill in the "vault" array above with random instances of wild classes, summing up the estimated size UNTIL I HIT OOM.
Once I git OOM I know how much we actually allocated vs. how much space we thought we did allocate.

The results are very accurate on HotSpot if one is using serial GC. For example:

[JVM: Java HotSpot(TM) 64-Bit Server VM, 20.4-b02, Sun Microsystems Inc., Sun Microsystems Inc., 1.6.0_29]
Max: 483.4 MB, Used: 698.9 KB, Committed: 123.8 MB
Expected free: 240.9 MB, Allocated estimation: 240.8 MB, Difference: -0.05% (113.6 KB)

If one runs with a parallel GC things do get out of hand because the GC is not keeping up with allocations (although I'm not sure how I should interpret this because we only allocate; it's not possible to free any space – maybe there are different GC pools or something):

[JVM: Java HotSpot(TM) 64-Bit Server VM, 20.4-b02, Sun Microsystems Inc., Sun Microsystems Inc., 1.6.0_29]
Max: 444.5 MB, Used: 655.4 KB, Committed: 122.7 MB
Expected free: 221.5 MB, Allocated estimation: 174.2 MB, Difference: -21.34% (47.3 MB)

JRockit:

[JVM: Oracle JRockit(R), R28.1.4-7-144370-1.6.0_26-20110617-2130-windows-x86_64, Oracle Corporation, Oracle Corporation, 1.6.0_26]
Max: 500 MB, Used: 3.5 MB, Committed: 64 MB
Expected free: 247.7 MB, Allocated estimation: 249.5 MB, Difference: 0.74% (1.8 MB)

I think we're good. If somebody wishes to experiment, the spike is here: https://github.com/dweiss/java-sizeof

mvn test
mvn dependency:copy-dependencies
java -cp target\classes:target\test-classes:target\dependency\junit-4.10.jar \
  com.carrotsearch.sizeof.TestEstimationQuality

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

For historical records: the previous implementation of RamUsageEstimator was off by anything between 3% (random size objects, including arrays) to 20% (objects smaller than 80 bytes). Again – these are "perfect scenario" measurements with empty heap and max. allocation until OOM, with a serial GC. With a concurrent and parallel GCs the memory consumption estimation is still accurate but it's nearly impossible to tell when an OOM will occur or how the GC will manage the heap space.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

That's true. But you can still get the "unreleaseable allocation", so the size of the non-gc-able object graph. If GC does not free the objects after release fast-enough, it will still do it once memory gets low. But the allocated objects with hard refs are not releaseable.

So I think it's fine for memory requirement purposes. If you want real heap allocation, you must use instrumentation.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I didn't say it's wrong – it is fine and accurate. What I'm saying is that it's not really suitable for predictions; for answering questions like: how many objects of a given type/ types can I allocate before an OOM hits me? It doesn't really surprise me that much, but it would be nice. For measuring already allocated stuff it's more than fine of course.

apache / lucene

RamUsageEstimator.NUM_BYTES_ARRAY_HEADER and other constants are incorrect [LUCENE-3867] #4940