Closed by GoogleCodeExporter 9 years ago
Benchmark attached here...
Original comment by travis.d...@gmail.com
on 27 Jan 2013 at 5:45
You can uncomment the block above setUp(), and comment out the corresponding
block right above that to try the List[] case (second set of results above).
Original comment by travis.d...@gmail.com
on 27 Jan 2013 at 5:46
Wow, thanks. Coincidentally, we recently started some internal discussions
about the value of methods like EmptyImmutableList.isEmpty(). Your benchmark
numbers are beyond my worst fears (20x??). Further, your reasoning shows that a
solution would have to do more than just remove method overrides: It would
actually have to remove implementation classes. And, as you point out, there
are more of these than Empty+Singleton+Regular. (See also *Immutable*AsList.)
For this reason, List in particular may end up being a lost cause, but this
deserves more thought. Maybe we can hide all the "weird" implementations behind
a ForwardingImmutableList that delegates to
non-ImmutableList-but-actually-immutable delegates? Or maybe we should focus
our attention on immutable classes with fewer implementations
(ImmutableMultimap)?
The distressing part of this is that the primary goal of the multiple "normal"
ImmutableList implementations is to save memory, and there's only so much we
can do to preserve the memory savings while eliminating subclasses. I like your
suggestion of eliminating EmptyImmutableList (with the caveat that the
existence of other ImmutableList implementations may still be a megamorphism
problem). Another way of doing that is to use a RegularImmutableList for empty
lists.
The overall size-space tradeoff is a hard one to evaluate, so I don't know
exactly what we'll do here, but SingletonImmutableList, for one, looks tenuous.
Thanks for supplying the numbers that have been lacking in our evaluation so
far.
Original comment by cpov...@google.com
on 28 Jan 2013 at 4:38
Please try @Param-ing numLists with other values too. These show the same
tendency, but higher numLists values show less and less difference compared to
100. Also, the singleton list may be an actual performance gain with higher
values of numLists.
I actually don't know how to interpret that, so I'm just giving a lead here.
Original comment by ogregoire
on 28 Jan 2013 at 5:12
ImmutableList does seem likely to always have many subclasses -- mostly the
ImmutableAsLists.
Chris, I'm not sure I follow how ForwardingImmutableList would work, or would
save anything?
I think we could safely eliminate EmptyImmutableList and its brethren -- as
long as we maintain a singleton holding the empty ImmutableList instance, it
will only require a constant amount of memory for the whole VM.
That said, I'm not 100% convinced the benchmark is valid, if only because the
List interface itself is likely to have more implementations used than just
ArrayList and ImmutableList...? Most implementations will be List references,
not ArrayList-specific references, so I'd expect most operations on a basic
List to be polymorphic calls.
Original comment by lowas...@google.com
on 28 Jan 2013 at 5:53
> Chris, I'm not sure I follow how ForwardingImmutableList would work, or
> would save anything?
We can sacrifice performance on uncommon implementations by making the common
call bimorphic:
ImmutableList has two subclasses, RegularImmutableList and
ForwardingImmutableList. RegularImmutableList is used for 0-, many-, and
perhaps 1-element lists. ForwardingImmutableList is used for everything else,
including ImmutableAsList. ForwardingImmutableList is a wrapper around a plain
(though immutable) List. Any other "ImmutableList implementations" are actually
List implementations that get wrapped in a ForwardingImmutableList.
> That said, I'm not 100% convinced the benchmark is valid, if only because
> the List interface itself is likely to have more implementations used than
> just ArrayList and ImmutableList...? Most implementations will be List
> references, not ArrayList-specific references, so I'd expect most operations
> on a basic List to be polymorphic calls.
While we have been encouraging people to store references in fields of type
ImmutableList, we can be sure that that advice isn't always followed. In any
case, it would be nice if the people who did follow the advice could get a
performance boost out of it.
But there may be only so much gain from any of these approaches. It may be that
the right takeaway from all this is that our attempts to subclass ImmutableList
for speed are fruitless because, even if we could eliminate the entire 90ns
cost of "the method itself," the percent performance increase would be in the
single digits. Despite years of advice that the immutable implementations are
probably more efficient than the JDK classes, we've avoided writing these
benchmarks. (Well, at least we have some for Set.)
Original comment by cpov...@google.com
on 28 Jan 2013 at 6:28
To be clear, I'm arguing that the benchmark *overestimates* the speed of
ArrayList. I also have to admit that I'd've assumed that if the
ForwardingImmutableList procedure you describe worked, then the JIT would
already have implemented a similar technique to deal with polymorphic calls
when certain implementations were much more common than others.
Original comment by lowas...@google.com
on 28 Jan 2013 at 6:34
Given the discussion so far, I think there is a bit of a misunderstanding about
what makes things slow, so here's a quick summary based on the comments above,
and then the gory details follow in the next post.
> Most implementations will be List
> references, not ArrayList-specific references, so I'd expect most operations
> on a basic List to be polymorphic calls.
The important thing for performance isn't whether the "proven" type of the
reference has subclasses, but whether multiple implementations have actually
been seen in practice _at some call site_. In the language of my post to
follow, CHA doesn't apply if List has multiple implementations, but "inline
caching" does, as long as only one implementation is seen in practice, for each
call site. That's why ArrayList is fast: one class returned by
Lists.newArrayList(), and ImmutableList is slow: multiple implementations
returned in practice (for minSize in [0, 1] - but the 2 implementation case is
still optimized well).
This is shown clearly by the second benchmark - the type is now List<>, but
ArrayList and ImmutableList (for length > 1) are still very fast (a 0.3 ns
penalty per call, since CHA isn't in play). The massive penalty incurred by
the ImmList cases that see more than one implementation is still being paid,
and the relative cost is even greater (20x now vs 10x) since interface types
are being used.
Declaring fields as ImmutableList vs List actually does give a performance
boost, but only because ImmutableList is a class, and not an interface, so it
can use the faster vtable dispatch vs the interface dispatch. That's secondary
to the issue here - even virtual dispatch is very slow compared to inlined &
fully optimized access.
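To make the distinction concrete, here is a self-contained sketch (plain JDK classes standing in for the Guava types, since the point is receiver-class counts, not the library): the same size() call site is monomorphic when fed one concrete class and megamorphic when fed three.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.Vector;

public class CallSiteMorphism {
    // One call site: the JIT profiles which concrete receiver classes
    // reach this size() invocation during interpretation.
    static int totalSize(List<?>[] lists) {
        int total = 0;
        for (List<?> list : lists) {
            total += list.size(); // the call site of interest
        }
        return total;
    }

    static List<Integer> sample() {
        return Arrays.asList(1, 2, 3);
    }

    public static void main(String[] args) {
        // Monomorphic: every receiver is an ArrayList, so the size() call
        // site sees one class and can be devirtualized and inlined.
        List<?>[] mono = new List<?>[100];
        for (int i = 0; i < mono.length; i++) {
            mono[i] = new ArrayList<Integer>(sample());
        }

        // Megamorphic: three receiver classes at the same call site, so
        // after warm-up the JIT must emit a full interface dispatch.
        List<?>[] mega = new List<?>[100];
        for (int i = 0; i < mega.length; i++) {
            if (i % 3 == 0) {
                mega[i] = new ArrayList<Integer>(sample());
            } else if (i % 3 == 1) {
                mega[i] = new LinkedList<Integer>(sample());
            } else {
                mega[i] = new Vector<Integer>(sample());
            }
        }

        // Same answer either way; only the dispatch cost differs.
        System.out.println(totalSize(mono) + " " + totalSize(mega)); // 300 300
    }
}
```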
> Maybe we can hide all the "weird" implementations behind a
ForwardingImmutableList
> that delegates to non-ImmutableList-but-actually-immutable delegates? Or
maybe we
> should focus our attention on immutable classes with fewer implementations
(ImmutableMultimap)?
As above, the issue isn't total implementations, but total implementations
likely to be seen at one call site. I focused on the types returned by
ImmutableList.of() because three different implementations can be transparently
returned by .of(), completely undocumented, so you can get different concrete
classes even if you did your best to avoid it. Such instances are very likely
to end up at the same call site, and cause poor performance, compared to the
other implementations, which will often never appear at any call site (see the
next part, for an important clarification about "call site").
The other implementations, while common in some cases, only arise as the result
of distinct java call sequences, so are likely to be limited to many fewer
_effective_ call sites (the difference between java-level call sites and
effective call sites after inlining is critical, so see the last part of the
next post).
Original comment by travis.d...@gmail.com
on 30 Jan 2013 at 6:17
Here are the gory details about JIT method inlining & optimization in general.
When I say "JIT" here, I mean hotspot, but I don't think it is going to vary
too much across other state-of-the-art JITs out there - or at least they'll
usually be worse in specific cases, not better. This should help you when you
are analyzing the tradeoffs to make.
Unlike, say, C++, every non-final instance method in Java is potentially virtual
- that is, polymorphic dispatch is the default. Naively implemented, these
calls are brutal for performance - partly because the dispatch requires some
indirection but *mostly* because they aren't inlined. Once a method isn't
inlined you lose all sorts of optimizations involving the caller and callee, so
it is really the "granddaddy" optimization that enables the rest of the
optimization family to kick in for hot code with method calls. It's worth
noting that the fast cases for my benchmark take between 0.6 ns and 0.9 ns per
call, which is exactly 2 or 3 cycles (3.33 GHz box), which is pretty amazing
given that there are a dozen bytecodes or more involved in each loop. The JIT
has done a good job!
Because inlining is so important and virtual methods are so common,
a key optimization for the JIT is "devirtualization" - taking a method which is
polymorphic in theory and compiling it as if the actual implementation was
known, including inlining it. It does this through two very different
mechanisms, with different behaviors and different implications for class
hierarchy design, which I will describe below:
Class Hierarchy Analysis (CHA):
This is the trick that makes the final keyword on methods mostly useless for
performance, despite lots of out-of-date advice to the contrary (it's still
useful as an enforced restriction at compile time). If a method m() is declared
final on class C, the compiler *knows* that for an object of type C or a
subclass (but not a superclass) a call to m() will always resolve to the single
final implementation, so it can be fully inlined and optimized.
Most methods, however, are not declared final, but are _effectively_ final
since they have no overrides in the currently loaded set of classes. Since the
JVM knows, at all times, the complete set of loaded classes, it can perform
this "is effectively final" analysis for each method. If m() has no overrides
in any loaded class, it can be optimized just as if it was final.
The set of loaded classes can change, of course, so some CHA-based assumptions
which were valid when some code was compiled may become invalid later. This
is handled through deoptimization when a class is loaded that violates earlier
CHA assumptions. The deoptimization always occurs at a safepoint, so it
piggy-backs on the existing safepoint polling mechanisms and basically means
that CHA is "free" from the point of view of the compiled code - there is no
trace of the fact that the CHA could become invalid in the generated assembly,
other than safepoint polling which is required regardless.
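A hypothetical illustration of the "effectively final" case (class names invented for the example): increment() below is not declared final, but until a class overriding it is loaded, CHA lets the JIT devirtualize and inline it; loading the subclass invalidates that assumption VM-wide.

```java
// Not final, and increment() is not final either -- but as long as no
// loaded class overrides increment(), CHA treats it as if it were final:
// calls are devirtualized and inlined with no guard in the generated code.
class Counter {
    private int n;

    int increment() {
        return ++n;
    }
}

// The moment a class like this is loaded, the CHA assumption for
// Counter.increment() is invalidated everywhere: compiled code that
// inlined it is deoptimized at the next safepoint and recompiled.
class CounterSubclass extends Counter {
    @Override
    int increment() {
        return super.increment() + 100;
    }
}
```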
OK, so the key points about CHA based devirtualization:
1) It applies to the entire VM, per method - either some method can be
devirtualized via CHA everywhere, or nowhere (actually it could be
per-classloader, but the general idea that it's a global optimization is the
same).
2) As a consequence of (1), if any class is loaded that overrides a method, CHA
for that method is invalid everywhere a method is called, even if some call
site never actually sees the overridden classes.
3) CHA works on a *method* basis - if you have a class with M methods, and a
subclass that overrides N of them (N < M), CHA is still applicable to the
non-overridden methods.
CHA is great when it works, but it is pretty useless for this issue. As Louis
points out above, interfaces like List will *always* have multiple loaded
implementations, so CHA will almost never apply for List. It can still apply
when the type of the object can be proven to be more specific than List, which
is what occurs in the first benchmark, so a concrete List implementation, like
ArrayList, may be subject to CHA if no class extends it (as in the benchmark),
but in most large projects you are probably going to have someone who extends
it (BTW, this is a good reason to make performance-sensitive classes final).
The only time CHA shows up in the benchmarks above is the "fast case" in the
first benchmark, with all lists of size 3 or 4. ArrayList gets 60 ns, while
ImmList gets 90 ns, the difference being because of CHA which works for
ArrayList, but not for ImmList, which has multiple loaded (but not used, except
perhaps internally in guava) implementations. In the second benchmark, where
the array type is List rather than ArrayList/ImmutableList, CHA doesn't apply
(since the call is against List, which does have multiple implementations). So
CHA isn't big-picture important here, because it only comes into play in the
60 ns vs 90 ns difference (which is 0.3 ns per call, since 100 lists are
involved - 0.3 ns is exactly one cycle).
Next, on to the optimization that is important here: inline caching.
This optimization applies when CHA can't - that is, when a method has overrides
in loaded classes. So it applies all the time for things like List, which
always have multiple implementations. In this case, the JIT takes advantage of
the fact that even though overrides may exist, at any particular call site (a
call site being a possibly inlined method invocation in the JIT'd assembly)
there is often exactly one implementing class in practice. For ArrayList this
is always true in my benchmark, and it is true for the minsize 3,4 case for
ImmList. Take a look at your codebase (not guava, though, since you don't know
who is calling your methods!) for say 10 random calls of List.get() or
List.size() and trace back all the possible origins of the List in question -
you'll often find that they'll always be a certain type.
What the JIT does here is look at the profiling information gathered during
the interpreted phase, and if only 1 or 2 types of objects have been seen at
that call site, optimistically assumes that the type of the object will be
those 1 or 2 types (monomorphic and bimorphic cases). This allows it to
inline/optimize the method pretty much in the same way as CHA, above. The
difference is that the assumption might be wrong, and there is no classloading
event & safepoint poll that can be hooked to guard this case, so the JIT
actually needs to insert a check in the generated assembly to ensure the type of the
object is one of the assumed type(s), immediately before the actual code of the
method. If this check fails, the method is de-optimized and recompiled with
the new information. At most this can occur twice: mono -> bi, then bi -> mega.
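Conceptually, the emitted guard looks something like this (pseudocode, not actual HotSpot output):

```
// shape of a monomorphic inline cache at one call site:
if (receiver.klass == CachedClass) {
    // inlined, fully optimized body of CachedClass.method()
} else {
    call_runtime_stub();  // patch the cache (mono -> bi), or give up and
                          // recompile the caller with a megamorphic dispatch
}
```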
In the benchmarks above, this is the reason for the 90 ns ImmList times vs the
60 ns ArrayList times - ArrayList was using CHA, while ImmList needed an inline
cache, and the extra check & branch cost 1 cycle. Still extremely fast. It's
also why the ImmList minsize=1 case takes 107 ns vs 90 ns for ImmList - we can
still use inline caches, because we are bimorphic (the singleton ImmList and
regular ImmList), but we need an extra check to cover the two cases (the branch
is well predicted because the pattern of lists is completely regular, but if it
wasn't this case would suck too).
When you have more than 2 implementations observed at a call site, however, the
JIT gives up - it doesn't inline the method and must compile an expensive
virtual method call. If the underlying type is a class, this involves some
indirection and parameter marshaling, and if it is an interface, involves that,
plus a linear search through the implemented interfaces to find the right one
(because the class hierarchy is a tree, but the interface + class hierarchy is
a DAG). This case is the ~1170 ns and ~2060 ns ImmList cases. The call site
has > 2 implementations of ImmList (or List in the second benchmark) so it is
megamorphic not only in practice, but in theory. The second benchmark is 2x as
bad as the first because the proven type of the receiver (object on which the
method is called) is List, an interface, not ImmutableList, an (abstract)
class, so it has to pay the additional interface search cost.
The key points about inline caches:
1) It applies only when CHA doesn't - that is, when there are multiple
implementations of a class/interface. More precisely, when a method is called
where there are multiple implementations of that particular method in
subclasses of the _proven_ type of the receiver.
2) It applies on a per-call site basis. Every call site is treated
independently by the above. So the optimization can apply in many cases for
some method n(), while failing at some particular call sites of n() where it
happens that multiple receivers are observed. So global program behavior
won't affect, in many cases, particular call sites.
3) It applies with 1 or 2 receiver types, not more (CHA applies with exactly 1
type).
4) It is based on the type of the object, not the specific method.
Of the points above, (2) and (4) are particularly important here.
(2) means that most of the discussion above about how many implementations of a
class/interface exist is probably wrongly aimed - it's not how many exist, but
how many actual instances of those implementations appear at a particular site.
In the benchmark, ArrayList is always actually an ArrayList because
Lists.newArrayList always returns an ArrayList. The ImmutableList, however,
even though they all originate from the same factory method,
ImmutableList.of(), vary in their most-derived types! This is why I focused on
the types returned by ImmutableList.of() - even though other specializations of
ImmutableList exist, they are returned only in specific cases and involve a
different call path from a usual creation (subList and reverse calls). On the
other hand, 3 different implementations can be transparently returned by .of(),
completely undocumented, so you can get different concrete classes even if you
did your best to avoid it. Such instances are very likely to end up at the same call
site, even in high performance code, and cause poor performance, compared to
the other implementations, which will often never appear at any call site (see
the next part, for an important clarification about "call site").
The above makes clear that "call sites" are critically important. What is a
distinct call site? A sufficient condition is usually an actual call of a java
method in the java code:
    static <E> boolean isInEither(List<? super E> list1, List<? super E> list2, E elem) {
        return list1.contains(elem) || list2.contains(elem);
    }
However, this isn't a necessary condition. For example, the method above
contains two java-level call sites for contains() on the same line. This
method is extremely simple, probably less than 35 bytecodes and certainly less
than 350, so it will generally be inlined into any callers. This
means that the contains calls (which may or may not be inlined, but it doesn't
matter here) will actually appear at each call site of the isInEither() method,
meaning that different invocations of the method don't really interfere with
each other. If one method wants to use ArrayLists, and one method passes
LinkedLists, no problem, they don't de-optimize each other's inline caches
because due to inlining the "effective" call site is different - much more
fine-grained than the (single) call site in the java code. This type of thing is
usually true for performance sensitive code: inlining means that the effective
call site is moved up the call stack to a method that is too large to be
inlined, which for hot code is around ~300 bytecodes. The exact behavior
depends on JVM arguments and can probably only be exactly determined with
-XX:+PrintCompilation.
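The isInEither() example above can be fleshed out to show the effect (the caller names are invented, and the claim about which copy sees which receiver assumes both callers are hot enough to be compiled with isInEither() inlined):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

public class EffectiveCallSites {
    // Tiny (well under the ~35-bytecode threshold), so it is inlined into
    // each compiled caller; its contains() call sites are duplicated per caller.
    static <E> boolean isInEither(List<? super E> list1, List<? super E> list2, E elem) {
        return list1.contains(elem) || list2.contains(elem);
    }

    // After inlining, this caller's private copy of the contains() call
    // site only ever sees ArrayList receivers...
    static boolean inArrayLists(int x) {
        List<Integer> a = new ArrayList<>(Arrays.asList(1, 2));
        List<Integer> b = new ArrayList<>(Arrays.asList(3, 4));
        return isInEither(a, b, x);
    }

    // ...while this caller's copy only ever sees LinkedLists. Neither
    // caller degrades the other's inline cache, despite sharing the
    // single java-level call site inside isInEither().
    static boolean inLinkedLists(int x) {
        List<Integer> a = new LinkedList<>(Arrays.asList(1, 2));
        List<Integer> b = new LinkedList<>(Arrays.asList(3, 4));
        return isInEither(a, b, x);
    }
}
```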
(4) means that unlike CHA, what matters for inline caches is only whether you
have seen multiple receiver classes at the call site, not whether those classes
actually override the called method. For example, imagine you have a subclass
of ArrayList which doesn't override any methods but implements the Comparable
interface (it doesn't matter why you have a subclass; I'm just trying to come
up with a realistic reason). For your new class, every method in List is
implemented by the ArrayList method, not by anything of yours. However, if you
mix both at a call site, the call will be bimorphic, not monomorphic, under
inline caching. Under CHA, a similar situation would be monomorphic, because
CHA can tell that even though there is a subclass, it doesn't override the
method of interest. The way the inline cache is implemented, however, is as a
simple single-word check of the expected class object versus the actual class
object - this happens in one cycle, as above. There is no general way to
efficiently implement the more complicated question "does the receiver override
method x", and the non-general ways probably aren't interesting enough that the
JIT has implemented them yet.
The key part of (4) is that unless CHA is in play you have to talk about types
- whether they have subclasses or not, and not whether methods have overrides
or not. This is a big downside of inline caches - even if you can design a
memory-efficient subclass, which overrides very few methods, you'll pay the
price on every method call of the superclass, not just on those that you
override.
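A sketch of the scenario described in (4), with an invented TaggedArrayList that overrides no List methods:

```java
import java.util.ArrayList;
import java.util.List;

public class NonOverridingSubclass {
    // Hypothetical subclass that overrides no List methods; it only adds
    // an unrelated capability (Comparable).
    static class TaggedArrayList<E> extends ArrayList<E>
            implements Comparable<TaggedArrayList<E>> {
        TaggedArrayList(List<E> contents) {
            super(contents);
        }

        @Override
        public int compareTo(TaggedArrayList<E> other) {
            return Integer.compare(size(), other.size());
        }
    }

    static int sumSizes(List<?>[] lists) {
        int total = 0;
        for (List<?> list : lists) {
            // The inline cache here keys on the receiver's exact class, so
            // mixing ArrayList and TaggedArrayList makes this call bimorphic,
            // even though both dispatch to the same ArrayList.size() code.
            // CHA reasons per *method* and would not be fooled -- but CHA
            // can't apply once List has multiple loaded implementations.
            total += list.size();
        }
        return total;
    }
}
```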
Original comment by travis.d...@gmail.com
on 30 Jan 2013 at 6:54
All that said, if I were building this myself, here are the choices I would make:
Two concrete implementations of ImmutableList:
RegularImmutableList (the current implementation)
ForwardingImmutableList (similar to the suggestion above, but clarified below)
OtherImmutableList can be implemented exactly as a delegate, like:

    class OtherImmutableList<E> extends ImmutableList<E> {
        final OtherImmutableListDelegate<E> delegate;

        @Override
        public int size() {
            return delegate.size();
        }
    }
I would implement the size 0 and 1 cases of ImmutableList in terms of
RegularImmutableList - size 1 not being special at all, and size 0 always
returning a shared empty-array instance of RegularImmutableList (so being very
fast and having zero memory cost).
The benefit of this is that it keeps the common case - regular immutable
lists, even of size 0 and 1 - monomorphic. You'll likely find this faster than
the current implementation when size-0 lists are mixed with non-zero lists,
yet it uses no more memory (unless 1-element lists are common), since the
0-element list is a singleton. That is, I disagree with ogregoire's suggestion
that you'll find Empty faster at higher numLists values - this won't happen
except in extreme cases where the entire working set can't fit in L3 and the
empty list avoids one cache miss (it doesn't need to examine .length).
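A minimal sketch of that proposal, with invented Mini* names rather than Guava's real classes: one concrete class serves every size, and the empty case is a shared instance.

```java
// Minimal sketch (hypothetical names, not Guava's actual code): the empty
// list is just a shared size-0 instance of the one regular class, so call
// sites mixing empty and non-empty lists stay monomorphic.
abstract class MiniImmutableList<E> {
    public abstract int size();

    public abstract E get(int index);
}

final class RegularMiniImmutableList<E> extends MiniImmutableList<E> {
    // One VM-wide instance: constant memory cost, no extra class.
    private static final RegularMiniImmutableList<Object> EMPTY =
            new RegularMiniImmutableList<Object>(new Object[0]);

    private final Object[] elements;

    RegularMiniImmutableList(Object[] elements) {
        this.elements = elements;
    }

    @SuppressWarnings("unchecked") // safe: the empty list holds no elements
    public static <E> MiniImmutableList<E> of() {
        return (MiniImmutableList<E>) EMPTY;
    }

    @Override
    public int size() {
        return elements.length;
    }

    @SuppressWarnings("unchecked")
    @Override
    public E get(int index) {
        return (E) elements[index];
    }
}
```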
For the unusual case, things will be about as fast as possible. As long as
only 1 "other" implementation occurs at a call site, you'll end up with a
bimorphic inline cache: two class checks and a branch. I'm just guessing here
(not online), but something like 150-200 ns vs 90 ns for the monomorphic case.
Branch prediction will help a lot here, since unlike the JIT's profile it is
dynamic for the life of the process and can recognize patterns. If you have
mostly Regular ImmLists mixed with a few "Other" lists (of any type), you'll
get close to 90 ns, since the JIT will set up the branches right, and branch
prediction will get it right most of the time.
Original comment by travis.d...@gmail.com
on 30 Jan 2013 at 7:09
Sorry, above OtherImmutableList == RegularImmutableList, made a last second
edit that I can't fix now.
Original comment by travis.d...@gmail.com
on 30 Jan 2013 at 7:10
Thanks for all the details. I'm bookmarking this for future reference.
When it comes to other implementations, my concern is less with reverse() and
other oddballs and more with RegularImmutable*As*List, which is what you get
from calling ImmutableList.copyOf(immutableSetOfMultipleElements). I'm
envisioning a constructor that accepts an Iterable and calls copyOf() on it.
Some callers will pass an ImmutableSet and get RegularImmutableAsList; others
will pass other types and get RegularImmutableList.
I have no idea how common this is. That's the source of my fears that
ForwardingImmutableList may be common. (Even if it is, will the objects that
end up holding RegularImmutableList and the objects that end up holding
RegularImmutableAsList end up being the same "call sites," or will inlining
separate them? That's even harder to say. It's plausible, though.)
Having two types is, of course, a problem addressed by your solution of
RegularImmutableList+ForwardingImmutableList. And now, by eliminating
SingletonImmutableList, we're making tradeoffs between memory and CPU. Hmm.
Original comment by cpov...@google.com
on 30 Jan 2013 at 3:31
tl;dr
https://wikis.oracle.com/display/HotSpotInternals/PerformanceTechniques
Original comment by gak@google.com
on 30 Jan 2013 at 4:31
@cpovirk: I'm fairly convinced that making sure the "common case" -- a vanilla
ImmutableList -- is monomorphic should be our primary priority?
Original comment by lowas...@google.com
on 30 Jan 2013 at 5:46
Yeah I guess the ImmutableAsList stuff can definitely cause issues too, but as
you point out it all comes down to inlining: if the methods of the class that
takes Iterable in the constructor are small enough to inline into callers,
different users of that class shouldn't be impacted too much.
I see the behavior of ImmutableList.copyOf() as worse in the plain iterable
case, however - you can pass in a series of identically typed Iterables,
varying only by length, and get different implementations back.
At least in the ImmutableSetAsList case, the caller shares some responsibility
because he is passing in different types as input in the first place.
Original comment by travis.d...@gmail.com
on 30 Jan 2013 at 6:11
@lowasser, keep in mind that this whole issue is a tiny slice of the larger
performance problem. We have a benchmark (that needs review) that shows
certain characteristics for _runtime_ performance for a certain invocation
pattern. In general, we've made the choice to prioritize memory consumption
over CPU for both the raw # of bytes in the heap and the overhead of GC. We
haven't even started the discussion about CPU cost relative to any possible
memory implications.
Basically, I think this is a great discussion that draws the conclusion that
"all things being equal, monomorphic calls produce faster code than megamorphic
calls". I agree. The JDK developers agree. We just need to make sure that
we're keeping the larger performance picture in mind.
Making ImmutableList monomorphic shouldn't be our "primary priority" as it is a
means, not an end. Making certain calls monomorphic that are likely to be
invoked in certain patterns is as good of a strategy as any other provided
informed decisions are made about the trade-offs.
Original comment by gak@google.com
on 30 Jan 2013 at 7:06
Well, at least the case of EmptyImmutableList isn't a case of favoring memory
use over CPU, since it's a singleton. So at least there the intent must have
been a faster specialization, since all methods can be hardcoded.
That seems like a straightforward win to remove - no increase in memory use
(ok, a few bytes process-wide) and it should be a bit faster in practice,
except in degenerate cases like a huge number of empty lists.
Size 1 case is definitely more subtle - I'd be curious if your internal
profiling shows these make up a large amount of aggregate memory use.
Original comment by travis.d...@gmail.com
on 31 Jan 2013 at 4:30
This issue has been migrated to GitHub.
It can be found at https://github.com/google/guava/issues/<id>
Original comment by cgdecker@google.com
on 1 Nov 2014 at 4:13
Original issue reported on code.google.com by
travis.d...@gmail.com
on 27 Jan 2013 at 5:39