DaveAKing / guava-libraries

Automatically exported from code.google.com/p/guava-libraries

Guava ImmutableList (and others) offer awful performance in some cases due to size-optimized specializations #1268

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Many of the guava Immutable collections have a cute trick where they have 
specializations for zero-element (EmptyImmutableList) and one-element 
(SingletonImmutableList) collections. These specializations take the form of 
subclasses of ImmutableList, alongside the "Regular" implementation and a few 
other specializations like ReverseImmutableList, SubList, etc.
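
For concreteness, a quick illustration (using the class names above) of how the 
factory method transparently picks one of these subclasses:

import com.google.common.collect.ImmutableList;

class OfReturnTypes {
  void demo() {
    ImmutableList<String> a = ImmutableList.of();          // an EmptyImmutableList
    ImmutableList<String> b = ImmutableList.of("x");       // a SingletonImmutableList
    ImmutableList<String> c = ImmutableList.of("x", "y");  // a RegularImmutableList
    System.out.println(a.getClass() + " " + b.getClass() + " " + c.getClass());
  }
}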

Unfortunately, the result is that when these subclasses mix at some call site, 
the call is megamorphic, and performance is awful compared to classes without 
these specializations (worse by a factor of 20 or more).

Here's a simple benchmark that shows ImmutableList vs ArrayList - given that a 
regular ImmutableList is essentially a simple ArrayList, without the need for 
double bounds checks on the upper bound, you'd expect it to be at least comparable:

    benchmark minSize     ns linear runtime
    ArrayList       0   60.9 =
    ArrayList       1   60.6 =
    ArrayList       2   60.6 =
    ArrayList       3   60.7 =
ImmutableList       0 1169.0 ==============================
ImmutableList       1  107.4 ==
ImmutableList       2   90.7 ==
ImmutableList       3   90.7 == 

The benchmark just calls size() repeatedly on 100 ArrayLists or ImmutableLists. 
 The sizes of the lists are evenly distributed in [minSize, 4].
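
As a rough standalone sketch of what the measured loop does (my own 
illustration with invented names, not the attached Caliper code; note that 
iterating via List-typed references corresponds to the second set of results 
further down, while the attachment uses ArrayList[]/ImmutableList[] arrays for 
the first set):

import java.util.ArrayList;
import java.util.List;
import com.google.common.collect.ImmutableList;

public class SizeBenchmarkSketch {
  public static void main(String[] args) {
    int minSize = 0; // the @Param from the results table, in [0, 3]
    List<List<Integer>> lists = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
      int size = minSize + i % (5 - minSize); // sizes evenly distributed in [minSize, 4]
      List<Integer> contents = new ArrayList<>();
      for (int j = 0; j < size; j++) {
        contents.add(j);
      }
      lists.add(ImmutableList.copyOf(contents)); // or new ArrayList<>(contents) for the baseline
    }
    long sum = 0;
    for (int rep = 0; rep < 10_000_000; rep++) {
      for (List<Integer> list : lists) {
        sum += list.size(); // the call site whose dispatch cost is being measured
      }
    }
    System.out.println(sum); // keep the result live so the loop isn't dead-code eliminated
  }
}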

You can see that when all lists have at least 2 elements, performance is 
comparable (~91 ns vs ~61 ns) - the difference here being attributable to 
CHA: in the ArrayList case the compiler can prove that the ArrayList class is 
effectively final, and can avoid the inline type check (so even in the >= 2 
element case, the specializations hurt).

With 1 element lists present, the call is bimorphic, so it can still be inlined 
and optimized, but the extra check costs a bit.

With 0 element lists, performance tanks.  The call is megamorphic and can't be 
well optimized.  The cost of the call is ~20 times worse than ArrayList.

The penalty applies to every List<> call, not just size().

In the above benchmark, the type of the array of List was ArrayList[] and 
ImmutableList[] respectively.

If you change that to be List[] in both cases, the performance degrades even 
more:

    benchmark minSize     ns linear runtime
    ArrayList       0   90.6 =
    ArrayList       1   90.8 =
    ArrayList       2   90.7 =
    ArrayList       3   90.8 =
ImmutableList       0 2061.3 ==============================
ImmutableList       1  115.2 =
ImmutableList       2   90.6 =
ImmutableList       3   90.7 =

CHA isn't in play now, because the type of the array is List, which has 
multiple implementations, so ArrayList degrades to ~91 ns, just like 
ImmutableList.

The worst-case ImmutableList performance has been cut in half, however, since 
now the call is a megamorphic invokeinterface rather than an invokevirtual, ugh.

This kind of scenario is not at all far-fetched in real-world code - it is 
natural to have lists of length 0 and 1 (which is probably why these 
specializations were created in the first place), and it is not unusual to find 
them mixed at a single call site - often in a hot method that takes a List<> as 
input. Unlike something like branch prediction, the worst-case behavior sticks 
permanently once the specializations have *ever* been seen at the call 
site.  Even if 0- or 1-element lists are very uncommon, once you see one of 
each, you are hosed until you restart the VM (at least in every JIT that I know 
of).

The benchmark doesn't even test the worst case - the pattern of list sizes is 
totally regular, so the branches inherent in the bimorphic and megamorphic 
dispatch will be well predicted.  Randomize it and it will get worse 
(especially the bimorphic case, because it is pretty fast and so has a lot 
further to fall).

The same issue could occur with SubList and perhaps ReverseImmutableList also, 
although probably less frequently, and there may be no good alternative there.

A reasonable compromise here would be to keep only the Singleton list 
implementation, use a global static instance of it for the 0-element case as 
well (with a null element), and special-case all the methods to do the right 
thing for the empty instance.  This would get you most of the memory and GC 
benefit of the current implementation (only a slight increase for 0-element 
lists), while making the worst case bimorphic, which isn't too bad.
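
A minimal sketch of that compromise, with hypothetical names (since 
ImmutableList rejects null elements, null can double as the empty marker):

final class SingletonListSketch<E> {
  // One shared instance stands in for every 0-element list.
  static final SingletonListSketch<Object> EMPTY = new SingletonListSketch<>(null);

  final E element; // null marks the shared empty instance (real elements are never null)

  SingletonListSketch(E element) {
    this.element = element;
  }

  int size() {
    return element == null ? 0 : 1;
  }

  E get(int index) {
    if (index < 0 || index >= size()) {
      throw new IndexOutOfBoundsException("index: " + index + ", size: " + size());
    }
    return element;
  }
}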

Alternately, you could keep only the 0-element case, and fold 1-element lists 
into the general case.  This would increase memory use for 1-element lists by 
2.5x (16 bytes vs 40 bytes on HotSpot, according to my back-of-the-napkin 
calcs), and probably have somewhat worse runtime performance for heavy use of 
1-element lists.

Benchmark attached.

Original issue reported on code.google.com by travis.d...@gmail.com on 27 Jan 2013 at 5:39

Attachments:

GoogleCodeExporter commented 9 years ago
Benchmark attached here...

Original comment by travis.d...@gmail.com on 27 Jan 2013 at 5:45

Attachments:

GoogleCodeExporter commented 9 years ago
You can uncomment the block above setUp(), and comment out the corresponding 
block right above that to try the List[] case (second set of results above).

Original comment by travis.d...@gmail.com on 27 Jan 2013 at 5:46

GoogleCodeExporter commented 9 years ago
Wow, thanks. Coincidentally, we recently started some internal discussions 
about the value of methods like EmptyImmutableList.isEmpty(). Your benchmark 
numbers are beyond my worst fears (20x??). Further, your reasoning shows that a 
solution would have to do more than just remove method overrides: It would 
actually have to remove implementation classes. And, as you point out, there 
are more of these than Empty+Singleton+Regular. (See also *Immutable*AsList.) 
For this reason, List in particular may end up being a lost cause, but this 
deserves more thought. Maybe we can hide all the "weird" implementations behind 
a ForwardingImmutableList that delegates to 
non-ImmutableList-but-actually-immutable delegates? Or maybe we should focus 
our attention on immutable classes with fewer implementations 
(ImmutableMultimap)?

The distressing part of this is that the primary goal of the multiple "normal" 
ImmutableList implementations is to save memory, and there's only so much we 
can do to preserve the memory savings while eliminating subclasses. I like your 
suggestion of eliminating EmptyImmutableList (with the caveat that the 
existence of other ImmutableList implementations may still be a megamorphism 
problem). Another way of doing that is to use a RegularImmutableList for empty 
lists.

The overall speed-space tradeoff is a hard one to evaluate, so I don't know 
exactly what we'll do here, but SingletonImmutableList, for one, looks tenuous. 
Thanks for supplying the numbers that have been lacking in our evaluation so 
far.

Original comment by cpov...@google.com on 28 Jan 2013 at 4:38

GoogleCodeExporter commented 9 years ago
Please try @Param-ing numLists with other values too. The benchmark shows the 
same tendency, but higher values of numLists show less and less difference 
compared to 100. Also, the singleton list may be an actual performance gain 
with higher values of numLists.

I actually don't know how to interpret that, so I'm just giving a lead here.

Original comment by ogregoire on 28 Jan 2013 at 5:12

GoogleCodeExporter commented 9 years ago
ImmutableList does seem likely to always have many subclasses -- mostly the 
ImmutableAsLists.  

Chris, I'm not sure I follow how ForwardingImmutableList would work, or would 
save anything?

I think we could safely eliminate EmptyImmutableList and its brethren -- as 
long as we maintain a singleton holding the empty ImmutableList instance, it 
will only require a constant amount of memory for the whole VM.

That said, I'm not 100% convinced the benchmark is valid, if only because the 
List interface itself is likely to have more implementations used than just 
ArrayList and ImmutableList...?  Most implementations will be List references, 
not ArrayList-specific references, so I'd expect most operations on a basic 
List to be polymorphic calls.

Original comment by lowas...@google.com on 28 Jan 2013 at 5:53

GoogleCodeExporter commented 9 years ago
> Chris, I'm not sure I follow how ForwardingImmutableList would work, or
> would save anything?

We can sacrifice performance on uncommon implementations by making the common 
call bimorphic:

ImmutableList has two subclasses, RegularImmutableList and 
ForwardingImmutableList. RegularImmutableList is used for 0-, many-, and 
perhaps 1-element lists. ForwardingImmutableList is used for everything else, 
including ImmutableAsList. ForwardingImmutableList is a wrapper around a plain 
(though immutable) List. Any other "ImmutableList implementations" are actually 
List implementations that get wrapped in a ForwardingImmutableList.
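
Schematically, that design might look something like this (my sketch of the 
proposal, with simplified stand-in names - not actual Guava code):

import java.util.List;

abstract class ImmutableListSketch<E> {
  abstract E get(int index);
  abstract int size();
}

// The common case: a flat array, used for 0-, many-, and perhaps 1-element lists.
final class RegularImmutableListSketch<E> extends ImmutableListSketch<E> {
  private final Object[] array;

  RegularImmutableListSketch(Object[] array) {
    this.array = array;
  }

  @SuppressWarnings("unchecked")
  @Override
  E get(int index) {
    return (E) array[index];
  }

  @Override
  int size() {
    return array.length;
  }
}

// Everything else: a wrapper around a plain (though immutable-in-practice) List.
final class ForwardingImmutableListSketch<E> extends ImmutableListSketch<E> {
  private final List<E> delegate;

  ForwardingImmutableListSketch(List<E> delegate) {
    this.delegate = delegate;
  }

  @Override
  E get(int index) {
    return delegate.get(index);
  }

  @Override
  int size() {
    return delegate.size();
  }
}

With only these two concrete subclasses, a call site that sees a mix is at 
worst bimorphic.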

> That said, I'm not 100% convinced the benchmark is valid, if only because
> the List interface itself is likely to have more implementations used than
> just ArrayList and ImmutableList...?  Most implementations will be List
> references, not ArrayList-specific references, so I'd expect most operations
> on a basic List to be polymorphic calls.

While we have been encouraging people to store references in fields of type 
ImmutableList, we can be sure that that advice isn't always followed. In any 
case, it would be nice if the people who did follow the advice could get a 
performance boost out of it.

But there may be only so much gain from any of these approaches. It may be that 
the right takeaway from all this is that our attempts to subclass ImmutableList 
for speed are fruitless because, even if we could eliminate the entire 90ns 
cost of "the method itself," the percent performance increase would be in the 
single digits. Despite years of advice that the immutable implementations are 
probably more efficient than the JDK classes, we've avoided writing these 
benchmarks. (Well, at least we have some for Set.)

Original comment by cpov...@google.com on 28 Jan 2013 at 6:28

GoogleCodeExporter commented 9 years ago
To be clear, I'm arguing that the benchmark *overestimates* the speed of 
ArrayList.  I also have to admit that I'd've assumed that if the 
ForwardingImmutableList procedure you describe worked, then the JIT would 
already have implemented a similar technique to deal with polymorphic calls 
when certain implementations were much more common than others.

Original comment by lowas...@google.com on 28 Jan 2013 at 6:34

GoogleCodeExporter commented 9 years ago
Given the discussion so far, I think there is a bit of a misunderstanding about 
what makes things slow, so here's a quick summary based on the comments above, 
and then the gory details follow in the next post.

>  Most implementations will be List
> references, not ArrayList-specific references, so I'd expect most operations
> on a basic List to be polymorphic calls.

The important thing for performance isn't whether the "proven" type of the 
reference has subclasses, but whether multiple implementations have actually 
been seen in practice _at some call site_. In the language of my post to 
follow, CHA doesn't apply if List has multiple implementations, but "inline 
caching" does, as long as only one implementation is seen in practice at each 
call site. That's why ArrayList is fast (one class returned by 
Lists.newArrayList()) and ImmutableList is slow (multiple implementations 
returned in practice for minSize in [0, 1] - though the two-implementation 
case is still optimized well).

This is shown clearly by the second benchmark - the type is now List<>, but 
ArrayList and ImmutableList (for length > 1) are still very fast (a 0.3 ns 
penalty per call, since CHA isn't in play).  The massive penalty incurred when 
more than one implementation of ImmList shows up is still being paid, and the 
relative cost is even greater (20x now vs 10x) since interface types are being 
used.

Declaring fields as ImmutableList vs List actually does give a performance 
boost, but only because ImmutableList is a class, and not an interface, so it 
can use the faster vtable dispatch vs the interface dispatch. That's secondary 
to the issue here - even virtual dispatch is very slow compared to inlined & 
fully optimized access.
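
A quick illustration of the dispatch difference (the comments note the bytecode 
each call compiles to):

import java.util.List;
import com.google.common.collect.ImmutableList;

class FieldTypes {
  ImmutableList<String> a = ImmutableList.of("x"); // a.size() is an invokevirtual
  List<String> b = ImmutableList.of("x");          // b.size() is an invokeinterface

  int sizes() {
    return a.size() + b.size();
  }
}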

> Maybe we can hide all the "weird" implementations behind a 
ForwardingImmutableList
> that delegates to non-ImmutableList-but-actually-immutable delegates? Or 
maybe we
> should focus our attention on immutable classes with fewer implementations 
(ImmutableMultimap)?

As above, the issue isn't total implementations, but total implementations 
likely to be seen at one call site.  I focused on the types returned by 
ImmutableList.of() because three different implementations can be transparently 
returned by .of(), completely undocumented, so you can get different concrete 
classes even if you did your best to avoid it.  Such instances are very likely 
to end up at the same call site, and cause poor performance, compared to the 
other implementations, which will often never appear at any call site (see the 
next part for an important clarification about "call site").

The other implementations, while common in some cases, only arise as the result 
of distinct java call sequences, so are likely to be limited to many fewer 
_effective_ call sites (the difference between java-level call sites and 
effective call sites after inlining is critical, so see the last part of the 
next post).

Original comment by travis.d...@gmail.com on 30 Jan 2013 at 6:17

GoogleCodeExporter commented 9 years ago
Here are the gory details about JIT method inlining & optimization in general. 
When I say "JIT" here, I mean hotspot, but I don't think it is going to vary 
too much across other state-of-the-art JITs out there - or at least they'll 
usually be worse in specific cases, not better. This should help you when you 
are analyzing the tradeoffs to make.

Unlike, say, C++, every non-final instance method in Java is potentially virtual 
- that is, polymorphic dispatch is the default.  Naively implemented, these 
calls are brutal for performance - partly because the dispatch requires some 
indirection but *mostly* because they aren't inlined.  Once a method isn't 
inlined you lose all sorts of optimizations involving the caller and callee, so 
inlining is really the "granddaddy" optimization that enables the rest of the 
optimization family to kick in for hot code with method calls. It's worth 
noting that the fast cases for my benchmark take between 0.6 ns and 0.9 ns per 
call, which is exactly 2 or 3 cycles (3.33 GHz box) - pretty amazing given that 
there are a dozen bytecodes or more involved in each loop.  The JIT has done a 
good job!

Because inlining is so important and virtual methods are so pervasive, a key 
optimization for the JIT is "devirtualization" - taking a method which is 
polymorphic in theory and compiling it as if the actual implementation were 
known, including inlining it. It does this through two very different 
mechanisms, with different behaviors and different implications for class 
hierarchy design, which I will describe below:

Class Hierarchy Analysis (CHA):

This is the trick that makes the final keyword on methods mostly useless for 
performance, despite lots of out-of-date advice to the contrary (it's still 
useful as an enforced restriction at compile time). If a method m() is declared 
final on class C, then the compiler *knows* that for an object of type C or a 
subclass (but not a superclass), a call to m() will always resolve to the 
single final implementation, and it can be fully inlined and optimized.

Most methods, however, are not declared final, but are _effectively_ final 
since they have no overrides in the currently loaded set of classes.  Since the 
JVM knows, at all times, the complete set of loaded classes, it can perform 
this "is effectively final" analysis for each method.  If m() has no overrides 
in any loaded class, it can be optimized just as if it was final.

The set of loaded classes can change, of course, so CHA-based assumptions 
that were valid when some code was compiled may become invalid later.  This 
is handled through deoptimization when a class is loaded that violates earlier 
CHA assumptions. The deoptimization always occurs at a safepoint, so it 
piggy-backs on the existing safepoint-polling mechanism and basically means 
that CHA is "free" from the point of view of the compiled code - there is no 
trace of the fact that the CHA could become invalid in the generated assembly, 
other than safepoint polling, which is required regardless.

OK, so the key points about CHA based devirtualization:

1) It applies to the entire VM, per method - either a method can be 
devirtualized via CHA everywhere, or nowhere (actually it could be 
per-classloader, but the general idea that it's a global optimization is the 
same).
2) As a consequence of (1), if any class is loaded that overrides a method, CHA 
for that method is invalid everywhere the method is called, even if some call 
site never actually sees the overriding class.
3) CHA works on a *method* basis - if you have a class with M methods, and a 
subclass that overrides N of them (N < M), CHA is still applicable to 
the non-overridden methods (see the sketch below).
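
To make point 3 concrete (a hypothetical pair of classes):

class A {
  int m() { return 1; } // overridden below: loading B invalidates CHA for m() everywhere
  int n() { return 2; } // no override in any loaded class: still devirtualizable via CHA
}

class B extends A {
  @Override
  int m() { return 3; }
}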

CHA is great when it works, but it is pretty useless for this issue.  As Louis 
points out above, interfaces like List will *always* have multiple loaded 
implementations, so CHA will almost never apply for List. It can still apply 
when the type of the object can be proven to be more specific than List, which 
is what occurs in the first benchmark: a concrete List implementation, like 
ArrayList, may be subject to CHA if no loaded class extends it (as in the 
benchmark), but in most large projects you are probably going to have someone 
who extends it (BTW, this is a good reason to make performance-sensitive 
classes final).

The only time CHA shows up in the benchmarks above is the "fast case" in the 
first benchmark, with all lists of size 3 or 4.  ArrayList gets 60 ns, while 
ImmList gets 90 ns, the difference being because of CHA which works for 
ArrayList, but not for ImmList, which has multiple loaded (but not used, except 
perhaps internally in guava) implementations.  In the second benchmark, where 
the array type is List rather than ArrayList/ImmutableList, CHA doesn't apply 
(since the call is against List, which does have multiple implementations).  So 
CHA isn't big-picture important here, because it only comes into play in the 
60 ns vs 90 ns difference (which is 0.3 ns per call, since 100 lists are 
involved - 0.3 ns is exactly one cycle).

Next, on to the optimization that is important here: inline caching.

This optimization applies when CHA can't - that is, when a method has overrides 
in loaded classes.  So it applies all the time for things like List, which 
always have multiple implementations.  In this case, the JIT takes advantage of 
the fact that even though overrides may exist, at any particular call site (a 
call site being a possibly inlined method invocation in the JIT'd assembly) 
there is often exactly one implementing class in practice.  For ArrayList this 
is always true in my benchmark, and it is true for the minsize 3,4 case for 
ImmList.  Take a look at your codebase (not guava, though, since you don't know 
who is calling your methods!) for say 10 random calls of List.get() or 
List.size() and trace back all the possible origins of the List in question - 
you'll often find that they'll always be a certain type.

What the JIT does here is look at the profiling information gathered during 
the interpreted phase, and if only 1 or 2 types of objects have been seen at 
that call site, optimistically assume that the type of the object will be 
those 1 or 2 types (the monomorphic and bimorphic cases). This allows it to 
inline/optimize the method pretty much in the same way as CHA, above.  The 
difference is that the assumption might be wrong, and there is no classloading 
event & safepoint poll that can be hooked to guard this case, so the JIT 
actually needs to insert a check in the generated assembly to ensure the type 
of the object is one of the assumed type(s), immediately before the actual code 
of the method. If this check fails, the method is de-optimized and recompiled 
with the new information. At most this can occur twice: mono -> bi, then bi -> 
mega.
 In the benchmarks above, this is the reason for the 90 ns ImmList times vs the 
60 ns ArrayList times - ArrayList was using CHA, while ImmList needed an inline 
cache, and the extra check & branch cost one cycle.  Still extremely fast.  It's 
also why the ImmList minSize=1 case takes 107 ns vs the 90 ns ImmList baseline - 
we can still use inline caches, because we are bimorphic (the singleton ImmList 
and the regular ImmList), but we need an extra check to cover the two cases 
(the branch is well predicted because the pattern of lists is completely 
regular, but if it weren't, this case would suck too).
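
Expressed as Java pseudocode, the guard a monomorphic inline cache compiles 
down to is roughly this (my illustration, not actual generated code):

import java.util.ArrayList;
import java.util.List;

class InlineCacheShape {
  static int sizeOf(List<?> list) {
    if (list.getClass() == ArrayList.class) { // the one-word inline-cache check
      return ((ArrayList<?>) list).size();    // inlined, fully optimized body
    }
    // Guard failed: the real JIT deoptimizes and recompiles here; a plain
    // virtual/interface dispatch stands in for that slow path.
    return list.size();
  }
}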

When you have more than 2 implementations observed at a call site, however, the 
JIT gives up - it doesn't inline the method and must compile an expensive 
virtual method call.  If the underlying type is a class, this involves some 
indirection and parameter marshaling; if it is an interface, it involves that 
plus a linear search through the implemented interfaces to find the right one 
(because the class hierarchy is a tree, but the interface + class hierarchy is 
a DAG).  This case is the ~1170 ns and ~2060 ns ImmList cases.  The call site 
has > 2 implementations of ImmList (or List in the second benchmark), so it is 
megamorphic not only in practice, but in theory.  The second benchmark is 2x as 
bad as the first because the proven type of the receiver (the object on which 
the method is called) is List, an interface, not ImmutableList, an (abstract) 
class, so it has to pay the additional interface-search cost.

The key points about inline caches:

1) It applies only when CHA doesn't - that is, when there are multiple 
implementations of a class/interface.  More precisely, when a method is called 
where there are multiple implementations of that particular method in 
subclasses of the _proven_ type of the receiver.
2) It applies on a per-call-site basis.  Every call site is treated 
independently by the above.  So the optimization can apply in many cases for 
some method n(), while failing at particular call sites of n() where it 
happens that multiple receiver types are observed.  So global program behavior 
won't, in many cases, affect particular call sites.
3) It applies with 1 or 2 receiver types, not more (CHA applies with exactly 1 
type).
4) It is based on the type of the object, not the specific method.

Of the points above, (2) and (4) are particularly important here.

(2) means that most of the discussion above about how many implementations of a 
class/interface exist is probably wrongly aimed - it's not how many exist, but 
how many actual instances of those implementations appear at a particular call 
site.  In the benchmark, an ArrayList is always actually an ArrayList, because 
Lists.newArrayList() always returns an ArrayList. The ImmutableLists, however, 
even though they all originate from the same factory method, 
ImmutableList.of(), vary in their most-derived types! This is why I focused on 
the types returned by ImmutableList.of() - even though other specializations of 
ImmutableList exist, they are returned in specific cases and involve a different 
call path from a usual creation (sublist, reverse calls). On the other hand, 
three different implementations can be transparently returned by .of(), 
completely undocumented, so you can get different concrete classes even if you 
did your best to avoid it.  Such instances are very likely to end up at the same 
call site, even in high-performance code, and cause poor performance, compared 
to the other implementations, which will often never appear at any call site 
(see the next part for an important clarification about "call site").

The above makes clear that "call sites" are critically important.  What is a 
distinct call site? A sufficient condition is usually an actual call of a java 
method in the java code:

static <E> boolean isInEither(List<? super E> list1, List<? super E> list2, E elem) {
  return list1.contains(elem) || list2.contains(elem);
}

However, this isn't a necessary condition.  For example, the method above 
contains two java-level call sites for contains() on the same line.  This 
method is extremely simple, probably less than 35 bytecodes, and certainly less 
than 350 bytecodes, so it will generally be inlined into any callers.  This 
means that the contains() calls (which may or may not themselves be inlined, but 
it doesn't matter here) will actually appear at each call site of the 
isInEither() method, meaning that different invocations of the method don't 
really interfere with each other.  If one method wants to use ArrayLists, and 
one method passes LinkedLists, no problem: they don't de-optimize each other's 
inline caches, because due to inlining the "effective" call site is different - 
much more fine-grained than the (single) call site in the java code. This type 
of thing is usually true for performance-sensitive code: inlining means that 
the effective call site is moved up the call stack to a method that is too 
large to be inlined, which for hot code is around ~300 bytecodes. The exact 
behavior depends on JVM arguments and can probably only be exactly determined 
with -XX:+PrintCompilation.
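
To illustrate with hypothetical callers of isInEither() above (assuming the 
usual java.util imports):

// After isInEither() is inlined into each caller, its contains() call sites
// are duplicated per caller, so these two methods don't share inline caches.
static boolean checkArrayLists(ArrayList<String> a, ArrayList<String> b) {
  return isInEither(a, b, "x"); // this effective call site only ever sees ArrayList
}

static boolean checkLinkedLists(LinkedList<String> a, LinkedList<String> b) {
  return isInEither(a, b, "x"); // this one only ever sees LinkedList
}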

(4) means that, unlike CHA, for inline caches what matters is only whether you 
have seen multiple subclasses at the call site, not whether those subclasses 
actually override the called method.  For example, imagine you have a subclass 
of ArrayList which doesn't override any methods but implements the Comparable 
interface (it doesn't matter why you have a subclass; I'm just trying to come 
up with a realistic reason).  For your new class, every method in List is 
implemented by the ArrayList method, not by any new method.  However, if you 
mix both at a call site, the call will be bimorphic, not monomorphic, based on 
inline caching.  Under CHA, a similar situation would be monomorphic, because 
CHA can tell that even though there is a subclass, it doesn't override the 
method of interest.  The way the "inline cache" is implemented, however, is as 
a simple single-word check of the expected class object versus the actual class 
object - this happens in one cycle, as above.  There is no general way to 
efficiently implement the more complicated question "does the receiver override 
method x", and the non-general ways probably aren't interesting enough that the 
JIT has implemented them yet.
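
The hypothetical ArrayList subclass from that example might look like this:

import java.util.ArrayList;

// Overrides nothing from ArrayList or List, yet its distinct class still
// causes inline-cache misses when mixed with plain ArrayLists at a call site.
class ComparableArrayList<E> extends ArrayList<E>
    implements Comparable<ComparableArrayList<E>> {
  @Override
  public int compareTo(ComparableArrayList<E> other) {
    return Integer.compare(size(), other.size());
  }
}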

The key part of (4) is that unless CHA is in play you have to talk about types 
- whether they have subclasses or not, and not whether methods have overrides 
or not. This is a big downside of inline caches - even if you can design a 
memory-efficient subclass, which overrides very few methods, you'll pay the 
price on every method call of the superclass, not just on those that you 
override.

Original comment by travis.d...@gmail.com on 30 Jan 2013 at 6:54

GoogleCodeExporter commented 9 years ago
All that said, if I were building this myself, here are the choices I would make:

Two concrete implementations of ImmutableList:

RegularImmutableList (the current implementation)
ForwardingImmutableList (similar to the proposal above, but clarified below)

OtherImmutableList can be implemented exactly as a delegate, like:

class OtherImmutableList extends ImmutableList {

  final OtherImmutableListDelegate delegate;

  OtherImmutableList(OtherImmutableListDelegate delegate) {
    this.delegate = delegate;
  }

  @Override
  public int size() {
    return delegate.size(); // every method simply forwards to the delegate
  }
}

I would implement the size 0 and 1 cases of ImmutableList in terms of 
RegularImmutableList - size 1 not being special at all, and size 0 always 
returning an empty-array instance of RegularImmutableList (so being very fast 
and having zero memory cost).
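
A minimal sketch of that, with hypothetical names (not Guava's actual internals):

final class RegularListSketch<E> {
  // One shared, empty-array instance serves every 0-element list:
  // constant memory cost for the whole VM.
  static final RegularListSketch<Object> EMPTY = new RegularListSketch<>(new Object[0]);

  final Object[] array;

  RegularListSketch(Object[] array) {
    this.array = array;
  }

  @SuppressWarnings("unchecked")
  static <E> RegularListSketch<E> of() {
    return (RegularListSketch<E>) EMPTY;
  }

  static <E> RegularListSketch<E> of(E element) {
    return new RegularListSketch<>(new Object[] {element}); // size 1: nothing special
  }

  int size() {
    return array.length;
  }
}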

The benefit of this is that it keeps the common case - regular immutable 
lists, even of size 0 and 1 - on a single implementation.  You'll likely find 
this faster than the current implementation when 0-element lists mix with 
non-zero lists, yet it uses no more memory unless 1-element lists are common, 
since the 0-element list is a singleton.  That is, I disagree with ogregoire's 
assertion that you'll find Empty faster for larger list sizes - this won't 
happen except in extreme cases where the entire working set can't fit in L3 
and the Empty list avoids one cache miss (it doesn't need to examine .length).

For the unusual cases, things will be about as fast as possible. As long as 
only one "other" implementation occurs at a call site, you'll end up with a 
bimorphic check + monomorphic check and branch.  I'm just guessing here (not 
online), but something like 150-200 ns vs 90 ns for the monomorphic case.  
Branch prediction will help a lot here, since unlike the JIT it is dynamic for 
the life of the process and can recognize patterns.  If you have mostly regular 
ImmLists mixed with a few "other" lists (of any type), you'll get close to 90 
ns, since the JIT will set up the branches right and branch prediction will 
get it right most of the time.

Original comment by travis.d...@gmail.com on 30 Jan 2013 at 7:09

GoogleCodeExporter commented 9 years ago
Sorry - above, OtherImmutableList == ForwardingImmutableList; I made a 
last-second edit that I can't fix now.

Original comment by travis.d...@gmail.com on 30 Jan 2013 at 7:10

GoogleCodeExporter commented 9 years ago
Thanks for all the details. I'm bookmarking this for future reference.

When it comes to other implementations, my concern is less with reverse() and 
other oddballs and more with RegularImmutable*As*List, which is what you get 
from calling ImmutableList.copyOf(immutableSetOfMultipleElements). I'm 
envisioning a constructor that accepts an Iterable and calls copyOf() on it. 
Some callers will pass an ImmutableSet and get RegularImmutableAsList; others 
will pass other types and get RegularImmutableList.
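
For instance (a hypothetical constructor of the kind described):

import com.google.common.collect.ImmutableList;

class Holder<E> {
  final ImmutableList<E> items;

  Holder(Iterable<E> source) {
    // An ImmutableSet argument yields a RegularImmutableAsList; other inputs
    // yield a RegularImmutableList - so call sites reading `items` can end up
    // seeing both concrete classes.
    this.items = ImmutableList.copyOf(source);
  }
}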

I have no idea how common this is. That's the source of my fears that 
ForwardingImmutableList may be common. (Even if it is, will the objects that 
end up holding RegularImmutableList and the objects that end up holding 
RegularImmutableAsList end up being the same "call sites," or will inlining 
separate them? That's even harder to say. It's plausible, though.)

Having two types is, of course, a problem addressed by your solution of 
RegularImmutableList+ForwardingImmutableList. And now, by eliminating 
SingletonImmutableList, we're making tradeoffs between memory and CPU. Hmm.

Original comment by cpov...@google.com on 30 Jan 2013 at 3:31

GoogleCodeExporter commented 9 years ago
tl;dr
https://wikis.oracle.com/display/HotSpotInternals/PerformanceTechniques

Original comment by gak@google.com on 30 Jan 2013 at 4:31

GoogleCodeExporter commented 9 years ago
@cpovirk: I'm fairly convinced that making sure the "common case" -- a vanilla 
ImmutableList -- is monomorphic should be our primary priority?

Original comment by lowas...@google.com on 30 Jan 2013 at 5:46

GoogleCodeExporter commented 9 years ago
Yeah, I guess the ImmutableAsList stuff can definitely cause issues too, but as 
you point out it all comes down to inlining: if the methods of the class that 
takes an Iterable in its constructor are small enough to inline into callers, 
different users of that class shouldn't be impacted too much.

I see the behavior of ImmutableList.copyOf() as worse in the plain iterable 
case, however - you can pass in a series of identically typed Iterables, 
varying only by length, and get different implementations back.

At least in the ImmutableSetAsList case, the caller shares some responsibility 
because he is passing in different types as input in the first place.

Original comment by travis.d...@gmail.com on 30 Jan 2013 at 6:11

GoogleCodeExporter commented 9 years ago
@lowasser, keep in mind that this whole issue is a tiny slice of the larger 
performance problem.  We have a benchmark (that needs review) that shows 
certain characteristics for _runtime_ performance for a certain invocation 
pattern.  In general, we've made the choice to prioritize memory consumption 
over CPU for both the raw # of bytes in the heap and the overhead of GC.  We 
haven't even started the discussion about CPU cost relative to any possible 
memory implications.

Basically, I think this is a great discussion that draws the conclusion that 
"all things being equal, monomorphic calls produce faster code than megamorphic 
calls".  I agree.  The JDK developers agree.  We just need to make sure that 
we're keeping the larger performance picture in mind.

Making ImmutableList monomorphic shouldn't be our "primary priority" as it is a 
means, not an end.  Making certain calls monomorphic that are likely to be 
invoked in certain patterns is as good of a strategy as any other provided 
informed decisions are made about the trade-offs.

Original comment by gak@google.com on 30 Jan 2013 at 7:06

GoogleCodeExporter commented 9 years ago
Well, at least the case of EmptyImmutableList isn't a case of favoring memory 
use over CPU, since it's a singleton. So at least there the intent must have 
been a faster specialization, since all methods can be hardcoded. 

That seems like a straightforward win to remove - no increase in memory use 
(ok, a few bytes process-wide) and it should be faster in practice, except in 
degenerate cases like a huge number of empty lists. 

The size 1 case is definitely more subtle - I'd be curious whether your internal 
profiling shows these make up a large amount of aggregate memory use. 

Original comment by travis.d...@gmail.com on 31 Jan 2013 at 4:30

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:13

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:18

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:08