Iterators.removeAll() could be faster if it used a temporary HashSet

GoogleCodeExporter commented 9 years ago

There is a a performance problem in Iterators.removeAll().  It appears
in the latest revision (60fabfb..).  The problem is similar to Issue
1085, which is currently under research.  I attached a test that
exposes this problem and a patch that fixes it.  For this test, the
patch provides a 82X speedup.

To run the test, do:

$ java Test

The output for the un-patched version is:
Time is 2814

The output for the patched version is:
Time is 34

The problem is that 
"Iterators.removeAll(Iterator<?> removeFrom, Collection<?> elementsToRemove)"
can have quadratic complexity.  For each element in "removeFrom",
"Iterators.removeAll(...)" calls "elementsToRemove.contains(...)",
and the latter has linear complexity when "elementsToRemove" is a List.

I attached two patches similar to the patches in Issue 1085.  Both
patches implement a linear time algorithm using a HashSet.
patchSmall.diff populates the HashSet eagerly, and patchFull.diff
populates the HashSet lazily.  patchFull.diff is faster than
patchSmall.diff when "elementsToRemove.contains(...)" returns "true"
after inspecting only a few elements, though in most cases they are
equally fast.  I am including patchSmall.diff because the code is
smaller (it has only one line).

Original issue reported on code.google.com by adi1756...@gmail.com on 12 Sep 2012 at 1:56

Attachments:

GoogleCodeExporter commented 9 years ago

I'm of the opinion that the current behavior of methods like containsAll and 
removeAll is exactly what one would expect (particularly given that 
AbstractCollection does the exact same thing) and that wrapping the input in a 
HashSet before passing it to the method if you want to guarantee linear 
complexity is simple, clear and gives the user the choice in the tradeoff 
between time complexity and using extra space. I'm tempted to mark this as 
WorkingAsIntended, but I'll let others chime in if they have any opinions.

Original comment by cgdec...@gmail.com on 12 Sep 2012 at 2:49

Added labels: Package-Collect, Type-Performance

GoogleCodeExporter commented 9 years ago

I concur with Colin.  This is WorkingAsIntended.

Original comment by lowas...@google.com on 12 Sep 2012 at 2:57

GoogleCodeExporter commented 9 years ago

> wrapping the input in a HashSet before passing it to the method if
> you want to guarantee linear complexity is simple

You are right, it's simple to fix.  However, developers don't think at
the complexity of each method *call* each time they *type* one.
Coding like this would impossible.  In fact, libraries should simplify
coding and hide complexity from developers.

If we can make a slow method fast, why not do it?  Especially when the
fix is one line of code in the library, and it hides complexity from
the developer.

Original comment by adi1756...@gmail.com on 12 Sep 2012 at 3:29

GoogleCodeExporter commented 9 years ago

I don't think any of us can guarantee that there aren't users, in the wild, 
passing TreeSets to this method -- with comparators that aren't consistent with 
equals.

Are those users being a little bit naughty?  Yes.  Do those users deserve to be 
broken?  I don't think so.

Even worse, those users couldn't figure out their problem without going into 
the Guava source, which makes me deeply uncomfortable.  Users should never 
*have* to go into the Guava source to figure out their problem.

Original comment by lowas...@google.com on 12 Sep 2012 at 4:11

GoogleCodeExporter commented 9 years ago

Related to this topic, it would be good to add a LazyHashSet class in
Guava.  This would help developers optimize their code from the call
site, like you suggested.  A LazyHashSet class would help both in
Issue 1144 and in Issue 1085.

Currently there is no way to populate a hash lazily from the call site
(like in patchFull.diff).  However, populating the hash lazily has
higher performance than populating it eagerly (see Comment 11 in Issue
1085).

For example, with a LazyHashSet class, the call in Issue 1144 would
look like: 
"Iterators.removeAll(removeFrom, new LazyHashSet(elementsToRemove))".

Original comment by adi1756...@gmail.com on 12 Sep 2012 at 4:23

GoogleCodeExporter commented 9 years ago

From Comment 4:
> passing TreeSets to this method 

TreeSet is not a List.  The patch explicitly checks if
"elementsToRemove" is a List, and only then the patch triggers the
fast execution path (that uses the HashSet).  Therefore, passing
TreeSets works fine because it executes the current code.

Original comment by adi1756...@gmail.com on 12 Sep 2012 at 7:48

GoogleCodeExporter commented 9 years ago

(Also, thanks to protobuf we also have to consider the (rare) case of elements 
that don't implement hashCode at all.)

Original comment by kevinb@google.com on 12 Sep 2012 at 10:15

Changed title: Iterators.removeAll() could be faster if it used a temporary HashSet

GoogleCodeExporter commented 9 years ago

If it's difficult to fix in the callee, maybe we should explicitly
document this quadratic behavior in the method documentation.  Here is
an example inspired from ListUtils in Apache Collections:

......................................................................
 * <p> 
 * This implementation iterates over <code>removeFrom</code>, checking
 * each element in turn to see if it's contained in 
 * <code>elementsToRemove</code>.  If it's contained, it's removed
 * from <code>removeFrom</code>.  As a consequence, it is advised to 
 * use a collection type for <code>elementsToRemove</code> that 
 * provides a fast (e.g. O(1)) implementation of
 * {@link Collection#contains(Object)}.
......................................................................

Also, can you add the LazyHashSet class (Comment 5), such that the
caller can build the hash lazily?  Right now, the user is forced to
build the HashSet eagerly, which is slower (up to 60X slower, Issue
1085 Comment 11) than building it lazily.

Original comment by adi1756...@gmail.com on 12 Sep 2012 at 10:52

GoogleCodeExporter commented 9 years ago

Certainly not without doing more thorough (and reliable) benchmarking than the 
technique used in that comment, I suspect.  Can you put together a benchmark 
with Caliper?  (https://code.google.com/p/caliper/)

Original comment by lowas...@google.com on 12 Sep 2012 at 11:41

GoogleCodeExporter commented 9 years ago

Yes, sure, I will upload one in a few days, maybe a week.

Original comment by adi1756...@gmail.com on 13 Sep 2012 at 10:42

GoogleCodeExporter commented 9 years ago

I uploaded a Caliper benchmark (Issue 1085 Comment 13) similar to the
non-Caliper benchmark in Issue 1085 Comment 11.  The results are the
same as for the non-Caliper version: the code using a lazy hash set is
faster then both the code using an eager hash set and the code using
the current quadratic implementation (full details in Issue 1085
Comment 11).  This quantitative data shows it makes sense to allow
developers to speed up their code using a LazyHashSet.

Original comment by adi1756...@gmail.com on 25 Sep 2012 at 10:32

GoogleCodeExporter commented 9 years ago

I attached a sketch for the LazyHashSet class that implements lazy
"contains" and "containsAll".  This is an example usage:

"LazyHashSet.create(list).containsAll(other);"

instead of:

"list.containsAll(other)"

Original comment by adi1756...@gmail.com on 2 Oct 2012 at 2:23

Attachments:

LazyHashSet.java

GoogleCodeExporter commented 9 years ago

We intend for these methods to do the predictable thing. If anything more 
complex is required, we suggest using your own implementations.

Original comment by kak@google.com on 22 Aug 2013 at 10:14

Changed state: WorkingAsIntended

GoogleCodeExporter commented 9 years ago

This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:13

Added labels: MigratedToGitHub

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:08

Added labels: Restrict-AddIssueComment-Commit

dnrajugade / guava-libraries

Iterators.removeAll() could be faster if it used a temporary HashSet #1144