Closed by GoogleCodeExporter 9 years ago
We have a few uses for such a thing ourselves. I started work on it once upon
a time...
The name has always bugged me, though; "PartitionedSet" would seem to make more
sense, but perhaps it's not worth going against the grain.
Original comment by kevinb@google.com
on 13 Jan 2011 at 1:25
for the general interest:
http://en.wikipedia.org/wiki/Disjoint-set_data_structure
Original comment by kevinb@google.com
on 13 Jan 2011 at 1:26
Any thoughts on what the API would look like?
Original comment by wasserman.louis
on 2 Feb 2011 at 6:05
Disturbingly, I can't find my work on it anywhere.
Rough thoughts were: DisjointSet<E> interface containing the read-only
methods. One to view as a Set<Set<E>>, one to view the union Set<E>, one to
get the Set<E> for a given E, one to view the entire thing as an Equivalence<E>.
Mutable implementation, with or without specialized mutable interface. Method
to declare e1 and e2 to be equivalent. This is where the traditional
union-find algorithm ends up. Method to remove an E?
Immutable implementation. Presumably backed by an ImmutableMap<E,
ImmutableSet<E>>.
Various other methods I forgot.
Louis, if you are up for taking a crack at this, this is something I feel we
could push through.
Original comment by kevinb@google.com
on 3 Feb 2011 at 3:07
Absolutely up for it -- after my paper due Friday. ;)
Original comment by wasserman.louis
on 3 Feb 2011 at 3:45
[deleted comment]
[deleted comment]
[deleted comment]
The key thing in this structure is that the *elements* point to the set,
instead of the opposite way which is what we typically mean when we say "a
Set". So, please, don't subtype Set, don't try to shoehorn an iterator or a
remove operation. The result would be negating the strengths of the structure;
the quintessential union/find is best.
After some digging, here's an implementation of my own, some 4-5 years old.
http://code.google.com/p/flexigraph/source/browse/trunk/src/gr/forth/ics/util/Partition.java
(It seems I took the time to implement the best find() version, instead of the
easiest, which is nice; single-pass, instead of two-pass or recursive).
Let me make a concrete recommendation now though. I'll call the structure
"Partition", below:
* Don't make Partition point to a user value. This is completely unnecessary; the user will associate each value to its Partition even if Partition offers the other direction. (I made this mistake)
* Following from the above: don't introduce a type parameter. This artificially makes it difficult to merge Partition<S> and Partition<T>. (I made this mistake too).
* No "iterator" or anything that requires the set to point to the elements, which defeats the purpose of the structure.
* Expose the equivalence relation only. A singleton Equivalence<Partition> is a fine way to do it. Having Equivalence<Partition<?>> would look ugly, for a reason (because the type parameter isn't needed!). Don't expose the representative of an element (btw, remember to do the self-loop trick to denote the root). Other than the equivalence, just singleton() and merge() need to be offered.
* Finally, consider offering this method:
public static <E> Equivalence<E> partition(Function<E, Partition> partitioner);
Which maps the equivalence relation of Partition to an equivalence relation of
referrers of Partition.
I don't care too much about the "Partition" name, but I do care about the rest.
Make a guy with a weak spot for graph algorithms happy - otherwise I'll go and
whine to Robert Tarjan himself. Seriously. (Ok, perhaps not that seriously, but
seriously, I'll be sad, this is a gem of a structure, let's preserve that).
Original comment by jim.andreou
on 3 Feb 2011 at 6:20
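The partition(...) adapter suggested above can be sketched concretely. This is only an illustration, not Guava code: java.util.function.Function and a minimal Equivalence interface stand in for the Guava types, and the nested Partition (self-loop root trick, path halving, no rank) is a hypothetical bare-bones union/find.

```java
import java.util.function.Function;

final class PartitionSketch {
    // Bare-bones union/find node; the self-loop denotes the root.
    static final class Partition {
        private Partition parent = this;

        Partition find() {
            Partition p = this;
            while (p.parent != p) {
                p.parent = p.parent.parent; // single-pass path halving
                p = p.parent;
            }
            return p;
        }

        // Merges the two equivalence classes (no rank, for brevity).
        void merge(Partition other) {
            other.find().parent = find();
        }
    }

    // Stand-in for Guava's Equivalence.
    interface Equivalence<T> {
        boolean equivalent(T a, T b);
    }

    // Lifts Partition's equivalence relation to referrers of Partition.
    static <E> Equivalence<E> partition(Function<E, Partition> partitioner) {
        return (a, b) ->
            partitioner.apply(a).find() == partitioner.apply(b).find();
    }
}
```

A caller holding a Map<E, Partition> can pass map::get as the partitioner and compare elements directly, never touching find().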
Ok, let's make sure we sort out what to do before you dig into this, Louis.
Original comment by kevinb@google.com
on 3 Feb 2011 at 2:02
I swear, I have a weaker spot for graph algorithms than you, and my biggest
soft spot of all is for data structure gems!
I like some ideas from your approach, Jim, but I think it can be made
considerably simpler. Specifically, I think the attached is pretty much your
approach taken to its logical conclusion -- to remove the elements from the
picture entirely, and treat the partitions as objects in their own right. The
effect of p1.combine(p2) is to guarantee that p1.equals(p2) is true for all
time, while maintaining that equals() is an equivalence relation. Elements
don't come into it at all.
Original comment by wasserman.louis
on 5 Feb 2011 at 12:32
Attachments:
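Since the attachment did not survive the export, here is a reconstruction of what a Partition along the lines described above might look like: combine() merges classes, equals()/hashCode() delegate to the representative, with union by rank and single-pass path halving. Names and details beyond the thread's discussion are assumptions, not the actual attached class.

```java
// Partitions as plain objects: the effect of p1.combine(p2) is that
// p1.equals(p2) holds for all time, and equals() remains an
// equivalence relation. Elements don't come into it at all.
final class Partition {
    private Partition parent = this; // self-loop marks the root
    private int rank = 0;

    // Single-pass path halving: each visited node is re-pointed at its
    // grandparent, flattening the tree as a side effect of lookup.
    private Partition find() {
        Partition p = this;
        while (p.parent != p) {
            p.parent = p.parent.parent;
            p = p.parent;
        }
        return p;
    }

    // Returns true if the two classes were distinct before the call.
    public boolean combine(Partition other) {
        Partition r1 = find(), r2 = other.find();
        if (r1 == r2) {
            return false;
        }
        if (r1.rank < r2.rank) {         // union by rank: attach the
            r1.parent = r2;              // shallower tree under the deeper
        } else if (r1.rank > r2.rank) {
            r2.parent = r1;
        } else {
            r2.parent = r1;
            r1.rank++;
        }
        return true;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Partition && find() == ((Partition) o).find();
    }

    @Override
    public int hashCode() {
        return System.identityHashCode(find());
    }
}
```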
I like the simplicity of equals(), heck, that's supposed to model an
equivalence class, and Equivalences already play nice with it, so it's fine.
About find(), I see you implemented the recursive one for simplicity (or
laziness :)) - but we ought to provide the fastest.
Now, about the rank... I'm on the fence. I think without the rank optimization,
one gets polylogarithmic complexity, instead of the inverse Ackermann (together
with path compression). The rank itself "ought" to increase the memory overhead
(I'm certain that some people more knowledgeable than me asserted that), but
I'm not so sure, it must be VM dependent. On the VMs I'm usually on, there
would be no observable difference to me, due to aligning effects (the existence
of the Partition pointer would bump the cost of the object from 8 bytes to 16
bytes, so in this case, the second field is free).
But without that field, it would be possible to pull off a concurrent,
non-blocking implementation (just via compare-and-swap on the pointer), which
would be nice, because this structure does play well with partitioning things
to parallelize, then merge the sets to compute some global result.
So, how about having an interface for this, making a sequential implementation
w/ the rank, and leaving the door open for a non-blocking one without the rank.
Btw, I was quick to play down the "Partition" name, but I think it's quite good
(the other good names would be "Class" and "Set", obviously poor choices in
this context :)).
(PS: I'm _not_ going to argue with you about who is more addicted to graph
algorithms :P )
Original comment by jim.andreou
on 5 Feb 2011 at 1:01
Well, yes, I was going for a quick-and-dirty implementation, so I was somewhat
lazy.
I'm inclined to do the sequential implementation for the moment, with the
lovely inverse-Ackermann complexity.
Original comment by wasserman.louis
on 5 Feb 2011 at 1:08
I like the sound of the (very different from what I had been thinking!)
direction you guys are taking this.
Original comment by kevinb@google.com
on 5 Feb 2011 at 2:07
Oh, yeah. I went through several drafts of my first response that were along
the lines of "You're on crack! What a ridiculous data structure!" Then it was
like, oh man, that's actually not half bad. Then it was like, but this could
be so much simpler! Bam! Bam! .equals()!
So yeah.
Original comment by wasserman.louis
on 5 Feb 2011 at 5:06
Oh boy. How old is this guy? :)
Original comment by jim.andreou
on 5 Feb 2011 at 5:19
More seriously, let me back off that interface idea. Rethinking it, that would
create ugly problems, like sequential.{merge,equals}(nonSequential); // oops
So let's just have a single concrete class; if we need the other, we create a
second, unrelated class.
Original comment by jim.andreou
on 5 Feb 2011 at 5:24
I think that's definitely the way to go, and I think a sequential version is
the right default.
Original comment by wasserman.louis
on 5 Feb 2011 at 5:50
Just so I don't have this in my mind: Louis? Is this something we should expect
you to give a full contribution of? (I mean the final product: optimized,
documented - especially hashCode() and equals() - with some tests). Is it something
you're already working on?
Original comment by jim.andreou
on 14 Feb 2011 at 6:31
Yeah, I'm doing the whole shebang.
Original comment by wasserman.louis
on 14 Feb 2011 at 6:39
Ok, great, thanks!
Original comment by jim.andreou
on 14 Feb 2011 at 6:43
Observation: I think this should go in base, not collect, since it's really not
a collection in any way.
Original comment by wasserman.louis
on 14 Feb 2011 at 9:44
I'm late to this discussion, so please forgive me. If I'm grokking the gist of
the approach (referring to Wasserman's posted impl):
- the user is responsible for associating user-provided objects with Partition
instances
- Partitions never disappear, but are merged as "flatly" as possible
I really like the simplicity, but I have some questions about how one would use
it.
Wouldn't algorithms which start with a state of one object per partition be
somewhat inefficient? By that, I mean that the user of the library will be
keeping around a HashMap< E, Partition >, and there will permanently be just as
many Partitions as there are elements. Although I suppose if the user didn't do
it, then the Partition impl would have to. So maybe this is just shifting the
burden, with the possibility that the user won't care and it wouldn't be needed
at all.
Having run a connected components algorithm, I can easily determine whether any
two vertices are in the same connected component (via the Equivalence). But how
can I determine how many components there are? Or, how can I list the vertices
in each component? Do you need to expose the find() method to allow answering
questions like this, or is the intent that the user should be keeping track of
this kind of information? Decrementing the number of components every time a
combine() is successful is easy enough, and Sets.union() or similar handles the
latter question (with the user keeping track of all the Sets), but this does
seem like exactly the kind of thing someone might want to use Partition for.
Original comment by ray.a.co...@gmail.com
on 4 May 2011 at 12:10
"- the user is responsible for associating user-provided objects with Partition
instances"
Yes. Note that a more efficient alternative to Map<E, Partition>, is a
Partition field in E itself.
"- Partitions never disappear, but are merged as "flatly" as possible"
They disappear at the discretion of the garbage collector :)
One object per partition is the minimum you can get.
"But how can I determine how many components there are?"
A good start would be Multimaps#index(Iterable<E>, Function<E, Partition>),
which gives a Multimap<Partition, E>. But to turn it into a more useful
Multimap<E, E>, one would require the inverse relation, Map<Partition, E>,
which leads to rather verbose code (trust me, I tried it).
My original (internal) proposal included a "classify" method which for the same
arguments returned Multimap<E, E> instead, directly. It's still up in the air
(seconds ago I re-suggested that, with the return type being Map<E, Set<E>>,
let's see what happens to that). That would answer all your following questions;
exposing find() is still unnecessary.
Original comment by jim.andreou
on 4 May 2011 at 12:28
Maybe a contrived example would help, and you can tell me where my thinking has
gone wrong. I start out with a big graph...
    Map<Node, Partition> map = Maps.newHashMap();
    for (Node node : nodes) {
        map.put(node, Partition.newPartition());
    }

Now, inelegant brute-force connected components:

    int size = nodes.size();
    for (Edge edge : edges) {
        Partition pTail = map.get(edge.tail);
        Partition pHead = map.get(edge.head);
        if (pTail.combine(pHead)) {
            size--;
        }
        if (size == 1) {
            break;
        }
    }
The calculation is done, and the Partition Equivalence would work. However,
there are still 1M Partition instances, one per node. From the information I
have, I cannot construct a Function<E, Partition> that is "normalized". So
Multimaps.index() wouldn't be useful.
This edit reduces the number of Partitions in use by one, allowing pHead to be
gc'd, but only if pHead happens to contain only the one node:

    if (pTail.combine(pHead)) {
        map.put(edge.head, pTail);
        size--;
    }
As near as I can tell, the only way to keep track of this information is to
maintain an actual Map<Partition, Set<E>> during the algorithm, in addition to
the existing Map. A single graph data structure would be simpler, and then I
would effectively be performing the exact same function as Partition, and I
don't need the class at all.
What am I missing?
Original comment by ray.a.co...@gmail.com
on 4 May 2011 at 6:31
With Multimaps#index you would get a Multimap<Partition, E>. Just iterate over
the <Partition, E> entries, and nominate the key Partition as the "canonical"
one for the value E. Seems good enough.
Original comment by jim.andreou
on 4 May 2011 at 6:59
Not true! After a combine() is performed, the only way to tell the
Partition instances apart is object identity, which Multimaps.index()
doesn't use. Try it yourself: once Partitions have been combined, they're
equal as far as Object.equals() and Object.hashCode() are concerned, which
is all Multimaps.index() cares about.
Louis
Original comment by wasserman.louis
on 4 May 2011 at 8:33
I don't understand who you are responding to and what is "not true", could you
clarify?
Original comment by jim.andreou
on 4 May 2011 at 8:53
Sorry, I tried to respond in an email instead of in the comment interface.
> The calculation is done, and the Partition Equivalence would work. However,
> there are still 1M Partition instances, one per node. From the information I
> have, I cannot construct a Function<E, Partition> that is "normalized". So
> Multimaps.index() wouldn't be useful.
Not true! After a combine() is performed, the only way to tell the
Partition instances apart is object identity, which Multimaps.index()
doesn't use. Try it yourself: once Partitions have been combined, they're
equal as far as Object.equals() and Object.hashCode() are concerned, which
is all Multimaps.index() cares about.
Louis
Original comment by wasserman.louis
on 4 May 2011 at 9:45
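Louis's point can be seen in a self-contained sketch: after combine(), merged Partition instances compare equal and hash alike, so any hash-based grouping (Multimaps.index in Guava, or the plain HashMap below) canonicalizes the components with no explicit find() exposed. The Partition class here is a minimal assumed stand-in for the one under discussion, and group() is an invented helper for illustration.

```java
import java.util.*;

final class GroupingDemo {
    // Minimal stand-in: equals()/hashCode() delegate to the root, which
    // is exactly what makes hash-based grouping work after combine().
    static final class Partition {
        private Partition parent = this;

        private Partition find() {
            Partition p = this;
            while (p.parent != p) {
                p.parent = p.parent.parent; // path halving
                p = p.parent;
            }
            return p;
        }

        boolean combine(Partition other) {
            Partition r1 = find(), r2 = other.find();
            if (r1 == r2) return false;
            r2.parent = r1;
            return true;
        }

        @Override public boolean equals(Object o) {
            return o instanceof Partition && find() == ((Partition) o).find();
        }

        @Override public int hashCode() {
            return System.identityHashCode(find());
        }
    }

    // Groups elements into connected components via their Partitions;
    // combined Partitions collide in the HashMap, so each component
    // ends up under one canonical key.
    static <E> Map<Partition, List<E>> group(Map<E, Partition> partitions) {
        Map<Partition, List<E>> groups = new HashMap<>();
        for (Map.Entry<E, Partition> e : partitions.entrySet()) {
            groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                  .add(e.getKey());
        }
        return groups;
    }
}
```

Note this is only safe once all combine() calls are done: merging two partitions changes their hash codes, so they must not be sitting in a hash-keyed map while it happens.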
[slaps forehead, utters "Doh!", reaches for beer...]
Original comment by ray.a.co...@gmail.com
on 6 May 2011 at 9:57
@jim.andreou, does it matter if there is a race between two threads
comparing/updating a rank? If I understand the code correctly this would
result in a suboptimal tree, but still a valid one.
Original comment by fin...@gmail.com
on 21 May 2011 at 12:03
Yes, it matters; e.g. two threads might try to link a partition to different
partitions, and one would overwrite the other. I think the only way this could
be used in a multithreaded environment would be grabbing a common lock. (Even
locking the partitions themselves, say in Ordering.arbitrary() order, wouldn't
help).
Original comment by jim.andreou
on 21 May 2011 at 6:20
[deleted comment]
[deleted comment]
As far as I can see, the only way to get a *corrupted* tree (as opposed to a
suboptimal one) is that two nodes which should have a common ancestor do not
(e.g. because thread1 sets a.parent = b and thread2, not seeing thread1's
change, sets a.parent = c).
There is no danger of spurious merges, because if two partitions are *not*
meant to be merged then the merge method never sees pointers to both at once.
All you need to ensure is that before a.merge(b) returns, a and b have a common
ancestor (and that all threads agree on the fact.) Anything else is just
optimization. I cannot see any way in which this constraint could be violated
that would not be solved by a compareAndSet() loop.
E.g. this should work:
    public ConcurrentPartition merge(ConcurrentPartition other) {
        ConcurrentPartition root1, root2;
        while ((root1 = other.find()) != (root2 = find())) {
            if (root1.rank < root2.rank) {
                if (root1.parentRef.compareAndSet(root1, root2)) {
                    return root2;
                }
            } else if (root1.rank > root2.rank) {
                if (root2.parentRef.compareAndSet(root2, root1)) {
                    return root1;
                }
            } else {
                if (root2.parentRef.compareAndSet(root2, root1)) {
                    root1.rank++;
                    return root1;
                }
            }
        }
        return root2;
    }
The worst that could happen is that each thread performs only one call to
merge() and no thread sees any other thread's updates to rank (easily simulated
by making rank ThreadLocal). Then you end up with a singly-linked list, but
that would still give you the right answers for equals() and hashCode() and the
path compression would still get a chance to reduce the depth.
You might need global synchronization at a higher level, to ensure that all
merge()s have completed before you start performing comparisons with equals()
and hashCode(), but that does not affect the implementation of merge().
Feel free to prove me wrong.
Original comment by fin...@gmail.com
on 22 May 2011 at 12:14
A.rank is 1
B.rank is 0
C.rank is 1
Thread1:
merge(A, B) - computes that B.parent should point to A
Thread2:
merge(B, C) - computes that B.parent should point to C
If these ran concurrently, each thread is free to write its own opinion,
ignoring (perhaps without even seeing, due to lack of visibility guarantees)
the other thread's update. E.g. the 'second' thread would update B, thinking it
the root, instead of updating the new root.
Original comment by jim.andreou
on 22 May 2011 at 2:54
No problem there with the compareAndSet implementation:
Thread1 computes that B.parent should point to A (and the previous root is B)
Thread2 computes that B.parent should point to C (and the previous root is B)
Now the following are executed in some order:
Thread1: B.parentRef.compareAndSet(B, A)
Thread2: B.parentRef.compareAndSet(B, C)
At least one will fail, and will go round the loop again & find the new root
e.g. if Thread2 was attempting merge(B, C) it will now attempt merge(A, C)
instead.
The result will be

      A
     / \
    B   C

which is correct.
What appears to be more problematic is:
A.rank == B.rank
Thread1: merge(A, B)
Thread2: merge(B, A)
Thread 1 determines that A.parent should point to B, while thread 2 determines
that B.parent should point to A.
Now both threads perform their updates, creating a cycle (compareAndSet does
not prevent this, since the pointers are not synchronized with each other.)
But even then, the find() loop terminates (thanks to the compression step.) I
have tested this with random interleaving (i.e. inserting sleeps with random
delays) and still always got the right answer so far.
Original comment by fin...@gmail.com
on 22 May 2011 at 3:19
OK I found a way to break that method.
First create a cycle [A->B->C->D->A], e.g. by calling {A.merge(B); B.merge(C);
C.merge(D); D.merge(A)} in parallel.
Now call A.find() in thread1 and D.find() in thread2.
Here is one possible execution order:
- thread1 observes A->B->C
- thread2 observes D->A->B
- thread1 sets A.parent = C
- thread1 observes B->C->D
- thread1 sets B.parent = D
- thread1 observes C->D->A
- thread1 sets C.parent = A
- thread2 sets D.parent = B
- thread2 observes A->C->A
- thread2 sets A.parent = A and returns A
- thread1 observes D->B->D
- thread1 sets D.parent = D and returns D
Now the cycle has been split into two (C->A and B->D)
Original comment by fin...@gmail.com
on 22 May 2011 at 9:22
By the way, I'd like to share a curious usability issue of this class (well,
not exactly of *this* class, as you'll see, but a wart nevertheless).
If a user object doesn't share the same lifecycle as its Partition object, then
the user has to fallback into using a Map<E, Partition>. Now, if the user can
iterate over all the elements beforehand, he can eagerly initialize that Map
with Partition objects. But it seems common to instead have a list of edges
(element pairs) to be merged, so the user who ends up with a Map<E, Partition>
will probably end up lazily initializing that map too. The problem is we
don't have a concise way to have a default-value'd map -
MapMaker#makeComputingMap easily takes 5-6 lines. And the return type is not
that satisfying: ConcurrentMap<E, Partition>. I played with the idea of adding
a method "newPartitionMap()", returning a (computing) Map<E, Partition>,
internally using the MapMaker with concurrencyLevel==1. But Map<E, Partition>
is a poor type for a computing map. Function<E, Partition> (Kevin's suggestion)
is more correct, in that it doesn't expose a surprising Map, but then if you
want to iterate over your E's in the end (e.g. to see which partition they ended
up in), code using a Function<E, Partition> would also be quite verbose, since
an extra Set<E> would be needed.
I wonder what the community thinks of this issue.
Original comment by jim.andreou
on 22 May 2011 at 9:38
I'm leaning towards being conservative here: I think almost all of this
discussion is getting ahead of itself. I think the current version of
Partition fulfills most users' needs -- I've actually used this version for
several things myself without any further changes -- and I'm very strongly
inclined to wait until actual use cases occur in the wild for concurrency
support, lazy initialization, etc.
Original comment by wasserman.louis
on 22 May 2011 at 11:56
Completely agreed about concurrency. Regarding lazy initialization, well, I've
seen it in three cases "in the wild" (of googleland), and at least two really
required it (think an incoming stream of edges, of an unknown set of nodes -
you don't want to buffer/iterate the edges twice to discover/initialize the
nodes).
Some people don't like the verbosity of user code doing a lazy initializing
Map<E, Partition> either manually or by a computing map. Hard to satisfy this
crowd. In my opinion, judging Partition negatively in this respect is rather
unfair, since it isn't its job to offer a proper lazy-initializing map (imagine
if any class that might be used together with a lazy-initializing map, was
hard-coding one of these internally to shave few lines of user code - that
would be solving the same problem again and again, all over the place, instead
of once). Yet, there is some other internal union/find implementation that does
just that (basically something like a Partition<E> together with a lazy
initializing LinkedHashMap<E, Partition<E>>), and people prefer the resulting
user code, effectively rewarding a bad data structure design choice :) Such is
life sometimes...
Original comment by jim.andreou
on 23 May 2011 at 12:21
Original comment by kevinb@google.com
on 13 Jul 2011 at 6:18
Original comment by kevinb@google.com
on 16 Jul 2011 at 8:32
We never got around to this. Should it be revived?
Original comment by wasserman.louis
on 20 Sep 2011 at 6:50
Oh. I had a dormant change list, forgot about it. The main problem in user code
is not related to Partition - it's that it's just Too Hard (TM) to make a
default value map in java (even with guava, unless that default value happens
to be a collection).
To take some verbosity out, basically, I meant to introduce a
Partition#newPartitionMap(), returning a Map<E, Partition>, where the user
wouldn't do the little "is the key in the map? if not, add this entry" dance.
Kevin had proposed a Function<E, Partition> instead of default-valued map, but
this wouldn't do since a common operation turned out to be accessing the
keySet() of said map.
It was a complicated discussion, taking more time than it was worth, so I put
it on the back burner.
Original comment by andr...@google.com
on 20 Sep 2011 at 6:57
I actually have some thoughts on this I will try to pull together soon.
Original comment by kevinb@google.com
on 21 Sep 2011 at 1:35
Original comment by fry@google.com
on 10 Dec 2011 at 4:01
Proposal:
Introduce a Partitioning<E> class that manages the "default-valued map" details
internally. If we want to support key-set iteration, we can do that, but
manage it internally.
Partitioning<E> supports Partition getPartition(E e), which returns an object
of the Partition type we'd been playing with earlier. Additionally it supports
inSamePart(E e1, E e2), and all of that.
For the moment, Partitioning locks the whole thing when doing any mutations.
If we design a more effective concurrent version of the structure later, then
that's great, but we guarantee thread-safety now (with a qualifier about
performance) and try to optimize later.
Original comment by wasserman.louis
on 17 Jan 2012 at 3:54
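A minimal sketch of the Partitioning<E> proposal above: the lazily populated Map<E, Partition> is internal, and every mutation grabs the instance lock, trading performance for guaranteed thread-safety as described. getPartition and inSamePart come from the comment; combine and elements are invented names for illustration.

```java
import java.util.*;

final class Partitioning<E> {
    // Minimal internal union/find node; the self-loop marks the root.
    static final class Partition {
        private Partition parent = this;

        Partition find() {
            Partition p = this;
            while (p.parent != p) {
                p.parent = p.parent.parent; // path halving
                p = p.parent;
            }
            return p;
        }
    }

    private final Map<E, Partition> partitions = new HashMap<>();

    // Lazily creates a singleton part for an unseen element, so the
    // caller never does the "is the key in the map?" dance.
    public synchronized Partition getPartition(E e) {
        return partitions.computeIfAbsent(e, k -> new Partition());
    }

    // Coarse locking for now; a cleverer concurrent version could
    // replace this later without changing the API.
    public synchronized void combine(E e1, E e2) {
        getPartition(e2).find().parent = getPartition(e1).find();
    }

    public synchronized boolean inSamePart(E e1, E e2) {
        return getPartition(e1).find() == getPartition(e2).find();
    }

    // Snapshot of all elements seen so far (key-set iteration support).
    public synchronized Set<E> elements() {
        return new HashSet<>(partitions.keySet());
    }
}
```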
For reference: Wikipedia links to
http://citeseer.ist.psu.edu/anderson94waitfree.html, "Wait-free Parallel
Algorithms for the Union Find Problem." It sounds, well, pretty much exactly
like what we wanted?
Original comment by wasserman.louis
on 17 Jan 2012 at 4:34
Original comment by kevinb@google.com
on 30 May 2012 at 7:43
Original issue reported on code.google.com by
ogregoire
on 12 Jan 2011 at 11:58