NICTA / scoobi

A Scala productivity framework for Hadoop.
http://nicta.github.com/scoobi/
482 stars 97 forks source link

Try improve DList.shuffle for collisions #245 #247

Closed espringe closed 11 years ago

espringe commented 11 years ago

Pull with caution, as I'm not sure how util.Random.shuffle will play with the single-pass iterator. If you run the unit tests on the cluster mode, it should fail if it doesn't work though xD

espringe commented 11 years ago

^ Strange test failure. Unrelated?

blever commented 11 years ago

I'm pretty sure you will hit the single-pass iterator problem here. To do that shuffle, you really need to bring all the values for a given partition into memory (which the code isn't explicitly doing) and this could also be very large such that it won't fit into memory anyway.

I think your update is attempting to tackle two things: dealing with collisions, and removing the overhead of a synthesised key in the sort-and-shuffle.

In my mind, I don't see how collisions can be a problem. Agree that they will occur but the collisions will be between random sets of elements. Therefore I don't think it matters. Saying that, I'm not a stats expert :)

Re, the overhead, I think this is a fair point. I like the idea of using () for the key with a custom Grouping. Could we extend your solution to implement sortCompare with a call to Random? Alternatively, could we simply make a trade-off and use a Char for the key, rather than an Int?

espringe commented 11 years ago

Yeah, that shuffle does bring everything into memory -- which I agree is pretty nasty.

Could we extend your solution to implement sortCompare with a call to Random

Nah, it'll only semi-shuffle things -- but very very far from randomly. There's been some pretty high profile bugs of people doing that though (the MS browser ballot failure, they used a sortBy(rand) ) [Basically most sorting algorithms are lazy, so things get moved the bare minimum assuming they had a total ordering]

However, if you're not interested in a statistical random shuffle (e.g. you just want to look at the data, and not see a particularly biased section) -- i think it would be very useful. Maybe we could call it shuffleFastAndPoorly which would send it to a random partition, and then used your described sortCompare (We would need to check though, that hadoop supports this. As some sorting algorithms can crash when they detect inconsistent total ordering)

Alternatively, could we simply make a trade-off and use a Char for the key, rather than an Int?

I didn't realize at the time, but Int is actually too small. The problem is, that you are virtually guaranteed key collisions (you only need like 80k elements, for a >50% chance of a collision) -- and during those collisions, the original order of the elements are being preserved. Its no big deal for almost all cases I can think of, but it feels really terrible, and is quite detectable.

So I kind of think its better to err on the side of caution and use a long/double again which has 64 bit keys, thus a lottttt less collisions (and virtually guaranteed to not happen on the same mapper, which is what is currently screwing up the Int, as that guarantees relative order being preserved ).

blever commented 11 years ago

Using Int, I agree that we will be virtually guaranteed key collisions, but the collisions are unlikely to occur for adjacent elements. Therefore, when they do occur it'll simply bring sort elements that were far apart close together, somewhat shuffling them. Yes, they will retain their initial relative ordering but maybe that could be mitigated by also using a random partitioner, although this solution down if you have few reducers (but if you have a lot of data you've probably got a lot of reducers).

So, maybe, the best thing would be:

What do you think?

espringe commented 11 years ago

Agree on all points. I'll do that as a PR tomorrow, or if you want to do it earlier feel free :D

(Well, I don't think .shuffleFastButPoorly would confuse anyone, but it is a bit ugly looking -- so your call, I'll leave it out)

On Wed, May 15, 2013 at 6:23 PM, Ben Lever notifications@github.com wrote:

Using Int, I agree that we will be virtually guaranteed key collisions, but the collisions are unlikely to occur for adjacent elements. Therefore, when they do occur it'll simply bring sort elements that were far apart close together, somewhat shuffling them. Yes, they will retain their initial relative ordering but maybe that could be mitigated by also using a random partitioner, although this solution down if you have few reducers (but if you have a lot of data you've probably got a lot of reducers).

So, maybe, the best thing would be:

  • synthesize a random Long key
  • use a random partitioner
  • just a single shuffle method - multiple methods will just confuse the user

What do you think?

— Reply to this email directly or view it on GitHubhttps://github.com/NICTA/scoobi/pull/247#issuecomment-17975886 .

blever commented 11 years ago

If you don't mind putting a PR together that would be great!

espringe commented 11 years ago

Actually, before I do -- what do you think of this approach: https://github.com/NICTA/scoobi/commit/b19d7dcf2bec0eb36f21967aa12c75cccff41be9

Basically, instead of bringing everything into memory and shuffling it in the reducer -- it only shuffles the collisions. Assuming the single-iterator doesn't bite us, I feel like this is an optimal approach, we can get away with a 32 bit key, instead of a 64 bit one. When deciding which partition to use, we only needed 32 bit of randomness created, and most importantly its nice and simple, and very clear what it does

espringe commented 11 years ago

And here is a kind of fun way of solving it too:

https://github.com/NICTA/scoobi/commit/85f98ada7ed78ea16883325ea74c2de974ec8c3b

Basically send to a random partition, and in the case of a key-collision use the sort to semi-randomize it. It'll definitely handle 2 collisions fine, 3 I think. But 4 or more, there will bias problems (But it seems to me, that the probability of getting 4 key-collisions in a reducer is so remote, that its not even worth considering)

I did some checking, and hadoop has no problem with abusing the comparator like this -- and doesn't throw any errors or anything. But it is perhaps an implementation detail, that might not be too wise to abuse. But this is my favourite approach :D

blever commented 11 years ago

I think that's a good trade off. Are we certain the correct Grouping[Int] will be picked up? Could use groupByWith?

On 17/05/2013, at 12:51 AM, Eric Springer notifications@github.com wrote:

And here is a kind of fun way of solving it too:

85f98ad

Basically send to a random partition, and in the case of a key-collision use the sort to semi-randomize it. It'll definitely handle 2 collisions fine, 3 I think. But 4 or more, there will bias problems (But it seems to me, that the probability of getting 4 key-collisions in a reducer is so remote, that its not even worth considering)

— Reply to this email directly or view it on GitHub.

espringe commented 11 years ago

Too easy. Closing this, and will do as a separate PR