Update the way we do sampling without replacement

ddf-project / DDF

Distributed DataFrame: Productivity = Power x Simplicity For Scientists & Engineers, on any Data Engine

http://ddf.io

Apache License 2.0

167 stars 42 forks source link

Update the way we do sampling without replacement #333

Closed nhanitvn closed 8 years ago

nhanitvn commented 8 years ago

Description and related tickets, documents

due to the fact that SparkSQL does not allow rand() to accept a negative seed value in order by clause

Reviewers: @hai-adatao

Breaking changes & backward compatible issues

How to test

Describe how this PR is tested. In case manual testing is required, describe how to do so.

PR Progress

Make sure all checkboxes below are checked before merged

[x] Branch is in format prefix/description (see this)
[x] Merge check has no conflicts. PR checks passed.
[x] Code review is done.

Change-Id: I34ebed02c4ff615aae99a30cd68c3e1104ee3dc0

hai-adatao commented 8 years ago

I don't really understand the fix, why this involves a sub query? Does it affect the performance of sample? Can we just disallow negative seed or deprecate it and workaround by shifting / getting abs value instead and create ticket for BA to migrate it?

On another thought, what does this seed do? Does it guarantee that we will get the same result if we use the same seed? If not can we just get its abs value instead?

nhanitvn commented 8 years ago

You cannot run this in SparkSQL "select * from table order by rand(-123)" while these sql work

select * from table order by rand(123)
select * from (select *, rand(-123) as rnd from table) t order by rnd The 2nd query does not affect the performance because Spark will generate just the same number of stages like the one with no subquery.

hai-adatao commented 8 years ago

We're fixing that by introducing a subquery, it will affect performance right? Also, my opinion is negative seed doesn't make sense, so we can just:

use abs(seed) before using it now, and create a ticket for BA to migrate negative seed
once BA is done, add error message to verify negative seed
or we can just disallow negative seed in DDF now, and do abs() stuff in client

If negative seed really makes sense, then I think do an if check and only on sub query when seed is negative is better

nhanitvn commented 8 years ago

I did a quick benchmark and the subquery does increase the running time (add 1/2x to x where x is the running time of the sql using no subquery). So I refactor to handle the negative seed case separately. Performing an abs() is not a good one, I think, though it is not that critical.