ddf-project / DDF

Distributed DataFrame: Productivity = Power x Simplicity For Scientists & Engineers, on any Data Engine
http://ddf.io
Apache License 2.0
168 stars 42 forks source link

Update the way we do sampling without replacement #333

Closed nhanitvn closed 8 years ago

nhanitvn commented 8 years ago

Description and related tickets, documents

due to the fact that SparkSQL does not allow rand() to accept a negative seed value in order by clause

Reviewers: @hai-adatao

Breaking changes & backward compatible issues

How to test

Describe how this PR is tested. In case manual testing is required, describe how to do so.

PR Progress

Make sure all checkboxes below are checked before merged

Change-Id: I34ebed02c4ff615aae99a30cd68c3e1104ee3dc0

hai-adatao commented 8 years ago

I don't really understand the fix, why this involves a sub query? Does it affect the performance of sample? Can we just disallow negative seed or deprecate it and workaround by shifting / getting abs value instead and create ticket for BA to migrate it?

On another thought, what does this seed do? Does it guarantee that we will get the same result if we use the same seed? If not can we just get its abs value instead?

nhanitvn commented 8 years ago

You cannot run this in SparkSQL "select * from table order by rand(-123)" while these sql work

hai-adatao commented 8 years ago

We're fixing that by introducing a subquery, it will affect performance right? Also, my opinion is negative seed doesn't make sense, so we can just:

If negative seed really makes sense, then I think do an if check and only on sub query when seed is negative is better

nhanitvn commented 8 years ago

I did a quick benchmark and the subquery does increase the running time (add 1/2x to x where x is the running time of the sql using no subquery). So I refactor to handle the negative seed case separately. Performing an abs() is not a good one, I think, though it is not that critical.