RevolutionAnalytics / dplyr-spark

spark backend for dplyr

implement sample_n and sample_frac #2

Open piccolbo opened 9 years ago

piccolbo commented 9 years ago

Needed here because hadley declined to implement this in dplyr: https://github.com/hadley/dplyr/issues/592

n = 
sample_n:
  select * from tbl order by rand() limit n
Bernoulli sampling (approximation for sample_frac):
  select * from tbl where rand() < frac

Obviously we can only do this for the Spark SQL source.
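
To make the difference between the two queries concrete, here is a minimal sketch that runs them as written against an in-memory SQLite table. It is not dplyr-spark code. SQLite is only a stand-in for Spark SQL, the `rand()` function is registered by hand because SQLite does not provide one, and the helper names `make_table`, `sample_n` and `sample_frac` are hypothetical.

```python
import random
import sqlite3


def make_table(n_rows, seed=0):
    """Build an in-memory SQLite table `tbl` with ids 0..n_rows-1 (stand-in for a Spark table)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    # Assumption: emulate a rand() that returns a uniform float in [0, 1).
    # SQLite has no rand(), so we register a Python callable under that name.
    conn.create_function("rand", 0, rng.random)
    conn.execute("create table tbl (id integer)")
    conn.executemany("insert into tbl values (?)", [(i,) for i in range(n_rows)])
    return conn


def sample_n(conn, n):
    """Exact-size sample: shuffle all rows by a random key, keep the first n."""
    return conn.execute("select * from tbl order by rand() limit ?", (n,)).fetchall()


def sample_frac(conn, frac):
    """Bernoulli sample: keep each row independently with probability frac."""
    return conn.execute("select * from tbl where rand() < ?", (frac,)).fetchall()
```

With this setup, `sample_n(conn, 10)` always returns exactly 10 rows (or the whole table if it is smaller), at the cost of sorting every row. `sample_frac(conn, 0.3)` needs no sort, but the number of rows it returns varies from run to run around `0.3 * nrow`, which is why it is only an approximation of sample_frac.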