ddf-project / DDF

Distributed DataFrame: Productivity = Power x Simplicity For Scientists & Engineers, on any Data Engine
http://ddf.io
Apache License 2.0

Use Spark DF/RDD APIs for sampling without replacement instead of SQL #348

Closed: nhanitvn closed this 8 years ago

nhanitvn commented 8 years ago

Description and related tickets, documents

Total size of serialized results of 793 tasks (2.0 GB) is bigger than spark.driver.maxResultSize (2.0 GB)
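For context, `spark.driver.maxResultSize` caps the total size of serialized task results sent back to the driver, and the error above is what Spark reports when that cap is exceeded. The PR's diff is not shown here, so the following is only a sketch of the idea in the title: Spark's `sample(withReplacement, fraction, seed)` on a DataFrame or RDD selects rows inside each partition and leaves the result distributed. The plain-Python code below imitates that with per-row Bernoulli sampling; the function names and the per-partition seed scheme are illustrative assumptions, not DDF or Spark code.

```python
import random


def sample_partition(rows, fraction, seed):
    """Bernoulli sampling: keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def sample_without_replacement(partitions, fraction, seed=42):
    """Sample every partition on its own, so no rows have to be gathered in one place.

    The per-partition seed (seed + index) is an assumption made for this sketch.
    """
    return [
        sample_partition(part, fraction, seed + i)
        for i, part in enumerate(partitions)
    ]


# Three hypothetical partitions holding distinct row ids.
partitions = [
    list(range(0, 1000)),
    list(range(1000, 2000)),
    list(range(2000, 3000)),
]
sampled = sample_without_replacement(partitions, fraction=0.1)
flat = [row for part in sampled for row in part]
```

Because each row is considered once, no row can appear twice in the sample, and the sample size is only approximately `fraction` times the input size rather than exact.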

Reviewers: @hai-adatao @phvu @Huandao0812

Breaking changes & backward compatible issues

No

How to test

PR Progress

Make sure all checkboxes below are checked before merging

nhanitvn commented 8 years ago

retest this please

nhanitvn commented 8 years ago

retest this please

hai-adatao commented 8 years ago

Lots of PE and PyClient / RClient tests failed after this merge:
https://ci.arimo.com/job/BE-CI-PE-test/1012/
https://ci.arimo.com/job/BE-CI-RClient-test/232/
https://ci.arimo.com/job/BE-CI-PyClient-test/290/
I'm gonna revert this.