CODAIT / r4ml

Scalable R for Machine Learning

Use spark.default.parallelism to calculate default number of partitions #44

Closed · bdwyer2 closed this issue 6 years ago

bdwyer2 commented 7 years ago

We should use the Spark configuration property spark.default.parallelism instead of our custom function r4ml.calc.num.partitions() to calculate the number of partitions when converting a data.frame to an r4ml.frame. (A sketch of what the lookup could look like follows the quoted documentation below.)

From the Spark documentation:

For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:

  • Local mode: number of cores on the local machine
  • Mesos fine grained mode: 8
  • Others: total number of cores on all executor nodes or 2, whichever is larger
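For illustration, the lookup could look roughly like this from SparkR, which r4ml builds on. The helper name get.default.parallelism and the fallback value are assumptions for this sketch, not part of r4ml's API. Note that sparkR.conf() only sees properties that were explicitly set, so a fallback is supplied for clusters where the value is computed by the cluster manager as described above.

```r
library(SparkR)

# Illustrative helper (not part of r4ml's API): read the
# spark.default.parallelism property from the active Spark session.
# sparkR.conf() returns configuration values as strings, and only
# returns properties that were explicitly set, so a caller-supplied
# fallback covers the computed-default case.
get.default.parallelism <- function(fallback = 2L) {
  value <- sparkR.conf("spark.default.parallelism", as.character(fallback))
  as.integer(value)
}

# e.g. in place of r4ml.calc.num.partitions() at conversion time:
# num.partitions <- get.default.parallelism()
```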