CODAIT / r4ml

Scalable R for Machine Learning
Apache License 2.0

Use spark.default.parallelism to calculate default number of partitions #45

Closed · bdwyer2 closed this 6 years ago

bdwyer2 commented 7 years ago

Issue #44

We should use the Spark configuration property spark.default.parallelism, instead of our custom function r4ml.calc.num.partitions(), to calculate the number of partitions when converting a data.frame to an r4ml.frame (a sketch follows the excerpt below).

From the Spark documentation:

For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:

  • Local mode: number of cores on the local machine
  • Mesos fine grained mode: 8
  • Others: total number of cores on all executor nodes or 2, whichever is larger
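
A minimal sketch of what this could look like. sparkR.conf() is a real SparkR API (Spark 2.0+); the signature of r4ml.calc.num.partitions() shown here and the 128 MB target partition size are assumptions for illustration, not the function's actual interface:

```r
# Sketch: let Spark's own spark.default.parallelism drive the partition
# count instead of a hand-rolled core-count heuristic.
library(SparkR)

# Hypothetical replacement body for r4ml.calc.num.partitions(); the real
# signature may differ. partition.size.bytes (128 MB) is an assumed target.
r4ml.calc.num.partitions <- function(object.size.bytes,
                                     partition.size.bytes = 128 * 1024^2) {
  # Fall back to 2 when the property is unset, mirroring Spark's own
  # "total cores or 2, whichever is larger" rule quoted above.
  default.parallelism <- as.integer(
    sparkR.conf("spark.default.parallelism", "2")
  )
  # Use at least Spark's default parallelism; more for large objects.
  max(default.parallelism,
      as.integer(ceiling(object.size.bytes / partition.size.bytes)))
}
```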
R4ML-CI commented 7 years ago

Build triggered.

R4ML-CI commented 7 years ago

Build success. All unit tests passed.

bdwyer2 commented 7 years ago

@aloknsingh if the user sets master = local[3] even though they have 8 cores available, then defaultParallelism will be 3, but we will assume that all 8 cores are available.

I think at the very least we should use defaultParallelism instead of parallel::detectCores().
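
A hypothetical repro of the mismatch, assuming an 8-core machine; sparkR.session() and sparkR.conf() are real SparkR APIs, and the values in the comments are what we would expect, not captured output:

```r
library(SparkR)

# Ask Spark for only 3 cores on an 8-core machine.
sparkR.session(master = "local[3]")

# What r4ml currently assumes: every core on the machine.
parallel::detectCores()   # 8 on this hypothetical machine

# What Spark actually plans around. In local mode defaultParallelism is
# derived from the master URL (3 here); the config property itself may be
# unset unless the user set it explicitly, hence the fallback value.
sparkR.conf("spark.default.parallelism", "unset (derived from master URL)")
```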

R4ML-CI commented 6 years ago

Build triggered.

R4ML-CI commented 6 years ago

Build success. All unit tests passed.