CODAIT / r4ml

Scalable R for Machine Learning
Apache License 2.0

Use spark.default.parallelism to calculate default number of partitions #45

Closed · bdwyer2 closed this 6 years ago

bdwyer2 commented 7 years ago

Issue #44

We should use the Spark configuration property spark.default.parallelism, instead of our custom function r4ml.calc.num.partitions(), to calculate the number of partitions when converting a data.frame to an r4ml.frame (a sketch follows the excerpt below).

From the Spark documentation:

For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:

  • Local mode: number of cores on the local machine
  • Mesos fine grained mode: 8
  • Others: total number of cores on all executor nodes or 2, whichever is larger
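
A minimal sketch of what this could look like. sparkR.conf() is a real SparkR API (Spark 2.0+); the signature of r4ml.calc.num.partitions() shown here and the 128 MB target partition size are assumptions for illustration, not the function's actual interface:

```r
# Sketch: let Spark's own spark.default.parallelism drive the partition
# count instead of a hand-rolled core-count heuristic.
library(SparkR)

# Hypothetical replacement body for r4ml.calc.num.partitions(); the real
# signature may differ. partition.size.bytes (128 MB) is an assumed target.
r4ml.calc.num.partitions <- function(object.size.bytes,
                                     partition.size.bytes = 128 * 1024^2) {
  # Fall back to 2 when the property is unset, mirroring Spark's own
  # "total cores or 2, whichever is larger" rule quoted above.
  default.parallelism <- as.integer(
    sparkR.conf("spark.default.parallelism", "2")
  )
  # Use at least Spark's default parallelism; more for large objects.
  max(default.parallelism,
      as.integer(ceiling(object.size.bytes / partition.size.bytes)))
}
```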
R4ML-CI commented 7 years ago

Build triggered.

R4ML-CI commented 7 years ago

Build success. All unit tests passed.

bdwyer2 commented 7 years ago

@aloknsingh if the user sets master = local[3] even though they have 8 cores available, then defaultParallelism will be 3, but we will assume that all 8 cores are available.

I think at the very least we should use defaultParallelism instead of parallel::detectCores().
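
A hypothetical repro of the mismatch, assuming an 8-core machine; sparkR.session() and sparkR.conf() are real SparkR APIs, and the values in the comments are what we would expect, not captured output:

```r
library(SparkR)

# Ask Spark for only 3 cores on an 8-core machine.
sparkR.session(master = "local[3]")

# What r4ml currently assumes: every core on the machine.
parallel::detectCores()   # 8 on this hypothetical machine

# What Spark actually plans around. In local mode defaultParallelism is
# derived from the master URL (3 here); the config property itself may be
# unset unless the user set it explicitly, hence the fallback value.
sparkR.conf("spark.default.parallelism", "unset (derived from master URL)")
```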

R4ML-CI commented 6 years ago

Build triggered.

R4ML-CI commented 6 years ago

Build success. All unit tests passed.