We should use the Spark variable `spark.default.parallelism` instead of our custom function `r4ml.calc.num.partitions()` to calculate the number of partitions when converting a `data.frame` to an `r4ml.frame`.

From the Spark documentation:
For distributed shuffle operations like `reduceByKey` and `join`, the largest number of partitions in a parent RDD. For operations like `parallelize` with no parent RDDs, it depends on the cluster manager:

- Local mode: number of cores on the local machine
- Mesos fine grained mode: 8
- Others: total number of cores on all executor nodes or 2, whichever is larger
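If the cluster-manager default is not suitable, the property can also be set explicitly when the SparkR session is created. A minimal sketch (the value `16` is purely illustrative):

```r
library(SparkR)

# Start a session with an explicit default parallelism; jobs created from
# this session will pick up this value wherever spark.default.parallelism
# applies (e.g. parallelize with no parent RDDs).
sparkR.session(sparkConfig = list(spark.default.parallelism = "16"))
```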
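A minimal sketch of the proposed change, assuming SparkR's `sparkR.conf()` is available on the active session (the helper name `r4ml.default.parallelism` and the fallback value of `2` are assumptions, not existing R4ML code):

```r
library(SparkR)

# Hypothetical replacement for r4ml.calc.num.partitions(): read the partition
# count Spark itself would use rather than computing our own estimate.
r4ml.default.parallelism <- function() {
  # sparkR.conf() returns the config value as a string; "2" mirrors the
  # documented lower bound for spark.default.parallelism and serves as the
  # fallback when the property is not set on the session.
  as.integer(sparkR.conf("spark.default.parallelism", "2"))
}

# Hypothetical usage when converting a data.frame to an r4ml.frame:
# hf <- as.r4ml.frame(df, numPartitions = r4ml.default.parallelism())
```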