actionml / template-scala-parallel-universal-recommendation


Using spark.default.parallelism to repartition eventsRDD #4

Closed · ghost closed this issue 8 years ago

ghost commented 8 years ago

By repartitioning the eventsRDD, I was able to greatly reduce my overall training time when using more than 3 executors. Stage 0 takes more time now while the repartitioning happens, but subsequent stages run much faster for me, and the load is spread across the cluster much more evenly. I can train on 25 million events in about 15 minutes total.
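For reference, a minimal sketch of the idea, assuming a generic events RDD rather than the template's exact load path (the app name, input path, and parallelism value are all illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RepartitionEvents {
  def main(args: Array[String]): Unit = {
    // spark.default.parallelism drives sc.defaultParallelism; a common
    // rule of thumb is 2-3 tasks per executor core across the cluster.
    val conf = new SparkConf()
      .setAppName("ur-train")
      .set("spark.default.parallelism", "48") // illustrative value
    val sc = new SparkContext(conf)

    // Stand-in for the events loaded from the event store.
    val eventsRDD: RDD[String] = sc.textFile("hdfs:///events") // hypothetical path

    // One extra shuffle in Stage 0, so that subsequent stages run with
    // defaultParallelism tasks instead of the handful of input splits.
    val events = eventsRDD.repartition(sc.defaultParallelism)
    println(s"partitions: ${events.partitions.length}")

    sc.stop()
  }
}
```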

pferrel commented 8 years ago

Looks pretty straightforward, thanks. We have a bunch of datasets and configs to try it out on.

ghost commented 8 years ago

I wish I knew how the initial number of partitions is chosen. SparkContext has defaultMinPartitions, but I don't see any way to set it globally or to apply it to newAPIHadoopRDD. I believe that would avoid some extra I/O in Stage 0.
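For what it's worth, defaultMinPartitions isn't a directly settable knob: SparkContext derives it from spark.default.parallelism (capped at 2), and newAPIHadoopRDD never consults it, since its partition count comes from the InputFormat's splits. A sketch of the levers that do exist (the parallelism value and split size are illustrative, and the split-size setting only affects file-based InputFormats; HBase-style inputs split per region instead):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionProbe {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf()
        .setAppName("partition-probe")
        .set("spark.default.parallelism", "48")) // illustrative value

    // defaultMinPartitions is derived, not settable: SparkContext defines
    // it as math.min(defaultParallelism, 2), and it is only honored by
    // readers like textFile/hadoopFile.
    println(s"defaultMinPartitions = ${sc.defaultMinPartitions}")

    // newAPIHadoopRDD ignores it entirely; its partition count comes from
    // the InputFormat's splits. For file-based inputs the split size can
    // be nudged via Hadoop config:
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.maxsize",
      64L * 1024 * 1024) // 64 MB max split size, illustrative

    sc.stop()
  }
}
```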