Looks pretty straightforward, thanks. We have a bunch of datasets and configs to try it out on.
I wish I knew how the initial number of partitions is chosen. SparkContext has defaultMinPartitions, but I don't see any way to set it globally or to apply it to newAPIHadoopRDD. I believe controlling that would avoid some extra I/O in Stage 0.
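For what it's worth, newAPIHadoopRDD doesn't take a minPartitions argument at all; the initial partition count comes from the InputFormat's splits. One workaround I've seen is to shrink the max split size in the Hadoop Configuration so the input gets cut into more splits. A minimal sketch, assuming a FileInputFormat-based source; the input path, TextInputFormat, and the 32 MB split size are placeholders, not values from this thread:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Hypothetical input path and split size -- substitute your own.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///events")
// Smaller max split size => more input splits => more initial partitions.
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 32L * 1024 * 1024)

val eventsRDD = sc.newAPIHadoopRDD(
  conf,
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])

println(s"initial partitions: ${eventsRDD.getNumPartitions}")
```

Whether this helps depends on the InputFormat actually honoring those split settings, so treat it as something to experiment with rather than a guaranteed fix.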
By repartitioning the eventsRDD (roughly as in the sketch below), I was able to greatly reduce my overall training time when using more than 3 executors. Stage 0 takes longer now while the repartitioning happens, but subsequent stages run much faster, and the load is spread across the cluster more evenly. I can train on about 25 million events in roughly 15 minutes total.
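In case it helps anyone else, here is roughly what I mean. This is a sketch, not my exact code: the executor counts and the "4 partitions per core" target are assumptions to tune for your cluster, not values from this thread.

```scala
// Assumed cluster shape -- adjust to match your deployment.
val numExecutors = 4
val coresPerExecutor = 4
// Rule of thumb: a few partitions per core so tasks stay small and balanced.
val targetPartitions = numExecutors * coresPerExecutor * 4

// Pay the shuffle cost once in Stage 0; later stages reuse the cached,
// well-distributed partitions.
val repartitioned = eventsRDD.repartition(targetPartitions).cache()

println(s"partitions after repartition: ${repartitioned.getNumPartitions}")
```

The one-time shuffle in Stage 0 is what makes that stage slower, but every downstream stage then gets enough partitions to keep all executors busy.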