cerndb / dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
http://joerihermans.com/work/distributed-keras/
GNU General Public License v3.0
623 stars · 169 forks

Repartition vs coalesce #42

Closed · raviolli closed this 6 years ago

raviolli commented 6 years ago

In your trainer.py code you use repartition. I suggest using coalesce instead, so as not to trigger a full shuffle and incur the extra runtime cost.

trainer.py

        if shuffle:
            dataframe = shuffle(dataframe)
        # Indicate the parallelism (number of workers times the parallelism factor).
        parallelism = self.parallelism_factor * self.num_workers
        # Check if we need to repartition the dataframe.
        if num_partitions >= parallelism:
            # Shrinking the number of partitions: coalesce avoids a full shuffle.
            dataframe = dataframe.coalesce(parallelism)
        else:
            # Growing the number of partitions: a repartition (and its shuffle) is required.
            dataframe = dataframe.repartition(parallelism)
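
For reference, a minimal standalone PySpark sketch (not from trainer.py; the session setup and variable names are illustrative only) showing how the two calls differ:

    # Illustrative sketch, assuming a local Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[4]").getOrCreate()
    df = spark.range(1000).repartition(8)  # start with 8 partitions

    # coalesce merges existing partitions without a shuffle,
    # but it can only decrease the partition count.
    print(df.coalesce(4).rdd.getNumPartitions())   # -> 4
    print(df.coalesce(16).rdd.getNumPartitions())  # -> still 8, cannot grow

    # repartition performs a full shuffle and can grow or shrink the count,
    # producing evenly sized partitions at the cost of moving data.
    print(df.repartition(16).rdd.getNumPartitions())  # -> 16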
raviolli commented 6 years ago

.. cancelled ... it was a suggestion, not a bug. But coalesce isn't perfect either: since it avoids the shuffle, it can leave the resulting partitions unevenly sized.
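
To illustrate that trade-off, a hedged sketch (again not dist-keras code; names are illustrative): coalesce merges whole parent partitions, so the results can be skewed, while repartition shuffles rows into roughly equal partitions. Balanced partitions matter here because each worker trains on its partition of the data.

    # Illustrative sketch of partition-size skew after coalesce vs. repartition.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[4]").getOrCreate()
    df = spark.range(100000).repartition(12)

    # coalesce(5) merges the 12 parent partitions into 5 uneven groups.
    sizes_coalesce = df.coalesce(5).rdd.glom().map(len).collect()
    # repartition(5) shuffles rows and yields roughly equal partition sizes.
    sizes_repartition = df.repartition(5).rdd.glom().map(len).collect()

    print(sizes_coalesce)
    print(sizes_repartition)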