benchflow / analysers

Spark scripts utilised to analyse data and compute performance metrics

Document the Why and the Use of PartitionPerCore #96

Open VincenzoFerme opened 8 years ago

VincenzoFerme commented 8 years ago

As per the title; for reference, see https://github.com/benchflow/analysers/pull/90.

Also discuss how to specify a custom setting for each of the scripts using a map, and why this would be useful for improving performance.
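For illustration, such a per-script map could look like the sketch below. This is only a sketch of the idea: the script names and the `partitions_per_core` key are hypothetical, not an existing benchflow/analysers configuration.

```python
# Hypothetical per-script settings map; script names and the
# "partitions_per_core" key are illustrative, not taken from the repo.
PARTITION_SETTINGS = {
    "throughput": {"partitions_per_core": 2},
    "cpu":        {"partitions_per_core": 4},
    "ram":        {"partitions_per_core": 4},
}

def partitions_for(script_name, cores, default=2):
    """Return the partition count for a script, falling back to a default."""
    per_core = PARTITION_SETTINGS.get(script_name, {}).get("partitions_per_core", default)
    return cores * per_core

print(partitions_for("cpu", cores=8))      # 32
print(partitions_for("unknown", cores=8))  # 16 (default of 2 per core)
```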

Cerfoglg commented 8 years ago

@VincenzoFerme

Partitioning in Spark determines how work is distributed across the executors available to it (typically the cores available to the cluster). To make full use of Spark's parallelization, and thus of all the available resources, it's important to partition the RDDs appropriately: ideally each executor performs enough tasks to keep every core fully busy, but not so many that the overhead of scheduling them ends up wasting time.
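As a minimal PySpark sketch of the partitions-per-core idea (not taken from the analysers scripts; the factor of 2 and the sample data are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext(appName="partition-per-core-example")

# Cores Spark can use across the cluster (in local mode, the local cores).
cores = sc.defaultParallelism

# Rule of thumb: a small multiple of the core count, so every core stays busy
# without creating so many tiny tasks that scheduling dominates.
partitions_per_core = 2
rdd = sc.parallelize(range(1000000), numSlices=cores * partitions_per_core)

print(rdd.getNumPartitions())  # cores * partitions_per_core
```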

For more about partitioning, the Spark documentation is a good place to start; see in particular https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-partitions.html

As for using a configuration file to specify partitioning, it all depends on how the partitioning is carried out in the scripts. It is entirely possible that the best level of optimisation can only be achieved by manually repartitioning to different degrees for each operation, so a single configuration would be difficult to define if every script could require arbitrary settings depending on what it does. I would rather focus on manually optimising the individual scripts, and the functions common to all of them, in this regard.
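To illustrate what "repartitioning to different degrees for each operation" means, here is a hedged sketch (not from the repo; the partition counts are illustrative choices):

```python
from pyspark import SparkContext

sc = SparkContext(appName="per-operation-repartitioning")
cores = sc.defaultParallelism

pairs = sc.parallelize(range(1000000)).map(lambda x: (x % 100, x))

# Before a shuffle-heavy operation, raise the partition count so the work is
# spread across every available core.
summed = pairs.repartition(cores * 4).reduceByKey(lambda a, b: a + b)

# Before collecting or writing a small result, shrink the partition count to
# avoid scheduling many near-empty tasks.
result = summed.coalesce(cores).collect()
print(len(result))  # 100 keys
```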