discoproject / disco

a Map/Reduce framework for distributed computing
http://discoproject.org
BSD 3-Clause "New" or "Revised" License
1.63k stars 241 forks

Ability to Designate Reduce-only Nodes (or Map-only Nodes) #612

Open tigerite opened 9 years ago

tigerite commented 9 years ago

Hello

I am looking for a way to set certain nodes in my cluster as "reduce only" nodes, i.e. nodes that are only available for executing the reduce stage of jobs.

Conversely, there could be an option to set "map only" nodes, i.e. nodes that are only available for executing the map stage of jobs.

In my cluster, I have two kinds of servers: one set of high performance servers for executing the heavy computations in the map stage, and another set of lower performance servers suitable for executing the less complex reduce stage.

So I don't want the high performance servers to be wasted executing the reduce stage of my jobs.

Disco 0.5.4 does not have this feature. If someone can point me to the code that selects the node to execute the reduce stage of a job, it would be greatly appreciated.

I don't believe this should be complex to add:

  1. Add configuration settings for designating reduce-only and map-only nodes.
  2. When selecting a node for either stage, the disco master only considers nodes in the set designated for that stage.
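
A minimal sketch of the two steps above, in plain Python rather than in Disco's own master code (which is written in Erlang). Every name here (`MAP_ONLY_NODES`, `REDUCE_ONLY_NODES`, `eligible_nodes`, `select_node`) is hypothetical and does not exist in Disco, and picking the least-loaded candidate is only an assumed tie-breaking policy for illustration:

```python
# Hypothetical configuration: these settings do NOT exist in Disco.
MAP_ONLY_NODES = {"fast1", "fast2"}       # high performance servers
REDUCE_ONLY_NODES = {"slow1", "slow2"}    # lower performance servers


def eligible_nodes(stage, available_nodes):
    """Return the nodes allowed to run `stage`, keeping the input order.

    A node that appears in neither set may run both stages.
    """
    if stage == "map":
        excluded = REDUCE_ONLY_NODES
    elif stage == "reduce":
        excluded = MAP_ONLY_NODES
    else:
        raise ValueError("unknown stage: %r" % (stage,))
    return [node for node in available_nodes if node not in excluded]


def select_node(stage, available_nodes, load):
    """Pick the least-loaded eligible node, or None if there is none.

    `load` maps node name -> number of running tasks (assumed policy).
    """
    candidates = eligible_nodes(stage, available_nodes)
    if not candidates:
        return None
    return min(candidates, key=lambda node: load.get(node, 0))
```

With nodes `["fast1", "slow1", "any1"]`, the map stage would be limited to `fast1` and `any1`, and the reduce stage to `slow1` and `any1`.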

Thanks in advance!

pooya commented 9 years ago

Hi, the code that chooses a node is in job_coordinator:do_submit_tasks_in. The choice might be overridden later based on node availability.

Please note that this type of cluster is not very common. If the nodes are not uniform, you can already set the number of workers per node. Moreover, the idea is to push computation to the data. If a map is performed on a node, the output of the map will be on the same node and it makes sense to run reduce on the same node to avoid shipping the data to another node.
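
The locality argument can be illustrated with a rough sketch. This is not how Disco schedules tasks; it is a simplified model with a single reduce task, and the function names and byte counts are made up for illustration:

```python
def bytes_shipped(map_output_bytes, reduce_node):
    """Bytes that must cross the network if reduce runs on `reduce_node`.

    `map_output_bytes` maps node name -> size of the map output stored
    on that node. Output already on `reduce_node` needs no transfer.
    """
    return sum(size for node, size in map_output_bytes.items()
               if node != reduce_node)


def best_reduce_node(map_output_bytes, allowed_nodes):
    """Among `allowed_nodes`, pick the one that minimizes data shipped."""
    return min(allowed_nodes,
               key=lambda node: bytes_shipped(map_output_bytes, node))


# Made-up example: map ran on two fast nodes, "slow1" holds no output.
outputs = {"fast1": 800, "fast2": 200}
local_cost = bytes_shipped(outputs, "fast1")    # only fast2's output moves
remote_cost = bytes_shipped(outputs, "slow1")   # all map output moves
```

In this model a reduce-only node always pays for moving the entire map output, which is the cost being weighed against freeing up the faster servers.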

tigerite commented 9 years ago

Thank you pooya!

I think the best solution in this case is to not even have a reduce function (stage).

Can you confirm that Disco will just return the results of the map function to the master, without running a no-op shuffle and reduce?