discoproject / disco

a Map/Reduce framework for distributed computing
http://discoproject.org
BSD 3-Clause "New" or "Revised" License

Heavy DDFS load can cause temporary failures in jobs #191

Open tuulos opened 14 years ago

tuulos commented 14 years ago

If DDFS or the network is loaded excessively while demanding jobs are running, nodes may start dropping out, which causes many temporary failures in jobs.

There should be ways to balance QoS between Disco and DDFS, de-prioritizing the latter, so that network capacity would be more fairly utilized.

tuulos commented 14 years ago

14:33 @tuulos maybe an easy starting point would be to implement some kind of throttling for ddfs
14:33 @tuulos based on load on disco, for instance
14:35 @tuulos a simple approach might be to instrument ddfs_put:receive_body and ddfs_get:send_file, measuring the bytes moved per second, and adding delay as necessary
14:36 @tuulos currently we guard the number of simultaneous connections very effectively, preventing an excessive number of connections being opened
14:36 @tuulos but any individual connection may load the system as much as it can
14:36 @tuulos that could be changed by throttling
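
To make the "measure bytes per second, add delay as necessary" idea concrete, here is a minimal sketch in Python. It is only an illustration of the technique, not DDFS code: the real hooks would live in the Erlang functions `ddfs_put:receive_body` and `ddfs_get:send_file`, and the names `Throttle`, `account`, and `copy_throttled` are hypothetical. The sketch assumes a fixed per-connection byte rate; tying that rate to the current load on Disco, as suggested above, is left out.

```python
import time


class Throttle:
    """Caps average throughput of one connection by sleeping when it runs ahead.

    Hypothetical helper for illustration; not part of Disco or DDFS.
    """

    def __init__(self, max_bytes_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(max_bytes_per_sec)
        self.clock = clock      # injectable so the logic can be tested without waiting
        self.sleep = sleep
        self.start = clock()    # when this connection started moving data
        self.moved = 0          # total bytes moved so far

    def account(self, nbytes):
        """Record nbytes just transferred and sleep if we are ahead of the budget."""
        self.moved += nbytes
        # Earliest time at which `moved` bytes are allowed to have been transferred.
        due = self.start + self.moved / self.rate
        delay = due - self.clock()
        if delay > 0:
            self.sleep(delay)
        return max(delay, 0.0)


def copy_throttled(src, dst, throttle, chunk_size=64 * 1024):
    """Copy file-like src to dst in chunks, charging each chunk to the throttle."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return total
        dst.write(chunk)
        total += len(chunk)
        throttle.account(len(chunk))
```

Because the delay is computed from the total bytes moved since the connection started, a connection that was briefly slow is allowed to catch up, while one that runs ahead of its budget is held back. The existing limit on simultaneous connections would stay in place; a throttle like this only bounds what each individual connection can consume.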