In large-scale replication, we have to fetch data from several data nodes at once. The problem I have seen occurs when one data node is slower than another, or when one serves much bigger files. Whenever the number of parallel downloads drops below max_parallel_download, Synda starts new downloads with no preference between the slow and the fast data node. But files from the fast node finish much more frequently, so its slots are freed more often, and some of the freed slots are refilled with files from the slow node, where they stay occupied for longer. In practice, as I have observed, all or most of the parallel downloads end up coming from the slower data node. The fast one has to wait for the slow one to finish. This greatly reduces throughput.
The solution is to add a parameter max_parallel_download_per_datanode. Typical values are 4 or 8. No data node is allowed more than that number of simultaneous downloads. If the existing parameter max_parallel_download is larger (I use 50), then many or all data nodes can be active at once.
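To illustrate the idea, here is a minimal sketch of how a scheduler could apply both limits when filling free slots. It is not the code from the pull request and not Synda's actual internals: the function name `pick_next_downloads` and the shapes of `pending` and `running` are assumptions made for this example. Only the two parameter names come from the description above.

```python
from collections import Counter


def pick_next_downloads(pending, running,
                        max_parallel_download,
                        max_parallel_download_per_datanode):
    """Choose which pending files to start now (hypothetical helper).

    pending: list of (file_id, data_node) tuples, in queue order
    running: list of data_node names, one entry per active download
    """
    # Count the active downloads for each data node.
    active = Counter(running)
    # Slots still available under the global limit.
    free_slots = max_parallel_download - len(running)

    selected = []
    for file_id, data_node in pending:
        if free_slots <= 0:
            break  # global limit reached
        if active[data_node] >= max_parallel_download_per_datanode:
            continue  # this data node is at its cap; try the next file
        selected.append((file_id, data_node))
        active[data_node] += 1
        free_slots -= 1
    return selected
```

With four downloads already running from a slow node and a per-node cap of 4, this sketch skips further files from that node and hands the free slots to files from other nodes, so a slow node can no longer take over the whole pool.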
I have implemented this feature and submitted it as a pull request.