alibaba / graph-learn

An Industrial Graph Neural Network Framework
Apache License 2.0

Data partition issue in distributed training #49

Closed amznero closed 4 years ago

amznero commented 4 years ago

Hi there,

Recently, when I used dist_train.py (examples/tf/graphsage/dist_train.py) to test distributed mode, I found that the number of iterations run by each worker differs significantly (in my opinion, it should be roughly equal across workers).

Problem Detail:

The PPI dataset contains 56,944 nodes and 818,717 edges. Suppose I set the batch size to 100; then there should be 570 iterations (node-based sampler) in one epoch of the training schedule.
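As a quick sanity check, the expected per-epoch iteration count follows directly from these numbers (plain Python, nothing graph-learn specific):

```python
import math

num_nodes = 56_944   # PPI node count
batch_size = 100

iters_per_epoch = math.ceil(num_nodes / batch_size)
print(iters_per_epoch)  # 570
```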

  1. When I use 1 ps with 2 workers for distributed training, worker-0 runs 284 iterations but worker-1 runs 856, so the two workers run 1140 iterations in total (2 × 570). The data appears to have been traversed twice in one epoch.

  2. When I use 2 ps with 1 worker (an unusual setting, just for the experiment), worker-0 runs only 285 iterations (570 / 2). Similarly, with 4 ps and 2 workers, worker-0 and worker-1 each run 143 iterations. Half of the data is not used.


Some thoughts

After reading some of the source code, I suspect problem 1 may be caused by the shared state in node_getter.cc (graphlearn/core/operator/graph/node_getter.cc).

When a client sends a node-getter request, the NodeGetter op locks the DataStorage, so multiple requests will not read the same data (thread safety).

But when the cursor reaches the end of the data, the server raises an OutOfRangeError to the requesting client and resets the cursor to 0. Other workers connected to the same server never receive that OutOfRangeError, so by the time they issue their next node request, the shared cursor has already been re-initialized to 0. This matches the result of problem 1: when worker-0 reaches the end of the data, it receives an OutOfRangeError, the server resets the cursor to 0, and worker-0 finishes its training process. Worker-1 then restarts from 0 and traverses the whole dataset again (iterations 0 to 569).
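To make this concrete, here is a minimal Python sketch (not graph-learn code; the class and names are invented for illustration) that models a single server-side cursor shared by two workers, using the same PPI numbers:

```python
import threading

class OutOfRangeError(Exception):
    pass

class SharedNodeState:
    """Toy model of a server-side DataStorage cursor shared by all connected workers."""
    def __init__(self, num_nodes, batch_size):
        self.num_nodes, self.batch_size, self.cursor = num_nodes, batch_size, 0
        self.lock = threading.Lock()          # mirrors the per-request locking

    def get_batch(self):
        with self.lock:
            if self.cursor >= self.num_nodes:
                self.cursor = 0               # reset is server-side and global...
                raise OutOfRangeError()       # ...but only the requesting worker sees it
            self.cursor += self.batch_size

state = SharedNodeState(num_nodes=56_944, batch_size=100)
iters = {0: 0, 1: 0}
done = {0: False, 1: False}

# Interleave requests from two workers against the same shared state.
while not all(done.values()):
    for w in (0, 1):
        if done[w]:
            continue
        try:
            state.get_batch()
            iters[w] += 1
        except OutOfRangeError:
            done[w] = True

print(iters)  # ~{0: 285, 1: 855}: worker-1 re-traverses the dataset after the reset
```

The exact split depends on how the requests interleave, but the total is always about twice the 570 iterations expected per epoch, which matches the 284 / 856 observation above.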


Summary:

  1. When the number of workers is greater than the number of servers (ps), the server resets the shared state multiple times (once per worker connected to it). In fact, the number of resets should depend only on the epoch setting. As a result, the total number of iterations ends up greater than expected.

  2. When the number of workers is less than the number of servers (ps), the nodes are partitioned into n_server parts (by hash, for now), but only num_workers of those parts are used in training (see the sketch below). This may be because each client connects to only one fixed server?
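A small illustrative sketch of this hypothesis (again, not graph-learn code; the hash partitioning here is just an assumption for illustration):

```python
# Hypothesis: nodes are hash-partitioned across n_server shards, but each worker
# only reads from the single server it is bound to, so with fewer workers than
# servers some shards are never consumed.
n_server, n_worker = 4, 2
num_nodes = 56_944

shards = {s: 0 for s in range(n_server)}
for node_id in range(num_nodes):
    shards[hash(node_id) % n_server] += 1     # hypothetical hash partition

consumed = sum(shards[w] for w in range(n_worker))  # worker w bound to server w
print(consumed / num_nodes)                   # ~0.5: half of the data is never visited
```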

jackonan commented 4 years ago

Thanks for your feedback. This is a known problem and we are working on a fix. The case where multiple clients iterate against the same server needs to be designed carefully, with both correctness and performance guaranteed. For the case where the number of workers is less than the number of servers, we would appreciate hearing about your use case; in our experience it rarely happens.

amznero commented 4 years ago

Oh, the setting with fewer workers than servers was just an experiment to understand your implementation 😂.

In general, num_workers is greater than num_servers.

For now, I use the num_workers = num_servers setting for training to avoid the above problem.