chunyang-wen opened this issue 6 years ago
Hi, we have actually been working on a distributed version recently, but we don't want to make it public before it is stable to use, so it may still take some time to get there. BTW, could you share your detailed scenario (dataset size, models, and training time) so that we can prioritize its support? Thanks.
Related work is single-node multi-GPU support, which shouldn't be too far away; I think it will appear within the next few releases.
Thanks for your quick response. The speed of dynamically constructing graphs is really handy. Currently we are trying to train a model with billions of instances. Each instance has a timestamp, so we cannot simply do data parallelism. There are ways to divide the data into independent batches, but the instances share common weights. Since distributed training is not supported, training is rather slow: for about 20 million instances with batch size = 10000, it takes at least 4 hours. The model size depends on the number of unique IDs in the instances; the model itself is just a simple variation of logistic regression.
Is there any upcoming timeline for distributed training support? I noticed your answer on zhihu, for your reference.
Good to know your use case, but currently we don't have a specific plan to release it. I will leave comments in this thread if we have updates. BTW, which zhihu post are you referring to?
I am making some progress: I have decided to use ps-lite, a distributed key-value store. I have added it as a submodule of dynet, and it now compiles.
When should values be updated (PUSH to or PULL from the server)?
It seems best to do this in the trainer's update function.
A new function will be added to the trainer to mark the end of training and pull the latest parameters. There are two things I still need to solve:
On the ps-lite side, it would be better if it supported string keys and sparse updates.

One request: if possible, make this configurable. There are a number of algorithms for distributed training (distributed synchronous SGD, distributed async SGD a.k.a. HogWild, etc.), and a number of transport layers that could be used here (MPI, custom shared-memory mechanisms on a single machine, zillions of parameter-server variants). Let's try to design it so we stay forward-compatible with the variations that are likely to be tried.
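The push/pull pattern discussed above (sync parameters inside the trainer's update step) can be sketched without any real parameter server. In this minimal Python sketch, an in-process dict stands in for the ps-lite server, and all class and method names (`ToyKVStore`, `PSTrainer`) are hypothetical illustrations, not DyNet or ps-lite API:

```python
import numpy as np

class ToyKVStore:
    """In-process stand-in for a parameter server (e.g. ps-lite)."""
    def __init__(self):
        self.store = {}

    def push(self, key, grad, lr=0.1):
        # The server applies the pushed gradient (async-SGD style).
        self.store[key] = self.store.get(key, np.zeros_like(grad)) - lr * grad

    def pull(self, key, shape):
        # Return the latest server-side weights (zeros if never pushed).
        return self.store.get(key, np.zeros(shape))

class PSTrainer:
    """Hypothetical trainer that syncs one weight vector per update."""
    def __init__(self, kv, key, shape):
        self.kv, self.key = kv, key
        self.w = kv.pull(key, shape)

    def update(self, grad):
        # PUSH the local gradient, then PULL the latest server weights.
        self.kv.push(self.key, grad)
        self.w = self.kv.pull(self.key, grad.shape)

kv = ToyKVStore()
t = PSTrainer(kv, "w", (3,))
t.update(np.array([1.0, 0.0, -1.0]))
print(t.w)  # server weights after one push: [-0.1, 0., 0.1]
```

A synchronous variant would instead accumulate pushes from all workers before applying them, which is where the configurability requested above comes in.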
I would be interested in this as well. For example, Caffe-MPI demonstrates nearly perfect scaling in this paper, significantly better than the other frameworks (TensorFlow, MXNet, CNTK), but DyNet seems better suited to my interests.
@xunzhang Any chance there have been updates on your distributed version?
I was looking at https://github.com/horovod/horovod/ and it seems promising.
Data volumes are increasing dramatically, and distributed training is the trend. I wonder if there is any plan to support it.