Optimizing Distributed Execution of Deep Learning System #97

Open gaocegege opened 6 years ago

gaocegege commented 6 years ago

https://102.alibaba.com/fund/proposalTopicDetail.htm?id=1120

A call-for-proposals topic from the Alibaba fund.

gaocegege commented 6 years ago
  1. System-level optimization: by designing better pipeline execution, distributed placement strategies (which part of the computation should be placed on GPUs, and which on CPUs, FPGAs, etc.), and scheduling strategies, computation and communication can be overlapped as much as possible, reducing the idle time of hardware resources. Ideally, system scalability could be extended to thousands of heterogeneous high-performance GPGPU devices or more. (A toy overlap sketch follows this list.)

  2. Optimization algorithm design: by tailoring the training algorithm to better match the distributed execution scenario (for example, large-batch training and gradient compression), scalability can be further improved. For this research track, what we want to emphasize is the generality of the proposed optimization strategy. Given a specific model, it is not particularly difficult to tune a training procedure for better scalability, but as an AI infrastructure provider, what we care about most is how general the training algorithm is. Otherwise, a lot of algorithm-tuning headaches are pushed to the user side, which is unacceptable from an infrastructure perspective. (A top-k gradient compression sketch follows this list.)

  3. Designing models friendly to distributed execution: some models are expected to be better suited to distributed execution; one example is LightRNN proposed by Microsoft Research Asia. We are looking for a principled way to design models that are friendly to distributed execution.

  4. Auto-parallelism: the high-level, single-node model description provided by the user is automatically transformed into a distributed execution plan that runs efficiently on heterogeneous clusters. With auto-parallelism, model developers should not need to care about the underlying computation and communication hardware characteristics; they can focus on the model description and let the deep learning engine decide the detailed distributed execution plan, such as computation node placement, graph partitioning, communication strategy, etc. (A toy placement sketch follows this list.)
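
For the first direction, a minimal sketch of the computation/communication overlap idea, using plain Python threads to stand in for a real collective library; `simulated_allreduce` and the list of layer-gradient callbacks are hypothetical names used only for illustration. While one layer's gradient is being reduced across workers, the backward pass keeps computing the gradients of earlier layers.

```python
import threading
import numpy as np

def simulated_allreduce(name, grad, reduced):
    """Stand-in for a real collective (e.g. an NCCL all-reduce)."""
    reduced[name] = grad.copy()  # pretend the gradient was averaged across workers

def backward_with_overlap(layer_grad_fns):
    """layer_grad_fns: [(name, fn)] in reverse topological order; fn() returns that layer's gradient."""
    reduced, workers = {}, []
    for name, grad_fn in layer_grad_fns:
        grad = grad_fn()                       # compute this layer's gradient
        t = threading.Thread(target=simulated_allreduce, args=(name, grad, reduced))
        t.start()                              # communication runs in the background
        workers.append(t)                      # while backward continues with earlier layers
    for t in workers:
        t.join()                               # wait for all collectives before the parameter update
    return reduced

# toy usage: two "layers" whose gradients are just random arrays
grads = backward_with_overlap([
    ("fc2", lambda: np.random.randn(256, 128)),
    ("fc1", lambda: np.random.randn(128, 64)),
])
```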
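
For the second direction, a minimal sketch of top-k gradient sparsification with error feedback (entries that are dropped stay in a residual buffer and are re-added at the next step). This only illustrates the general idea, not any specific published scheme; `topk_compress`/`topk_decompress` are made-up names.

```python
import numpy as np

def topk_compress(grad, residual, ratio=0.01):
    """Send only the largest-magnitude `ratio` fraction of entries; keep the rest as residual."""
    acc = grad + residual                      # error feedback: add back previously dropped values
    flat = acc.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    values = flat[idx]
    new_residual = acc.copy()
    new_residual.ravel()[idx] = 0.0            # transmitted entries are removed from the residual
    return (idx, values, acc.shape), new_residual

def topk_decompress(payload):
    """Rebuild a dense gradient from the sparse payload."""
    idx, values, shape = payload
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[idx] = values
    return flat.reshape(shape)

# toy usage
grad = np.random.randn(1000)
payload, residual = topk_compress(grad, np.zeros_like(grad), ratio=0.05)
approx = topk_decompress(payload)
```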
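
For the fourth direction, a toy greedy device-placement sketch: given per-op compute cost estimates and a set of devices, assign each op to the currently least-loaded device. A real auto-parallel planner would also model communication cost, graph partitioning, and pipeline stages; the op names and cost numbers below are made up for illustration.

```python
# toy computation graph: op name -> estimated compute cost (arbitrary units)
op_costs = {
    "embed": 4.0,
    "rnn_fwd": 10.0,
    "attention": 6.0,
    "softmax": 2.0,
}

def greedy_placement(op_costs, devices):
    """Assign each op to the device with the smallest accumulated load."""
    load = {d: 0.0 for d in devices}
    placement = {}
    # place heavy ops first so large costs are spread across devices
    for op, cost in sorted(op_costs.items(), key=lambda kv: -kv[1]):
        device = min(load, key=load.get)
        placement[op] = device
        load[device] += cost
    return placement

print(greedy_placement(op_costs, ["gpu:0", "gpu:1", "cpu:0"]))
```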

gaocegege commented 6 years ago

Four research directions. For the first one there is a paper, https://github.com/gaocegege/papers-notebook/issues/96 , that I have been meaning to read recently. I am not very familiar with the second and third directions; the fourth is a very interesting research direction.