Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

dyweb / papers-notebook

Papers Notebook: paper reading notes (Distributed System, Virtualization, Machine Learning)




Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads #125

Open gaocegege opened 5 years ago

gaocegege commented 5 years ago

https://arxiv.org/pdf/1901.05758.pdf

gaocegege commented 5 years ago

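This is work from Microsoft. The paper studies how the following three issues affect the scheduling of DNN training workloads:

The effect of gang scheduling and locality constraints on queueing (related work: kube-batch, etc.)
The effect of locality on GPU utilization
Failures during training

Based on these findings, the authors propose design guidelines for next-generation schedulers built for DNN training.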
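To make the first point concrete, here is a minimal sketch of how gang scheduling and a locality constraint interact. It is a toy model and not the scheduler from the paper: `gang_place`, `free_gpus`, and `max_servers` are hypothetical names, and the greedy packing policy is an assumption made for illustration.

```python
from typing import Dict, Optional


def gang_place(free_gpus: Dict[str, int], demand: int,
               max_servers: int) -> Optional[Dict[str, int]]:
    """Return an all-or-nothing placement of `demand` GPUs, or None.

    free_gpus   -- hypothetical map from server name to idle GPU count
    max_servers -- locality constraint: the job may span at most this many servers
    """
    placement: Dict[str, int] = {}
    remaining = demand
    # Greedy: try the servers with the most idle GPUs first to keep the job packed.
    for server, free in sorted(free_gpus.items(), key=lambda kv: (-kv[1], kv[0])):
        if remaining == 0 or len(placement) == max_servers:
            break
        if free == 0:
            continue
        take = min(free, remaining)
        placement[server] = take
        remaining -= take
    # Gang semantics: a partial allocation is useless, so the job keeps waiting.
    return placement if remaining == 0 else None


cluster = {"s1": 2, "s2": 3, "s3": 1}
strict = gang_place(cluster, 4, max_servers=1)   # no single server has 4 idle GPUs
relaxed = gang_place(cluster, 4, max_servers=2)  # allowed to span two servers
print(strict)   # None
print(relaxed)  # {'s2': 3, 's1': 1}
```

In this toy cluster the job has to queue under the strict constraint even though six GPUs are idle in total, and it starts immediately once the constraint is relaxed, at the cost of spanning two servers.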

gaocegege commented 5 years ago

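Based on their own experience, the authors highlight three points worth noting. Honestly, I feel like I could have come up with these myself.

Locality matters a lot
Different jobs sharing GPUs on the same machine can interfere with one another
Many errors should be caught early, for example through profiling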

We plan to release traces used for our study and hope that insights and data from our study inform the burgeoning work of scheduling research for machine learning workloads. (Please hurry up.)
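As a rough illustration of the third point (catching errors early), one option is to run a few training steps cheaply before asking for the full gang of GPUs. This is only a sketch under that assumption: `smoke_test`, `buggy_step`, and `ok_step` are hypothetical names, and the paper may use a different mechanism.

```python
from typing import Callable, Optional


def smoke_test(train_step: Callable[[int], None], steps: int = 3) -> Optional[str]:
    """Run a few steps cheaply; return an error description, or None if all passed."""
    for i in range(steps):
        try:
            train_step(i)
        except Exception as exc:  # broad on purpose: any early crash is worth reporting
            return f"step {i}: {type(exc).__name__}: {exc}"
    return None


def buggy_step(i: int) -> None:
    # Hypothetical user error that only shows up on the second iteration.
    if i == 1:
        raise ValueError("shape mismatch")


def ok_step(i: int) -> None:
    # Stand-in for a training step that runs without problems.
    pass


print(smoke_test(buggy_step))  # step 1: ValueError: shape mismatch
print(smoke_test(ok_step))     # None
```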

gaocegege commented 5 years ago

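The workloads covered in this paper are LSTM, CNN, and similar models trained with frameworks such as TensorFlow, PyTorch, Caffe, and MXNet. Distributed training uses data parallelism. Both AllReduce and parameter-server update schemes are supported.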
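For readers less familiar with the two update schemes, here is a minimal pure-Python sketch of one data-parallel step done both ways. It assumes synchronous updates and plain SGD, which these notes do not specify; the function names and the gradient values are made up for illustration, and `average` merely stands in for a real collective.

```python
from typing import List

Vector = List[float]


def average(grads: List[Vector]) -> Vector:
    """Element-wise mean of per-worker gradients."""
    n = len(grads)
    return [sum(col) / n for col in zip(*grads)]


def allreduce_step(replicas: List[Vector], grads: List[Vector], lr: float) -> List[Vector]:
    """Every worker obtains the averaged gradient and updates its own replica."""
    g = average(grads)  # stands in for a real ring/tree all-reduce
    return [[w - lr * gi for w, gi in zip(rep, g)] for rep in replicas]


def parameter_server_step(server_weights: Vector, grads: List[Vector], lr: float) -> Vector:
    """Workers push gradients; the server applies the mean; workers then pull the result."""
    g = average(grads)
    return [w - lr * gi for w, gi in zip(server_weights, g)]


weights = [1.0, 2.0]
worker_grads = [[0.5, 1.0], [1.5, 3.0]]  # made-up gradients from two data shards

replicas = allreduce_step([weights[:], weights[:]], worker_grads, lr=0.5)
pulled = parameter_server_step(weights, worker_grads, lr=0.5)
print(replicas)  # [[0.5, 1.0], [0.5, 1.0]]
print(pulled)    # [0.5, 1.0]
```

Under these assumptions both schemes end with the same weights; they differ in where the aggregation happens and therefore in the communication pattern, which is where locality comes in.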


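Scheduling in this paper is built on YARN. The comparison with other schedulers is shown in the figure:

[Figure: comparison of this scheduler with other schedulers]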

gaocegege commented 5 years ago

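The rest of the paper uses experiments to validate the three points above and proposes some guidelines. I won't go through them here; see the paper for details.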

at15 commented 5 years ago

@gaocegege If you think you can do it, go ahead and do it

gaocegege commented 5 years ago

Well, I've given up research and gone into industry, haven't I?