ganler / ResearchReading

Reading notes on general systems research material (not limited to papers).
GNU General Public License v3.0

OSDI'20 | AntMan: Dynamic Scaling on GPU Clusters for Deep Learning #44

Closed; ganler closed this issue 3 years ago

ganler commented 3 years ago
  1. Paper: https://www.usenix.org/system/files/osdi20-xiao.pdf
  2. Video: https://www.youtube.com/watch?v=8PSzcqL0eUA
  3. Gaocegege's WeChat blog;

Previous dynamic-scaling work mostly operates at the per-device level. This work targets a multi-tenant scenario where multiple training jobs may share one physical GPU under dynamic scaling. AntMan re-implements parts of TensorFlow to redefine the behavior of (i) the kernel dispatcher and (ii) the memory allocator:

  1. The kernel dispatcher packs consecutive kernel launches from one job to reduce performance interference between co-located jobs.
  2. When a job goes idle, the manager shrinks its GPU memory so other jobs can use the freed capacity.
  3. Jobs are divided into 2 types: (1) resource-guarantee jobs (performance first) and (2) opportunistic jobs (utilization first).
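The mechanisms above can be sketched as a toy scheduler: guarantee jobs dispatch kernels immediately, opportunistic jobs queue theirs and drain them in packed bursts during idle windows, and idle jobs can have their memory cap shrunk. This is a minimal illustration only; all names (`GpuOpManager`, `launch`, `drain_idle`, `shrink`) are hypothetical, not AntMan's actual API, which lives inside modified TensorFlow internals.

```python
from collections import deque

class GpuOpManager:
    """Toy sketch of AntMan-style GPU sharing (illustrative names only)."""

    def __init__(self):
        self.mem_limits = {}   # job -> current GPU memory cap (GB)
        self.guaranteed = set()  # resource-guarantee jobs
        self.queues = {}       # job -> queued kernels (opportunistic jobs)

    def register(self, job, guaranteed, mem_limit):
        self.mem_limits[job] = mem_limit
        self.queues[job] = deque()
        if guaranteed:
            self.guaranteed.add(job)

    def launch(self, job, kernel):
        # Guarantee jobs dispatch right away; opportunistic jobs queue up.
        if job in self.guaranteed:
            return kernel()
        self.queues[job].append(kernel)
        return None

    def drain_idle(self, job, budget):
        # During a guarantee job's idle window, pack up to `budget`
        # consecutive queued kernels to limit interference.
        out = []
        q = self.queues[job]
        for _ in range(min(budget, len(q))):
            out.append(q.popleft()())
        return out

    def shrink(self, idle_job, to):
        # Reclaim GPU memory from an idle job; returns the freed amount.
        freed = self.mem_limits[idle_job] - to
        self.mem_limits[idle_job] = to
        return freed
```

Usage under these assumptions: a guarantee job's kernel runs immediately, while an opportunistic job's kernels only execute when explicitly drained, mirroring the paper's performance-first vs. utilization-first split.

```python
mgr = GpuOpManager()
mgr.register("train-A", guaranteed=True, mem_limit=10)
mgr.register("train-B", guaranteed=False, mem_limit=6)
mgr.launch("train-A", lambda: "A-k0")     # dispatched immediately
mgr.launch("train-B", lambda: "B-k0")     # queued
mgr.launch("train-B", lambda: "B-k1")     # queued
mgr.drain_idle("train-B", budget=2)       # packed burst: runs B-k0, B-k1
mgr.shrink("train-B", to=2)               # frees 4 GB for other jobs
```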