ganler / ResearchReading

Reading notes on general systems research material (not limited to papers).
GNU General Public License v3.0

OSDI'20 | AntMan: Dynamic Scaling on GPU Clusters for Deep Learning #44

Closed; ganler closed this issue 3 years ago

ganler commented 3 years ago
  1. Paper: https://www.usenix.org/system/files/osdi20-xiao.pdf
  2. Video: https://www.youtube.com/watch?v=8PSzcqL0eUA
  3. Gaocegege's WeChat blog;

Previous dynamic-scaling work mostly operates at the per-device level. This work targets a multi-tenant scenario where multiple training jobs may share one physical GPU under dynamic scaling. AntMan re-implements parts of TensorFlow to redefine the behavior of (i) the kernel dispatcher and (ii) the memory allocator:

  1. The kernel dispatcher packs consecutive kernel launches from one job to reduce performance interference between co-located jobs.
  2. When a job goes idle, the manager shrinks its GPU memory so other jobs can use the freed capacity.
  3. Jobs are divided into 2 types: (1) resource-guarantee jobs (performance first) and (2) opportunistic jobs (utilization first).
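The mechanisms above can be sketched as a toy scheduler: guarantee jobs dispatch kernels immediately, opportunistic jobs queue theirs and drain them in packed bursts during idle windows, and idle jobs can have their memory cap shrunk. This is a minimal illustration only; all names (`GpuOpManager`, `launch`, `drain_idle`, `shrink`) are hypothetical, not AntMan's actual API, which lives inside modified TensorFlow internals.

```python
from collections import deque

class GpuOpManager:
    """Toy sketch of AntMan-style GPU sharing (illustrative names only)."""

    def __init__(self):
        self.mem_limits = {}   # job -> current GPU memory cap (GB)
        self.guaranteed = set()  # resource-guarantee jobs
        self.queues = {}       # job -> queued kernels (opportunistic jobs)

    def register(self, job, guaranteed, mem_limit):
        self.mem_limits[job] = mem_limit
        self.queues[job] = deque()
        if guaranteed:
            self.guaranteed.add(job)

    def launch(self, job, kernel):
        # Guarantee jobs dispatch right away; opportunistic jobs queue up.
        if job in self.guaranteed:
            return kernel()
        self.queues[job].append(kernel)
        return None

    def drain_idle(self, job, budget):
        # During a guarantee job's idle window, pack up to `budget`
        # consecutive queued kernels to limit interference.
        out = []
        q = self.queues[job]
        for _ in range(min(budget, len(q))):
            out.append(q.popleft()())
        return out

    def shrink(self, idle_job, to):
        # Reclaim GPU memory from an idle job; returns the freed amount.
        freed = self.mem_limits[idle_job] - to
        self.mem_limits[idle_job] = to
        return freed
```

Usage under these assumptions: a guarantee job's kernel runs immediately, while an opportunistic job's kernels only execute when explicitly drained, mirroring the paper's performance-first vs. utilization-first split.

```python
mgr = GpuOpManager()
mgr.register("train-A", guaranteed=True, mem_limit=10)
mgr.register("train-B", guaranteed=False, mem_limit=6)
mgr.launch("train-A", lambda: "A-k0")     # dispatched immediately
mgr.launch("train-B", lambda: "B-k0")     # queued
mgr.launch("train-B", lambda: "B-k1")     # queued
mgr.drain_idle("train-B", budget=2)       # packed burst: runs B-k0, B-k1
mgr.shrink("train-B", to=2)               # frees 4 GB for other jobs
```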