dyweb / papers-notebook

:page_facing_up: :cn: :page_with_curl: Papers Notebook: paper reading notes on distributed systems, virtualization, and machine learning

Towards Effective Resource Utilization over Multi-tenant Large-scale Machine Learning Platform #98

Open gaocegege opened 6 years ago

gaocegege commented 6 years ago

https://102.alibaba.com/fund/proposalTopicDetail.htm?id=1121

Alibaba Fund topic

gaocegege commented 6 years ago
1. Inter-workload optimization

Inter-workload optimization can be viewed from two perspectives. First, multiple workloads could be merged into a single logical task and scheduled onto a single computation device to exploit the full potential of the underlying hardware. Virtualization is one way to address this challenge, but not the only one.
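As a concrete illustration of the first perspective, here is a minimal sketch (all names hypothetical, not any platform's real API) of merging small workloads onto shared devices via first-fit packing on GPU-memory demand:

```python
from dataclasses import dataclass, field

@dataclass
class Device:
    name: str
    mem_total: float           # GPU memory in GiB (assumed metric)
    mem_used: float = 0.0
    workloads: list = field(default_factory=list)

def first_fit_pack(workloads, devices):
    """Place each (name, mem_demand) workload on the first device
    with enough free memory, so several small jobs share one GPU."""
    placement = {}
    for name, demand in workloads:
        for dev in devices:
            if dev.mem_total - dev.mem_used >= demand:
                dev.mem_used += demand
                dev.workloads.append(name)
                placement[name] = dev.name
                break
        else:
            placement[name] = None  # no device can host this workload
    return placement

devices = [Device("gpu0", 16.0), Device("gpu1", 16.0)]
jobs = [("infer-a", 4.0), ("infer-b", 6.0), ("train-c", 12.0)]
print(first_fit_pack(jobs, devices))
# {'infer-a': 'gpu0', 'infer-b': 'gpu0', 'train-c': 'gpu1'}
```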

Second, a typical machine learning workflow usually consists of multiple tasks, each with its own characteristics. A holistic, global view for optimizing the execution and resource usage of these tasks is necessary, since a wider view exposes more optimization opportunities. A uniform language or intermediate representation (IR) may also be needed to describe the execution of multiple tasks in a common way; on top of such an IR, aggressive optimization strategies can be applied. The Weld IR proposed by Stanford and MIT is a motivating example, but it is still at a quite early stage.
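To make the IR idea concrete, here is a toy sketch (not Weld's actual API) of why a common representation enables cross-task optimization: two logically separate map stages, once expressed in one IR, can be fused into a single pass over the data:

```python
# Toy IR: a pipeline is a list of ("map", fn) stages. Once separate
# tasks are expressed in this common form, a fusion pass can combine
# adjacent maps into one traversal -- the kind of cross-task
# optimization a shared IR like Weld aims to enable.

def fuse_maps(stages):
    """Collapse consecutive map stages into a single composed map."""
    fused = []
    for op, fn in stages:
        if op == "map" and fused and fused[-1][0] == "map":
            prev_fn = fused[-1][1]
            fused[-1] = ("map", lambda x, f=prev_fn, g=fn: g(f(x)))
        else:
            fused.append((op, fn))
    return fused

def run(stages, data):
    for op, fn in stages:
        if op == "map":
            data = [fn(x) for x in data]
    return data

# Two "tasks": one normalizes, the next squares; after fusion the
# data is traversed once instead of twice.
pipeline = [("map", lambda x: x / 10.0), ("map", lambda x: x * x)]
print(run(fuse_maps(pipeline), [10, 20, 30]))  # [1.0, 4.0, 9.0]
```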

2. Priority-based machine learning task scheduling

As mentioned earlier, not all jobs are created equal, so priority-based scheduling is necessary to ensure that incoming higher-priority workloads are not starved for long periods when the cluster is fully occupied by lower-priority tasks that started earlier. Preemptive scheduling is not as easy as it sounds for machine learning tasks: preempting a long-running training job throws away its in-memory state unless the job can be checkpointed and resumed.
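Below is a minimal sketch of what such a preemptive priority scheduler could look like, assuming (optimistically) that a preempted job can be checkpointed and requeued; the class and its behavior are illustrative, not any real scheduler's API:

```python
import heapq

class PriorityScheduler:
    """Toy preemptive scheduler: a larger number means higher priority.
    If the cluster is full, the lowest-priority running job is
    preempted (assumed checkpointable) to make room."""

    def __init__(self, slots):
        self.slots = slots
        self.running = []  # min-heap of (priority, name): cheapest victim on top
        self.pending = []  # max-heap via negated priority

    def submit(self, name, priority):
        heapq.heappush(self.pending, (-priority, name))
        self._schedule()

    def _schedule(self):
        while self.pending:
            neg_p, name = heapq.heappop(self.pending)
            prio = -neg_p
            if len(self.running) < self.slots:
                heapq.heappush(self.running, (prio, name))
            elif self.running[0][0] < prio:
                # Checkpoint and requeue the lowest-priority running job.
                victim_prio, victim = heapq.heappop(self.running)
                heapq.heappush(self.pending, (-victim_prio, victim))
                heapq.heappush(self.running, (prio, name))
            else:
                # Nothing preemptible; put the job back and wait.
                heapq.heappush(self.pending, (neg_p, name))
                break

sched = PriorityScheduler(slots=1)
sched.submit("exploratory-train", 1)
sched.submit("production-train", 5)  # preempts the lower-priority job
print(sched.running)  # [(5, 'production-train')]
print(sched.pending)  # [(-1, 'exploratory-train')]
```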

3. Compiler-oriented distributed optimization

We view the resource utilization problem of a large-scale machine learning platform as a typical compiler-oriented problem, since what needs to be solved is bridging the gap between the high-level description of machine learning workloads and their high-performance execution on the underlying hardware. The strategy of mapping distributed execution requests onto heterogeneous computing devices can be regarded as a classic graph optimization problem. More aggressively, automating the transformation of a single-machine task description into a distributed execution plan can also be viewed as graph optimization.

To make the high-level graph optimization effective, the low-level code generation problem must be solved as well. Once graph optimization takes effect, the boundaries between components are broken and a larger optimization space is exposed, but the existing implementations of those components can no longer be reused, so code generation techniques are needed to produce executable code for the bigger picture. Hopefully the code generation work can be achieved in a principled way rather than case by case.
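As a toy illustration of the graph-optimization view, the sketch below greedily maps a topologically ordered operator graph onto heterogeneous devices using assumed per-op compute costs and a flat cross-device transfer cost; production compilers such as TVM or XLA search vastly larger spaces than this:

```python
# Device placement as graph optimization: each op has an estimated
# cost per device, and crossing devices along an edge adds a transfer
# cost. All costs here are made-up numbers for illustration.

TRANSFER_COST = 2.0  # assumed cost of moving a tensor across devices

def place(graph, costs):
    """graph: op -> list of predecessor ops (dict assumed to be in
    topological order); costs: op -> {device: compute cost}.
    Returns op -> chosen device."""
    placement = {}
    for op, preds in graph.items():
        best_dev, best_total = None, float("inf")
        for dev, c in costs[op].items():
            moves = sum(1 for p in preds if placement[p] != dev)
            total = c + moves * TRANSFER_COST
            if total < best_total:
                best_dev, best_total = dev, total
        placement[op] = best_dev
    return placement

graph = {"load": [], "embed": ["load"], "matmul": ["embed"], "reduce": ["matmul"]}
costs = {
    "load":   {"cpu": 1.0, "gpu": 3.0},
    "embed":  {"cpu": 4.0, "gpu": 2.0},
    "matmul": {"cpu": 9.0, "gpu": 1.0},
    "reduce": {"cpu": 2.0, "gpu": 1.5},
}
print(place(graph, costs))
# {'load': 'cpu', 'embed': 'cpu', 'matmul': 'gpu', 'reduce': 'gpu'}
```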

gaocegege commented 6 years ago

The first and second points are both directions I am very interested in; isn't the third one basically https://github.com/dmlc/tvm?

gaocegege commented 5 years ago

The third point should be something like https://github.com/gaocegege/papers-notebook/issues/101.