ganler / ResearchReading

General system research material (not limited to papers) reading notes.

MLSys'21 | Value Learning for Throughput Optimization of Deep Learning Workloads #68

Closed: ganler closed this issue 2 years ago

ganler commented 2 years ago

Paper Summary

This paper leverages Reinforcement Learning (RL) to iteratively construct near-optimal schedules for tensor programs. Domain-specific tensor computing languages such as Halide and TensorIR separate the computing logic (the program) from the optimizations (the schedule). Given a program, a tensor compiler aims to generate a high-performance schedule for a specific target platform. Prior and recent autotuning work generates random schedules and uses learning-based techniques to predict their performance, so the schedule with the lowest predicted runtime is selected for code generation.

In addition to schedule mutations (e.g., changing split sizes, reordering the loop structure, vectorizing loops, inserting temporary buffers, and adding thread parallelism), this paper uses RL to iteratively pick the most promising schedule candidates at each stage from a group of scheduling candidates (i.e., beam search). The learned value model predicts the runtime of each (partial) schedule, and the fastest schedule found is recorded. In this way, for a pipeline with N stages and M candidate schedules per stage, the search complexity is O(N × M) rather than O(M^N); a sketch of this value-guided search is given below.

Their feature engineering uses three categories of features:

1. FLOPs, integer operations, and memory access patterns;
2. counts such as the number of vectorized instructions, unique cache lines accessed, and bytes read and written;
3. features derived from the original ones (e.g., the ratio of vectorized instructions).

The value model is a Bi-LSTM. Experimental results show speedups of 2.6× over Halide and 1.5× over TVM.
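Below is a minimal, illustrative sketch (not the paper's code) of value-guided beam search over per-stage schedule candidates, which is how I understand the O(N × M) argument. The callables `enumerate_candidates` and `predict_runtime` are hypothetical placeholders for the schedule mutator and the learned value model.

```python
def beam_search_schedule(stages, enumerate_candidates, predict_runtime,
                         beam_width=8):
    """Extend partial schedules stage by stage, keeping only the `beam_width`
    partial schedules with the lowest predicted runtime.

    With N stages and M candidates per stage, this explores on the order of
    N * M * beam_width states instead of the full M^N cross product.
    """
    beam = [[]]  # start from the empty partial schedule
    for stage in stages:
        expanded = []
        for partial in beam:
            for candidate in enumerate_candidates(stage, partial):
                new_partial = partial + [candidate]
                # The value model scores a partial schedule by predicting the
                # runtime achievable from it.
                expanded.append((predict_runtime(new_partial), new_partial))
        # Keep the most promising partial schedules for the next stage.
        expanded.sort(key=lambda scored: scored[0])
        beam = [partial for _, partial in expanded[:beam_width]]
    # The fastest complete schedule found is used for code generation.
    return min(beam, key=predict_runtime)
```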
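And a minimal sketch, assuming PyTorch, of what a Bi-LSTM value model over per-stage feature vectors could look like; the feature dimension, hidden size, and pooling choice are my assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BiLSTMValueModel(nn.Module):
    """Maps per-stage feature vectors of a (partial) schedule to a predicted runtime."""

    def __init__(self, feature_dim=80, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)  # regress a scalar runtime

    def forward(self, stage_features):
        # stage_features: (batch, num_stages, feature_dim), one vector per
        # pipeline stage (FLOPs, memory-access counts, derived ratios, ...).
        out, _ = self.lstm(stage_features)
        # Pool over stages and predict the runtime of the whole schedule.
        return self.head(out.mean(dim=1)).squeeze(-1)

# Example usage with random features for a batch of 4 five-stage pipelines:
model = BiLSTMValueModel()
pred = model(torch.randn(4, 5, 80))  # -> tensor of shape (4,)
```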

Strength

Weakness