Paper Review: A comprehensive study of deep learning compiler bugs

Publisher

FSE'21

Link to The Paper

https://dl.acm.org/doi/pdf/10.1145/3468264.3468591

Name of The Authors

Shen, Qingchao, Haoyang Ma, Junjie Chen, Yongqiang Tian, Shing-Chi Cheung, and Xiang Chen

Year of Publication

2021

Summary

This paper is the first to analyze deep learning (DL) compiler bugs, covering three DL compilers: TVM, Glow, and nGraph. The authors collected 603 bugs and analyzed them manually, identifying 12 root causes. Some root causes also appear in traditional compilers, such as incorrect assignment and incorrect exception handling. Others are DL-specific, such as type problems (issues with tensor types and computational node types). They also investigated the symptoms of DL compiler bugs and found that the most common were crashes, wrong code, and bad performance. Moreover, they analyzed how often each root cause and symptom occurs, and how symptoms correlate with root causes. Finally, they analyzed which stages of DL compilation are more prone to bugs and what the bugs have in common across the different DL compilers.

As a proof of concept, they designed a testing technique called TVMFuzz, which applies differential testing and test oracle generation, and used it to detect 8 bugs in the TVM compiler.

Contributions of The Paper

The paper presents the first systematic study of DL compiler bugs, examining 603 bugs from three prominent and diverse DL compilers. It provides a comprehensive classification of the root causes of these bugs, along with an outline of the symptoms they tend to manifest.

In addition, the paper explores guidelines for the future detection and debugging of DL compiler bugs. As a practical application of the findings, the authors built a proof-of-concept testing tool for the TVM compiler. This tool uncovered 8 TVM bugs that TVM's original test suite had not detected.

Comments

The authors offer several interesting suggestions on research directions for DL compiler bugs:

Handling types, especially tensor types, in DL compilers is very challenging and deserves more attention.
The large percentage of crash bugs suggests the potential of augmenting the existing test suites with automatically generated tests.
More attention should be paid to designing effective testing methods for wrong-code bugs.
Generating high-quality test oracles around the two symptoms Crash and Wrong Code can help detect a variety of DL compiler bugs.
Because many DL compiler bugs are induced by IR optimization, a variant of differential testing, the Different Optimization Levels (DOL) method, which compares results under different optimization levels, may be adapted to detect optimization-related bugs.
The three DL compilers studied show a high degree of commonality in their bugs, which suggests that a tool designed for one may generalize to the others.
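
To make the DOL idea and the two oracles (Crash and Wrong Code) more concrete, here is a minimal sketch in Python. It is not TVMFuzz and uses no real compiler API: the nested-tuple expression format, the `optimize`, `evaluate`, `run`, and `dol_check` helpers, and the deliberately seeded bug at level 2 are all hypothetical, chosen only to show how comparing results across optimization levels can flag a miscompilation.

```python
import math
import operator

# Toy stand-in for a DL compiler; every name below is hypothetical and
# none of it is TVM, Glow, or nGraph API.
OPS = {"add": operator.add, "sub": operator.sub,
       "mul": operator.mul, "div": operator.truediv}


def optimize(expr, level):
    """Rewrite a nested-tuple expression; level 0 applies no rewrites."""
    if level == 0 or not isinstance(expr, tuple):
        return expr
    op, lhs, rhs = expr
    lhs, rhs = optimize(lhs, level), optimize(rhs, level)
    if op == "add" and rhs == 0.0:   # x + 0 -> x (sound)
        return lhs
    if op == "mul" and rhs == 1.0:   # x * 1 -> x (sound)
        return lhs
    if level >= 2 and op == "div" and rhs == 2.0:
        # Seeded bug: the rewrite should produce x * 0.5, not x * 2.0.
        return ("mul", lhs, 2.0)
    return (op, lhs, rhs)


def evaluate(expr, env):
    """Evaluate an expression; strings are variables looked up in env."""
    if isinstance(expr, str):
        return env[expr]
    if isinstance(expr, (int, float)):
        return float(expr)
    op, lhs, rhs = expr
    return OPS[op](evaluate(lhs, env), evaluate(rhs, env))


def run(expr, env, level):
    """Optimize and execute, turning any exception into a 'crash' record."""
    try:
        return ("ok", evaluate(optimize(expr, level), env))
    except Exception as exc:
        return ("crash", type(exc).__name__)


def dol_check(expr, env, levels=(0, 1, 2)):
    """Compare each optimization level against the first (baseline) level."""
    results = {lvl: run(expr, env, lvl) for lvl in levels}
    base_status, base_value = results[levels[0]]
    verdicts = {}
    for lvl in levels[1:]:
        status, value = results[lvl]
        if status != base_status:
            verdicts[lvl] = "crash-mismatch"   # Crash oracle
        elif status == "ok" and not math.isclose(
                value, base_value, rel_tol=1e-6, abs_tol=1e-9):
            verdicts[lvl] = "wrong-code"       # Wrong Code oracle
        else:
            verdicts[lvl] = "consistent"
    return verdicts


if __name__ == "__main__":
    program = ("add", ("div", "x", 2.0), 0.0)
    print(dol_check(program, {"x": 8.0}))  # {1: 'consistent', 2: 'wrong-code'}
```

In this sketch the unoptimized run acts as the reference, so a bug shared by every level would go unnoticed, and an input such as `x = 0.0` would hide the seeded bug entirely. This is one reason such a check would typically be paired with many generated inputs.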
