do(code): A Causal Inference Framework to Understand and Explain Source Code Properties

Description: The use of machine learning and deep learning techniques in software engineering has grown steadily over the last decade. SE-related data, specifically code data, is also in constant demand, since these data are the main input on which learning algorithms operate. Although learning algorithms are useful for extracting patterns from unstructured SE data, their effectiveness is poorly understood and explained. The reason behind this problem is that software researchers mainly evaluate observational scenarios and do not consider possible interventions in the data that might influence the outcome of the learning algorithm. Such interventions enable what causal inference calls counterfactual explanations. In our case, we are particularly interested in code interventions over a set of properties (e.g., complexity, size, or entropy). Often, SE research questions about source code are not purely descriptive or predictive; in some scenarios, they try to establish a causal relationship. Consider the following example research questions, grouped by function:

Description. What is there?: What is the average number of bugs found across distinct code sizes (e.g., small, medium, or large)?
Prediction. What will happen?: Can code size predict the number of bugs a developer will inject into the code? What is the correlation between the number of bugs and code size?
Causal Inference. What would happen if?: Does the size of a system cause the number of bugs developers inject during the SE lifecycle? Why does the size of a system affect the number of bugs? What would happen if the size of the system were held constant? What would happen if the size of the system were skewed toward low values?
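
The difference between the three kinds of questions can be illustrated with a small simulation. The sketch below is not part of the proposed library; it assumes a toy structural causal model in which a hypothetical confounder (developer experience) influences both code size and bug count, and all coefficients are made up for illustration. Under those assumptions, the observational slope of bugs on size overstates the effect obtained by intervening on size, i.e., do(size).

```python
import random


def simulate(n, seed=0, do_size=None):
    """Draw n (size, bugs) pairs from a toy structural causal model.

    Assumed (illustrative) structure:
        experience -> size, experience -> bugs, size -> bugs
    Passing do_size sets size by intervention, which removes the
    experience -> size dependency.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        experience = rng.gauss(0.0, 1.0)  # hypothetical confounder
        if do_size is None:
            size = -experience + rng.gauss(0.0, 1.0)  # observational regime
        else:
            size = do_size  # interventional regime: do(size = do_size)
        bugs = 0.5 * size - experience + rng.gauss(0.0, 1.0)
        rows.append((size, bugs))
    return rows


def mean(values):
    return sum(values) / len(values)


def slope(rows):
    """Least-squares slope of bugs regressed on size."""
    mean_size = mean([s for s, _ in rows])
    mean_bugs = mean([b for _, b in rows])
    cov = sum((s - mean_size) * (b - mean_bugs) for s, b in rows)
    var = sum((s - mean_size) ** 2 for s, _ in rows)
    return cov / var


observed = simulate(20000)

# Description: average bugs per size bucket.
small = mean([b for s, b in observed if s < 0])
large = mean([b for s, b in observed if s >= 0])

# Prediction: how strongly does size predict bugs in observed data?
observed_slope = slope(observed)

# Causal inference: change in average bugs when size is set from 0 to 1.
# Both arms reuse the same seed, so the noise terms cancel.
bugs_do_1 = mean([b for _, b in simulate(20000, do_size=1.0)])
bugs_do_0 = mean([b for _, b in simulate(20000, do_size=0.0)])
causal_effect = bugs_do_1 - bugs_do_0

print("mean bugs (small, large):", small, large)
print("observational slope:", observed_slope)
print("effect of do(size):", causal_effect)
```

In this toy model the structural coefficient of size on bugs is 0.5, while the observational slope is inflated by the confounder. With real repositories the structural model is unknown, which is why the causal question cannot be answered from correlation alone.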

The purpose of this study is to create a library that allows software researchers to evaluate the causal effect of one code property (e.g., code size) on another code property (e.g., number of bugs).
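
One way such an effect could be estimated from observational data is backdoor adjustment by stratification. The following is a minimal sketch under stated assumptions, not the library's design: the data are synthetic, the binary confounder `legacy` and the function names are hypothetical, and the confounder is assumed to be fully observed.

```python
import random
from collections import defaultdict


def make_observations(n=20000, seed=1):
    """Synthetic (legacy, large, bugs) records; all numbers are illustrative."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        legacy = rng.random() < 0.5      # hypothetical confounder
        p_large = 0.8 if legacy else 0.2
        large = rng.random() < p_large   # "treatment": the file is large
        # The effect of `large` on bugs is 2.0 by construction.
        bugs = 1.0 + 2.0 * large + 3.0 * legacy + rng.gauss(0.0, 1.0)
        data.append((legacy, large, bugs))
    return data


def mean(values):
    return sum(values) / len(values)


def naive_difference(data):
    """E[bugs | large] - E[bugs | small], ignoring the confounder."""
    treated = [b for _, t, b in data if t]
    control = [b for _, t, b in data if not t]
    return mean(treated) - mean(control)


def backdoor_adjusted(data):
    """Sum over z of P(z) * (E[bugs | large, z] - E[bugs | small, z])."""
    cells = defaultdict(list)
    for z, t, b in data:
        cells[(z, t)].append(b)
    effect = 0.0
    for z in (False, True):
        p_z = sum(1 for zz, _, _ in data if zz == z) / len(data)
        effect += p_z * (mean(cells[(z, True)]) - mean(cells[(z, False)]))
    return effect


observations = make_observations()
naive = naive_difference(observations)
adjusted = backdoor_adjusted(observations)
print("naive difference:", naive)
print("backdoor-adjusted effect:", adjusted)
```

The naive difference mixes the effect of size with the effect of the confounder, whereas the adjusted estimate recovers the value built into the simulation. Stratification only works when the confounders are known and measured, which is itself an assumption a case study would need to justify.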

Project Goal

[ ] Implement a causal inference module that computes causal effects for software properties.
[ ] Implement a module that allows interventions or systematic transformations on source code data.
[ ] Evaluate both modules for a given case study (i.e., traceability entropy).
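
To make the second goal concrete, the sketch below shows what a very simple code intervention might look like: a transformation that removes comment and blank lines, together with two property measurements (lines of code and token-level Shannon entropy). The function names and the line-based comment heuristic are assumptions for illustration only; they are not taken from ds4se, and a real module would likely rely on a proper parser or tokenizer.

```python
import math
from collections import Counter


def loc(code):
    """Number of non-blank lines."""
    return sum(1 for line in code.splitlines() if line.strip())


def token_entropy(code):
    """Shannon entropy (in bits) over whitespace-delimited tokens."""
    tokens = code.split()
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def drop_comment_lines(code):
    """Intervention: remove blank lines and lines that are only a '#' comment.

    This is a naive heuristic; it does not handle inline comments or
    '#' characters inside string literals.
    """
    kept = [
        line
        for line in code.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return "\n".join(kept)


SAMPLE = """\
# add two numbers
def add(a, b):
    # return the sum
    return a + b

"""

transformed = drop_comment_lines(SAMPLE)

# Measure the properties before and after the intervention.
before = {"loc": loc(SAMPLE), "entropy": token_entropy(SAMPLE)}
after = {"loc": loc(transformed), "entropy": token_entropy(transformed)}
print("before:", before)
print("after:", after)
```

Measuring a property before and after a controlled transformation is what allows the effect of that transformation to be attributed to the intervention rather than to other differences between code samples.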

Project Requirements

Required Knowledge: Python, Git, and Statistics
Preferred Knowledge: Deep Learning, TensorFlow, and DVC

Recommended Readings

WM-SEMERU / ds4se

do(code): A Causal Inference Framework to Understand and Explain Source Code Properties #95