rahlk opened this issue 8 years ago
Rahul: when does Alves apply univariate regression to compute the p-values and reject anything with p > 0.05? On the normalized data?
Alves doesn't use p-values; their method is unsupervised. Shatnawi uses p-values. I used p-values in both cases for the sake of consistency.
P-values of what? Do you apply univariate regression to compute the p-values and reject anything with p > 0.05? On the normalized data?
P-values of the relationship between the metrics and the bug counts. We keep metrics with p < 0.05.
Summary Shatnawi10
In our work, we code fault-free classes as zero and faulty classes as one. We leverage this binary nature to apply a Univariate Binary Logistic Regression (UBR) to identify metrics that have a significant association with the occurrence of defects. To set a cut-off for this association, we use a 95\% confidence level, i.e., we retain metrics with $p < 0.05$.
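As a minimal sketch of this filter (assuming the metrics live in a pandas DataFrame and statsmodels is available; the names `metrics`, `defects`, and `significant_metrics` are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def significant_metrics(metrics: pd.DataFrame, defects: np.ndarray, level=0.05):
    """Fit one univariate binary logistic regression per metric and keep
    those whose coefficient is significant at the given level."""
    keep = {}
    for name in metrics.columns:
        X = sm.add_constant(metrics[name].values)  # columns: [intercept, metric]
        res = sm.Logit(defects, X).fit(disp=0)     # maximum-likelihood fit
        if res.pvalues[1] < level:                 # p-value of the metric coefficient
            keep[name] = res.params                # (alpha, beta), reused for VARL below
    return keep
```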
To identify thresholds for the metrics found significant, we use a method called Value of Acceptable Risk Level (VARL), first proposed by Bender~\cite{bender99} for identifying thresholds in epidemiological studies. In his TSE 2010 article, Shatnawi~\cite{shatnawi10} endorsed the use of this method for identifying thresholds on object-oriented metrics in open source software systems.
The VARL method computes a cut-off value for each metric such that, below that threshold, the probability of a defect occurring is less than a given probability $p_0$. To do this, we fit the UBR to each metric. For every significant metric, this yields a logistic regression model with a constant intercept ($\alpha$) and a coefficient ($\beta$) estimated by maximizing the log-likelihood function. With these, VARL is measured as follows:
\begin{equation}
VARL = \frac{1}{\beta}\left(\log\left(\frac{p_0}{1 - p_0}\right) - \alpha\right)
\end{equation}
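Given the intercept $\alpha$ and coefficient $\beta$ fitted above, VARL is a one-line computation. A sketch, assuming natural logarithms (as in the logistic model) and purely illustrative numbers:

```python
import math

def varl(alpha: float, beta: float, p0: float = 0.05) -> float:
    """Metric value below which the predicted defect probability stays under p0."""
    return (math.log(p0 / (1.0 - p0)) - alpha) / beta

# Illustrative only: with alpha = -4.0, beta = 0.1, p0 = 0.05,
# varl(-4.0, 0.1) = (log(0.05/0.95) + 4.0) / 0.1 ≈ 10.56
```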
Summary Alves10
In addition to using VARL to identify thresholds as proposed by Shatnawi, we use an alternative method proposed by Alves et al.~\cite{alves10}. This method is unique in that it respects the underlying statistical distribution and scale of the metrics. It works as follows.
Every metric value is weighted according to the source lines of code (LOC) of its class. All the weighted metrics are then normalized, i.e., divided by the sum of all weights in the same system. Following this, the normalized metric values are sorted in ascending order. This is equivalent to computing a density function in which the x-axis represents the weight ratio (0--100\%) and the y-axis the metric scale.
Thresholds are then derived by choosing the percentage of the overall code that should fall below them. For instance, Alves et al. suggest using the 90\% quantile of the overall code to derive the threshold for a specific metric. This threshold is meaningful since it identifies the 10\% of the code that is worst with respect to that metric; values beyond the 90\% quantile represent a very high risk.
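A sketch of that weighted-quantile computation (assuming `values` holds one metric's raw value per class and `loc` the matching class sizes; both names are hypothetical):

```python
import numpy as np

def alves_threshold(values: np.ndarray, loc: np.ndarray, quantile: float = 0.90) -> float:
    """Threshold at the given quantile of the LOC-weighted metric distribution:
    scan values in ascending order until the chosen share of total LOC is covered."""
    order = np.argsort(values)                  # ascending metric scale
    cum = np.cumsum(loc[order]) / loc.sum()     # cumulative weight ratio, 0..1
    idx = np.searchsorted(cum, quantile)        # first class crossing the quantile
    return float(values[order][min(idx, len(values) - 1)])
```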
Deprecated Method
One of the first methods for finding thresholds was proposed by Erni and Lewerentz~\cite{erni96}. Their technique identifies thresholds from the data distribution, specifically the mean and the standard deviation of the metric values. They propose using values that lie beyond one standard deviation from the mean as thresholds. The minimum value $T_{min}$ is given by $T_{min} = \mu - \sigma$ and is used when the metric's definition considers very small values an indicator of problems; otherwise, $T_{max} = \mu + \sigma$ is used, when large metric values are considered problematic.
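For completeness, a sketch of that rule (names hypothetical):

```python
import numpy as np

def erni_thresholds(values: np.ndarray):
    """Mean +/- one standard deviation: T_min when small values signal
    problems, T_max when large values do."""
    mu, sigma = values.mean(), values.std()
    return mu - sigma, mu + sigma  # (T_min, T_max)
```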
Several researchers~\cite{shatnawi10, alves10} have pointed out that this method suffers from a few problems. Firstly, it does not consider the fault-proneness of classes when the thresholds are computed. Secondly, there is a lack of empirical validation of this methodology, which impedes reasonable comparisons.