timm opened this issue 8 years ago
Hahaha, the images in issue #5 were from those papers! They have some awesome references which I'll use for our paper.
what we need is for you to run some quick select queries over our test data. do you see this chart and "SEESAW"? that was an old tool of mine. but see how it does better than N other standard things?
what we need is this chart for the defect data sets with "SEESAW" replaced with "RANK" (the name of our method, currently, in this paper)
+---------+---------------------+-------------+
| | VARL (Shatnawi '10) | Filó et al. |
+ Metrics +---------------------+-------------+
| | Threshold | P-Value | Threshold |
+---------+-----------+---------+-------------+
| CBO | 1.78 | 0.000 | - |
+---------+-----------+---------+-------------+
| MAX_CC | 2.07 | 0.000 | - |
+---------+-----------+---------+-------------+
| AVG_CC | 0.86 | 0.003 | - |
+---------+-----------+---------+-------------+
| LCOM | 51 | 0.000 | 725 |
+---------+-----------+---------+-------------+
| LOC | 171.59 | 0.000 | 30 |
+---------+-----------+---------+-------------+
| NOC | - | - | 28 |
+---------+-----------+---------+-------------+
| CA | - | - | 39 |
+---------+-----------+---------+-------------+
| CE | - | - | 16 |
+---------+-----------+---------+-------------+
| DIT | - | - | 4 |
+---------+-----------+---------+-------------+
| WMC | - | - | 34 |
+---------+-----------+---------+-------------+
Good. Will need bibtex entries for all papers you use
Also... when these ranges are applied to the data, what effect do they have to the defect distribution?
Working on that, I'll have the results this evening.
Note that the home run would be this: if, for a data set d where RANK found good treatments, after division into good/bad (where bad = rows selected by threshold and good = all - bad), the defect density is about the same in good and bad (as witnessed by, say, box plots)
what are the thresholds in the tool that harman used to assess his refactorings?
Re: Harman's refactoring tool thresholds. I'm looking into this, will comment as soon as I find it.
when can i get results from applying those thresholds?
In about an hour.. fixing some bugs.
+--------+-----------+---------+
| Metric | Threshold | P-Value |
+========+===========+=========+
| wmc | 14.67 | 0.000 |
+--------+-----------+---------+
| cbo | 30.13 | 0.000 |
+--------+-----------+---------+
| lcom | 849.16 | 0.000 |
+--------+-----------+---------+
| loc | 2951.64 | 0.000 |
+--------+-----------+---------+
| cam | 0.84 | 0.000 |
+--------+-----------+---------+
| ic | 5.29 | 0.000 |
+--------+-----------+---------+
| max_cc | 34.47 | 0.000 |
+--------+-----------+---------+
| avg_cc | 14.63 | 0.003 |
+--------+-----------+---------+
rank , name , med , iqr
----------------------------------------------------
1 , Reduce cam , 12.35 , 13.25 ( --* | ), 7.83, 13.86, 21.08
1 , Reduce wmc , 12.65 , 11.45 ( -* | ), 9.64, 12.65, 21.08
1 , Reduce avg_cc , 14.46 , 5.42 ( -* | ), 12.05, 15.06, 17.47
1 , Reduce loc , 15.06 , 13.25 ( -* | ), 10.24, 15.66, 23.49
1 , Reduce ic , 15.36 , 7.23 ( --* | ), 11.45, 16.87, 18.67
1 , Reduce cbo , 16.57 , 9.04 ( --* | ), 12.05, 18.07, 21.08
1 , Reduce lcom , 17.77 , 12.05 ( ---* | ), 9.64, 18.07, 21.69
1 , Reduce max_cc , 19.58 , 7.23 ( * | ), 16.87, 19.88, 24.10
2 , RANK , 47.89 , 30.72 ( ---*| ), 37.95, 48.19, 68.67
+--------+-----------+---------+
| Metric | Threshold | P-Value |
+========+===========+=========+
| wmc | 84.99 | 0.000 |
+--------+-----------+---------+
| cbo | 22.17 | 0.002 |
+--------+-----------+---------+
| lcom | 16048.61 | 0.027 |
+--------+-----------+---------+
| loc | 1668.51 | 0.000 |
+--------+-----------+---------+
| cam | 2.29 | 0.000 |
+--------+-----------+---------+
| max_cc | 31.06 | 0.034 |
+--------+-----------+---------+
| avg_cc | 30.91 | 0.026 |
+--------+-----------+---------+
rank , name , med , iqr
----------------------------------------------------
1 , Reduce cam , 20.00 , 15.00 ( -* | ), 15.00, 20.00, 30.00
1 , Reduce loc , 20.00 , 10.00 ( --* | ), 15.00, 22.50, 25.00
1 , Reduce cbo , 21.25 , 10.00 ( -* | ), 17.50, 22.50, 27.50
1 , Reduce max_cc , 21.25 , 7.50 ( -* | ), 17.50, 22.50, 25.00
1 , Reduce lcom , 22.50 , 2.50 ( * | ), 22.50, 22.50, 25.00
1 , Reduce wmc , 23.75 , 10.00 ( --* | ), 17.50, 25.00, 27.50
1 , Reduce avg_cc , 23.75 , 15.00 ( ---* | ), 17.50, 30.00, 32.50
2 , RANK , 57.50 , 12.50 ( -|-* ), 47.50, 57.50, 60.00
+--------+-----------+---------+
| Metric | Threshold | P-Value |
+========+===========+=========+
| lcom | 4092.69 | 0.000 |
+--------+-----------+---------+
| lcom3 | 4.78 | 0.000 |
+--------+-----------+---------+
| loc | 71055.23 | 0.000 |
+--------+-----------+---------+
| cam | 3.34 | 0.000 |
+--------+-----------+---------+
| ic | 26.97 | 0.000 |
+--------+-----------+---------+
rank , name , med , iqr
----------------------------------------------------
1 , Reduce cam , 8.54 , 1.07 ( * | ), 8.19, 8.90, 9.25
1 , Reduce lcom3 , 8.72 , 3.56 ( * | ), 7.12, 8.90, 10.68
1 , Reduce lcom , 8.90 , 2.49 ( * | ), 7.47, 8.90, 9.96
1 , Reduce loc , 9.07 , 2.85 ( * | ), 7.47, 9.25, 10.32
1 , Reduce ic , 9.96 , 2.14 ( * | ), 8.90, 9.96, 11.03
2 , RANK , 23.13 , 6.41 ( --* | ), 19.22, 23.84, 25.62
+--------+-----------+---------+
| Metric | Threshold | P-Value |
+========+===========+=========+
| dit | 14.47 | 0.000 |
+--------+-----------+---------+
| rfc | 20.73 | 0.000 |
+--------+-----------+---------+
| ca | 2.37 | 0.000 |
+--------+-----------+---------+
| ce | 2.69 | 0.000 |
+--------+-----------+---------+
| npm | 11.55 | 0.000 |
+--------+-----------+---------+
| lcom3 | 4.16 | 0.000 |
+--------+-----------+---------+
| loc | 61269.41 | 0.000 |
+--------+-----------+---------+
| dam | 0.53 | 0.000 |
+--------+-----------+---------+
| moa | 8.88 | 0.000 |
+--------+-----------+---------+
| cbm | 6.76 | 0.000 |
+--------+-----------+---------+
| amc | 510.48 | 0.001 |
+--------+-----------+---------+
| avg_cc | 2.02 | 0.000 |
+--------+-----------+---------+
rank , name , med , iqr
----------------------------------------------------
1 , Reduce dit , 36.36 , 9.09 ( * | ), 36.36, 36.36, 45.45
1 , Reduce rfc , 36.36 , 9.09 ( * | ), 36.36, 36.36, 45.45
1 , Reduce ca , 36.36 , 18.18 ( --* | ), 27.27, 36.36, 45.45
1 , Reduce ce , 36.36 , 18.18 ( --* | ), 27.27, 36.36, 45.45
1 , Reduce npm , 36.36 , 18.18 ( --* | ), 27.27, 36.36, 45.45
1 , Reduce lcom3 , 36.36 , 9.09 ( --* | ), 27.27, 36.36, 36.36
1 , Reduce loc , 36.36 , 9.09 ( * | ), 36.36, 36.36, 45.45
1 , Reduce dam , 36.36 , 27.27 ( -----* | ), 18.18, 36.36, 45.45
1 , Reduce moa , 36.36 , 36.36 ( --------* | ), 9.09, 36.36, 45.45
1 , Reduce cbm , 36.36 , 9.09 ( * | ), 36.36, 36.36, 45.45
1 , Reduce amc , 36.36 , 9.09 ( * | ), 36.36, 36.36, 45.45
1 , RANK , 36.36 , 0.00 ( * | ), 36.36, 36.36, 36.36
1 , Reduce avg_cc , 40.91 , 9.09 ( ---* | ), 36.36, 45.45, 45.45
:smirk:
I only retain metrics with valid thresholds at P < 0.05.
so is the deal that the 2010 TSE paper defines a procedure for finding thresholds? and you applied that procedure and got the above? what is that procedure? please answer in enough detail so i can succinctly but authoritatively write this down in the paper.
so is the deal that the 2010 TSE paper defines a procedure for finding thresholds? and you applied that procedure and got the above?
Yup, that's right.
what is that procedure? please answer in enough detail so I can succinctly but authoritatively write this down in the paper.
In our work, we code fault-free classes as zero and faulty classes as one. We leverage this binary nature to apply Univariate Binary Logistic Regression (UBR) to identify metrics that have a significant association with the occurrence of defects. To set a cut-off for this association, we use a 95\% confidence level.
To identify thresholds for the metrics found significant, we use a method called Value of an Acceptable Risk Level (VARL), first proposed by Bender~\cite{bender99} for identifying thresholds in epidemiology studies. In his TSE 2010 article, Shatnawi~\cite{shatnawi10} endorsed the use of this method for identifying thresholds on object-oriented metrics in open source software systems.
The VARL method computes a cut-off value for each metric such that, below that threshold, the probability of a defect occurring is less than a given probability $p_0$. To do this, we fit the UBR model to each metric. For every significant metric, this yields a logistic regression model with a constant intercept ($\alpha$) and a coefficient ($\beta$) estimated by maximizing the log-likelihood function. With these, VARL is measured as follows:
\begin{equation} VARL = \frac{1}{\beta}\left(\log\left(\frac{p_0}{1-p_0}\right) - \alpha\right) \end{equation}
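As a quick sanity check, the formula can be coded directly. A minimal sketch follows; the function name and the example $\alpha$, $\beta$ values are made up for illustration, since in practice they come from the fitted UBR model:

```python
import math

def varl(alpha, beta, p0):
    """Value of an Acceptable Risk Level (Bender '99).

    alpha, beta: intercept and coefficient of the fitted
                 univariate logistic regression model.
    p0:          acceptable probability of a defect.
    Returns the metric value below which the predicted
    defect probability stays under p0.
    """
    return (math.log(p0 / (1 - p0)) - alpha) / beta

# Illustrative values only (not from our data): with
# alpha = -2.0, beta = 0.1 and p0 = 0.5, the log-odds
# term is zero, so VARL = 2.0 / 0.1 = 20.0
print(varl(-2.0, 0.1, 0.5))  # -> 20.0
```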
why are these thresholds different in different data sets?
It is highly unlikely that the metrics have a similar impact on all data sets. Therefore, we must run the model on a data set to identify metrics and corresponding thresholds that matter.
v.good
{\scriptsize \begin{tabular}{l@{~~~}l@{~~~}r@{~~~}r@{~~~}c}
\arrayrulecolor{lightgray}
\textbf{Rank} & \textbf{Treatment} & \textbf{Median} & \textbf{IQR} & \\\hline
1 & RANK & 57.83 & 29.52 & \quart{46}{33}{66}{1} \\
\hline 2 & Reduce cbo & 16.27 & 4.21 & \quart{15}{5}{18}{1} \\
2 & Reduce loc & 15.66 & 2.41 & \quart{16}{3}{17}{1} \\
2 & Reduce cam & 15.06 & 3.01 & \quart{16}{3}{17}{1} \\
2 & Reduce avg_cc & 15.66 & 3.01 & \quart{16}{3}{17}{1} \\
2 & Reduce ic & 15.66 & 3.61 & \quart{15}{4}{17}{1} \\
2 & Reduce lcom & 15.66 & 4.82 & \quart{14}{5}{17}{1} \\
2 & Reduce wmc & 15.66 & 3.01 & \quart{15}{4}{17}{1} \\
2 & Reduce max_cc & 15.06 & 2.41 & \quart{15}{3}{17}{1} \\
\hline \end{tabular}}
{\scriptsize \begin{tabular}{l@{~~~}l@{~~~}r@{~~~}r@{~~~}c}
\arrayrulecolor{lightgray}
\textbf{Rank} & \textbf{Treatment} & \textbf{Median} & \textbf{IQR} & \\\hline
1 & RANK & 52.5 & 17.5 & \quart{57}{22}{67}{1} \\
\hline 2 & Reduce avg_cc & 22.5 & 7.5 & \quart{25}{10}{28}{1} \\
2 & Reduce loc & 22.5 & 10.0 & \quart{22}{13}{28}{1} \\
2 & Reduce cbo & 22.5 & 10.0 & \quart{22}{13}{28}{1} \\
2 & Reduce wmc & 22.5 & 7.5 & \quart{22}{9}{28}{1} \\
2 & Reduce max_cc & 20.0 & 7.5 & \quart{22}{9}{25}{1} \\
2 & Reduce cam & 20.0 & 10.0 & \quart{22}{13}{25}{1} \\
\hline \end{tabular}}
{\scriptsize \begin{tabular}{l@{~~~}l@{~~~}r@{~~~}r@{~~~}c}
\arrayrulecolor{lightgray}
\textbf{Rank} & \textbf{Treatment} & \textbf{Median} & \textbf{IQR} & \\\hline
1 & RANK & 19.93 & 12.11 & \quart{46}{33}{54}{2} \\
\hline 2 & Reduce lcom & 9.25 & 1.43 & \quart{23}{4}{25}{2} \\
2 & Reduce ic & 9.25 & 1.43 & \quart{23}{4}{25}{2} \\
2 & Reduce lcom3 & 8.9 & 1.77 & \quart{22}{5}{24}{2} \\
\hline 3 & Reduce loc & 8.9 & 2.14 & \quart{20}{6}{24}{2} \\
3 & Reduce cam & 8.53 & 1.78 & \quart{21}{5}{23}{2} \\
\hline \end{tabular}}
{\scriptsize \begin{tabular}{l@{~~~}l@{~~~}r@{~~~}r@{~~~}c}
\arrayrulecolor{lightgray}
\textbf{Rank} & \textbf{Treatment} & \textbf{Median} & \textbf{IQR} & \\\hline
1 & Reduce dam & 36.36 & 27.28 & \quart{34}{34}{45}{1} \\
1 & Reduce moa & 36.36 & 18.19 & \quart{45}{23}{45}{1} \\
1 & Reduce rfc & 45.45 & 18.19 & \quart{45}{23}{57}{1} \\
1 & Reduce ca & 45.45 & 18.19 & \quart{45}{23}{57}{1} \\
1 & Reduce ce & 45.45 & 18.19 & \quart{45}{23}{57}{1} \\
1 & Reduce npm & 45.45 & 18.19 & \quart{45}{23}{57}{1} \\
1 & Reduce loc & 45.45 & 9.09 & \quart{45}{12}{57}{1} \\
1 & Reduce amc & 45.45 & 27.28 & \quart{45}{34}{57}{1} \\
1 & Reduce avg_cc & 45.45 & 18.19 & \quart{45}{23}{57}{1} \\
\hline 2 & Reduce dit & 36.36 & 36.37 & \quart{22}{46}{45}{1} \\
2 & Reduce lcom3 & 36.36 & 18.19 & \quart{45}{23}{45}{1} \\
2 & Reduce cbm & 36.36 & 18.19 & \quart{45}{23}{45}{1} \\
2 & RANK & 36.36 & 0.0 & \quart{45}{0}{45}{1} \\
\hline \end{tabular}}
{\scriptsize \begin{tabular}{l@{~~~}l@{~~~}r@{~~~}r@{~~~}c}
\arrayrulecolor{lightgray}
\textbf{Rank} & \textbf{Treatment} & \textbf{Median} & \textbf{IQR} & \\\hline
1 & RANK & 14.78 & 4.92 & \quart{57}{22}{66}{4} \\
1 & Reduce lcom3 & 15.76 & 1.96 & \quart{68}{9}{71}{4} \\
1 & Reduce moa & 15.76 & 2.45 & \quart{66}{11}{71}{4} \\
1 & Reduce cbo & 16.26 & 1.97 & \quart{71}{8}{73}{4} \\
1 & Reduce npm & 16.26 & 2.46 & \quart{68}{11}{73}{4} \\
1 & Reduce loc & 16.75 & 2.46 & \quart{68}{11}{75}{4} \\
\hline \end{tabular}}
re harman's threshold technique
There are 2 references.
@article{hermans15,
title={Detecting and refactoring code smells in spreadsheet formulas},
author={Hermans, Felienne and Pinzger, Martin and van Deursen, Arie},
journal={Empirical Software Engineering},
volume={20},
number={2},
pages={549--575},
year={2015},
publisher={Springer}
}
@inproceedings{Alves2010,
author = {Alves, Tiago L. and Ypma, Christiaan and Visser, Joost},
booktitle = {2010 IEEE Int. Conf. Softw. Maint.},
doi = {10.1109/ICSM.2010.5609747},
isbn = {978-1-4244-8630-4},
issn = {10636773},
mendeley-groups = {OO Metric Thresholds},
month = {sep},
pages = {1--10},
publisher = {IEEE},
title = {{Deriving metric thresholds from benchmark data}},
url = {http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5609747},
year = {2010}
}
They seem to use a benchmark data set to derive a set of common thresholds. Since we don't have that, we derive thresholds separately for every data set. The technique is straightforward.
In addition to using VARL to identify thresholds as proposed by Shatnawi, we use an alternative method proposed by Alves et al.~\cite{alves10}. This method is unique in that it respects the underlying statistical distribution and scale of the metrics. It works as follows.
Every metric value is weighted according to the source lines of code (LOC) of its class. All the weighted metrics are then normalized, i.e., divided by the sum of all weights of the same system. Following this, the normalized metric values are sorted in ascending order. This is equivalent to computing a density function in which the x-axis represents the weight ratio (0-100%) and the y-axis the metric scale.
Thresholds are then derived by choosing the percentage of the overall code that needs to be represented. For instance, Alves et al. suggest using the 90% quantile of the overall code to derive the threshold for a specific metric. This threshold is meaningful since it can be used to identify the 10% worst code with respect to a specific metric, and thresholds beyond the 90\% quantile represent very high risk.
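The steps above can be sketched in a few lines (the function name and the toy data are mine, not from our scripts):

```python
def alves_threshold(metric_values, loc_values, quantile=0.90):
    """Derive a metric threshold following Alves et al. '10:
    weight each class's metric value by its LOC, then take the
    metric value reached at the chosen quantile of cumulative
    weight over classes sorted by ascending metric value."""
    total = float(sum(loc_values))
    cumulative = 0.0
    for metric, loc in sorted(zip(metric_values, loc_values)):
        cumulative += loc
        # threshold = first metric value covering `quantile` of the code
        if cumulative / total >= quantile:
            return metric
    return max(metric_values)

# Toy example: four classes; the largest class (70% of the LOC)
# also has the highest metric value, so it sets the 90% threshold.
print(alves_threshold([1, 2, 3, 4], [10, 10, 10, 70]))  # -> 4
```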
{\scriptsize \begin{tabular}{l@{~~~}l@{~~~}r@{~~~}r@{~~~}c}
\arrayrulecolor{lightgray}
\textbf{Rank} & \textbf{Treatment} & \textbf{Median} & \textbf{IQR} & \\\hline
1 & RANK & 63.25 & 24.1 & \quart{53}{26}{70}{1} \\
\hline 2 & Reduce wmc & 22.29 & 6.63 & \quart{19}{7}{24}{1} \\
2 & Reduce max_cc & 21.69 & 7.23 & \quart{18}{8}{24}{1} \\
2 & Reduce loc & 21.69 & 4.82 & \quart{20}{6}{24}{1} \\
2 & Reduce lcom & 21.69 & 4.82 & \quart{22}{6}{24}{1} \\
2 & Reduce cbo & 21.69 & 4.82 & \quart{21}{5}{24}{1} \\
2 & Reduce ic & 21.69 & 5.43 & \quart{20}{6}{24}{1} \\
2 & Reduce cbm & 21.08 & 5.43 & \quart{20}{6}{23}{1} \\
2 & Reduce dam & 21.08 & 6.02 & \quart{21}{7}{23}{1} \\
2 & Reduce npm & 21.08 & 5.43 & \quart{20}{6}{23}{1} \\
2 & Reduce rfc & 21.08 & 3.61 & \quart{21}{4}{23}{1} \\
2 & Reduce cam & 21.08 & 4.22 & \quart{20}{5}{23}{1} \\
2 & Reduce moa & 19.88 & 5.42 & \quart{20}{6}{22}{1} \\
2 & Reduce ce & 20.48 & 4.21 & \quart{21}{5}{22}{1} \\
2 & Reduce avg_cc & 19.88 & 7.23 & \quart{19}{8}{22}{1} \\
\hline \end{tabular}}
{\scriptsize \begin{tabular}{l@{~~~}l@{~~~}r@{~~~}r@{~~~}c}
\arrayrulecolor{lightgray}
\textbf{Rank} & \textbf{Treatment} & \textbf{Median} & \textbf{IQR} & \\\hline
1 & Reduce noc & 30.0 & 15.0 & \quart{31}{20}{38}{1} \\
1 & Reduce amc & 30.0 & 12.5 & \quart{31}{16}{38}{1} \\
1 & Reduce ce & 30.0 & 12.5 & \quart{35}{16}{38}{1} \\
1 & Reduce lcom & 32.5 & 10.0 & \quart{35}{12}{41}{1} \\
1 & Reduce loc & 32.5 & 12.5 & \quart{35}{16}{41}{1} \\
1 & Reduce wmc & 32.5 & 17.5 & \quart{31}{23}{41}{1} \\
1 & Reduce cbo & 35.0 & 12.5 & \quart{35}{16}{44}{1} \\
1 & Reduce rfc & 35.0 & 12.5 & \quart{35}{16}{44}{1} \\
1 & Reduce npm & 35.0 & 7.5 & \quart{38}{9}{44}{1} \\
1 & Reduce cam & 35.0 & 15.0 & \quart{38}{19}{44}{1} \\
1 & Reduce max_cc & 35.0 & 12.5 & \quart{35}{16}{44}{1} \\
1 & Reduce avg_cc & 35.0 & 15.0 & \quart{35}{19}{44}{1} \\
1 & Reduce cbm & 40.0 & 17.5 & \quart{38}{22}{51}{1} \\
\hline 2 & RANK & 52.5 & 20.0 & \quart{54}{25}{67}{1} \\
\hline \end{tabular}}
{\scriptsize \begin{tabular}{l@{~~~}l@{~~~}r@{~~~}r@{~~~}c}
\arrayrulecolor{lightgray}
\textbf{Rank} & \textbf{Treatment} & \textbf{Median} & \textbf{IQR} & \\\hline
1 & Reduce wmc & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce dit & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce cbo & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce rfc & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce lcom & 36.36 & 36.36 & \quart{0}{79}{79}{2} \\
1 & Reduce ca & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce ce & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce npm & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce lcom3 & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce loc & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce dam & 36.36 & 36.36 & \quart{0}{79}{79}{2} \\
1 & Reduce moa & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce cam & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce ic & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce cbm & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce amc & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce max_cc & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & Reduce avg_cc & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
1 & RANK & 36.36 & 0.0 & \quart{79}{0}{79}{2} \\
\hline \end{tabular}}
{\scriptsize \begin{tabular}{l@{~~~}l@{~~~}r@{~~~}r@{~~~}c}
\arrayrulecolor{lightgray}
\textbf{Rank} & \textbf{Treatment} & \textbf{Median} & \textbf{IQR} & \\\hline
1 & Reduce lcom & 14.78 & 2.46 & \quart{51}{9}{57}{3} \\
1 & Reduce dam & 14.78 & 1.97 & \quart{55}{7}{57}{3} \\
1 & Reduce npm & 15.27 & 2.96 & \quart{53}{11}{59}{3} \\
1 & Reduce cam & 15.27 & 2.46 & \quart{55}{9}{59}{3} \\
1 & Reduce rfc & 15.76 & 1.48 & \quart{57}{5}{60}{3} \\
1 & Reduce lcom3 & 15.76 & 1.97 & \quart{57}{7}{60}{3} \\
1 & Reduce loc & 15.76 & 2.96 & \quart{53}{11}{60}{3} \\
\hline 2 & Reduce cbo & 15.76 & 2.94 & \quart{55}{11}{60}{3} \\
2 & Reduce cbm & 15.76 & 2.45 & \quart{57}{9}{60}{3} \\
2 & Reduce wmc & 16.26 & 2.94 & \quart{55}{11}{62}{3} \\
2 & Reduce ce & 16.26 & 2.45 & \quart{57}{9}{62}{3} \\
2 & Reduce amc & 16.26 & 2.46 & \quart{55}{9}{62}{3} \\
2 & Reduce moa & 16.26 & 1.96 & \quart{59}{7}{62}{3} \\
2 & RANK & 16.75 & 7.88 & \quart{49}{30}{64}{3} \\
\hline \end{tabular}}
{\scriptsize \begin{tabular}{l@{~~~}l@{~~~}r@{~~~}r@{~~~}c}
\arrayrulecolor{lightgray}
\textbf{Rank} & \textbf{Treatment} & \textbf{Median} & \textbf{IQR} & \\\hline
1 & Reduce lcom & 9.61 & 2.86 & \quart{25}{9}{28}{2} \\
1 & Reduce npm & 9.61 & 4.27 & \quart{23}{13}{28}{2} \\
1 & Reduce lcom3 & 9.61 & 2.15 & \quart{25}{7}{28}{2} \\
1 & Reduce ic & 9.96 & 3.2 & \quart{26}{10}{29}{2} \\
1 & Reduce amc & 9.96 & 1.78 & \quart{26}{6}{29}{2} \\
1 & Reduce ce & 9.96 & 2.86 & \quart{25}{9}{29}{2} \\
1 & Reduce rfc & 10.32 & 2.84 & \quart{26}{9}{30}{2} \\
1 & Reduce moa & 10.32 & 2.13 & \quart{26}{7}{30}{2} \\
1 & Reduce mfa & 10.32 & 3.21 & \quart{25}{10}{30}{2} \\
1 & Reduce wmc & 10.32 & 2.13 & \quart{28}{7}{30}{2} \\
\hline 2 & Reduce dit & 10.68 & 2.13 & \quart{28}{7}{32}{2} \\
2 & Reduce cam & 10.68 & 3.21 & \quart{25}{10}{32}{2} \\
2 & Reduce max_cc & 10.32 & 3.2 & \quart{26}{10}{30}{2} \\
2 & Reduce loc & 11.03 & 3.2 & \quart{28}{10}{33}{2} \\
2 & Reduce cbm & 11.39 & 2.14 & \quart{29}{7}{34}{2} \\
\hline 3 & RANK & 20.64 & 8.9 & \quart{53}{26}{61}{2} \\
\hline \end{tabular}}
One of the first methods for finding thresholds was proposed by Erni and Lewerentz~\cite{erni96}. Their technique identifies thresholds from the data distribution, specifically the mean and the standard deviation of the metric values. They propose using values that lie one standard deviation beyond the mean as thresholds. The minimum value $T_{min}=\mu-\sigma$ is used when the metric definition considers very small values as an indicator of problems; otherwise, $T_{max}=\mu+\sigma$ is used, when large metric values are considered problematic.
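In code this is just the following sketch (the function name is mine, and I assume the population standard deviation since the paper does not say which variant):

```python
import statistics

def erni_thresholds(values):
    """Mean +/- one standard deviation thresholds (Erni & Lewerentz '96).
    Returns (t_min, t_max): use t_min when small metric values signal
    problems, t_max when large values do."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population std. dev. (assumption)
    return mu - sigma, mu + sigma

t_min, t_max = erni_thresholds([2, 4, 4, 4, 5, 5, 7, 9])
print(t_min, t_max)  # mean 5, std. dev. 2 -> thresholds 3 and 7
```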
Several researchers~\cite{shatnawi10}~\cite{alves10} have pointed out that this method is subject to a few problems. Firstly, it doesn't consider the fault-proneness of classes when the thresholds are computed. Secondly, there is a lack of empirical validation of this methodology, which impedes reasonable comparisons.
Need to know what happens when thresholds from N sources are applied to our data sets. And i need that written up and into the paper.
BTW, here's a paper that does what we hate: mentions metrics but not thresholds