ai-se / cocreport

rework of an ICSE submission for FSE'15

working the review form emse #25

Closed timm closed 8 years ago

timm commented 8 years ago

@bigfatnoob: the real big tasks are:

COMMENTS FOR THE AUTHOR: Special issue editor comments: thanks for your submission. The reviewers identify a number of strong points about your submission, but also a number of issues related to presentation, emphasis (i.e., more emphasis on the 'negative' result is needed) and clarification of related work. Please take these comments into account when preparing your revision.

looks like we dump the peeking results

Abstract. One of the puzzling claims of the paper is "For the non-COCOMO data sets, our newer estimation methods performed better than older methods". However, the only place where the non-COCOMO datasets are considered is Section 5.2. In this section the authors state that "the key feature of these data sets is COCOMO and COCONUT could not be applied to these since, apart from some effort measure, this data is described using attributes with little (if any) similarity to Figure 3". However, no new approach has been presented to tackle these datasets; the technique that produced the best results is PEEKING2, introduced by Vasil Papakroni [56]. Papakroni is not among the authors of the current paper, so the reference to "our newer estimation methods" is strange.

irrelevant: dumping peeking

I'm also concerned about the choice of the datasets. Two of the datasets considered are affiliated with COCOMO, so they can provide a biased perspective on the power of the results. Are there additional datasets one might consider, or are the COCOMO requirements so restrictive that no such datasets can be found except at NASA?

sorry, that's it. and we moved heaven and earth to get these

— "evidence: we know of only two such report in the last five years of conferences on software analytics [47, 54])".There are more software engineering papers that have published errata in the recent years (but not necessarily via the main stream venues such as conferences and journals): JungWon Byun , SungYul Rhew, ManSoo Hwang, Vijayan Sugumaran, SooYong Park, SooJin Park Erratum to: Metrics for measuring the consistencies of requirements with objectives and constraints. Requirements Engineering March 2014, Volume 19, Issue 1`

@timm: minor add

Failure reports, negative results, and reports on previously published errors are not the same; the introduction somehow puts them all together and tries to explain them with the same reasons.

@timm: minor add

"Secondly, it is not standard practice in software analytics for researchers to benchmark their latest results against some supposedly simpler "straw man" method." My guess would be that in many cases no such method is available making benchmarking impossible. When the domain starts to become more mature, this kind of comparisons are bing conducted (see, e.g., code cloning or a more recent work on tag inference for Stack Overflow posts).

@timm: minor add. cohen reference

3.2 Choice of learners. The authors should have been more careful with introducing the techniques they use. It is not very clear what the "triangle function of Walkerden and Jeffery [68]" is precisely. Is it merely the idea that three variables should be combined as a weighted sum? Or is this a particular approach that can be used to derive the coefficients 50, 33 and 17? I've checked the paper by Walkerden and Jeffery [68] but the triangle function is not mentioned.

@timm: minor edit
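for the revision, a minimal sketch of the reading we'd spell out: combine the three nearest analogies' efforts as a weighted sum with weights 50%, 33%, 17%. the function name and the weights-as-fractions reading are our assumptions, not confirmed by [68]:

```python
import numpy as np

def triangle_estimate(nearest_three_efforts):
    # Assumed reading of the "triangle function": the nearest analogy
    # gets weight 0.50, the second 0.33, the third 0.17.
    w = np.array([0.50, 0.33, 0.17])
    return float(np.dot(w, nearest_three_efforts))

# e.g. the three nearest analog projects took 120, 90 and 200 person-months:
print(triangle_estimate([120, 90, 200]))  # 0.5*120 + 0.33*90 + 0.17*200 = 123.7
```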

Is m.\mu the median of m (as suggested by the preceding discussion) or the mean of m (as seems to follow from the paper by Mittas & Angelis [52])? Are the means reliable? Why wouldn't one use a one-shot multiple comparisons procedure such as T~ of Konietschke et al.?

@bigfatnoob: are we doing mean or median?

4.4 COCOMO with Incorrect Size Estimates. "That is, the parametric estimation method being recommended here is not unduly affected by noise where the KLOC values vary up to 50% of their original value." I wonder whether one can expect size estimates to be within 50% of the real value. Laranjeira in "Software Size Estimation of Object-Oriented Systems" (1990) has reported size estimates that were 3-4 times higher or lower than the actual system size; in two cases the estimates were as bad as 6 times off. This makes me wonder whether the "not unduly sensitive" in the conclusion does not overstate the results.

@bigfatnoob: can we repeat this part with max errors of +/- 0.5, 1, 2, 4?
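a minimal sketch of that rerun, assuming the noise is multiplicative (so a "max error of 4" means sizes up to (1+4)x higher or lower, matching Laranjeira's observation); the function name and the log-uniform noise model are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def perturb_kloc(kloc, k):
    # Assumed noise model: scale each KLOC by a factor drawn log-uniformly
    # from [1/k, k], so k=1.5 allows sizes ~50% off and k=5 allows the
    # several-times-off estimates Laranjeira reports.
    factors = np.exp(rng.uniform(np.log(1.0 / k), np.log(k), size=len(kloc)))
    return np.asarray(kloc) * factors

kloc = np.array([10.0, 32.0, 128.0])
for k in (1.5, 2, 3, 5):  # max errors 0.5, 1, 2, 4 read as k = 1 + err
    noisy = perturb_kloc(kloc, k)
    # ...re-run COCOMO/COCONUT on `noisy` and recompute the error charts...
```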

Minor: Section 3: "reserach"; Section 4.3.4: "thse"; Section 4.4: redundant ")". The entire second half of the PDF contains the output of the LaTeX compilation. I presume that something went wrong during the submission process.

@timm: minor edit

Reviewer #2: Generally, the paper has a large number of different elements, which makes it difficult to follow. If you are interested in the value of COCOMO-II, why not stick to that?

so let's dump peeking2

@bigfatnoob: sometime in the near future, we need a second paper singing the praises of peeking2.

@brittjay0104: we need some small data sets to study to check out our preferred learner for small data sets. what is your future... when do you expect N>1 data sets in your world?

The "strawman" in inappropriate (use regression with a logarithmic transformation instead [5]) and the use of MRE is incorrect (use an R-squared or similar measure).

@timm. the strawman is the standard in the cocomo world. but the reviewer seems enamored with Whigham; see below.

@bigfatnoob. need to read the Whigham paper. decide if we want to do that

Also for a special issue concerned with negative results, the authors seem keen to promote the positive value of their PEEKING2 model rather more than the results imply.

@timm dump peeking

A "strawman" should not be something that is so foolish that it is impossible not to outdo. Researchers in the past have used simple regression without a logarithmic transformation and MMRE as an accuracy measure for comparison with new estimation methods. Kitchenham & Mendes pointed out this was not a valid approach.

@timm: we don't... we just say that this was what was used in the past and we do no better than that. be careful here. reviewer 2 has well-formed views

RQ1 I don't think the issue was ever about "just lines of code"; it was about simple regression rather than a complex deterministic model. However, the function point community do recommend taking the average of the productivity (Effort/FP) and then multiplying by the FP estimate of a project needing a cost estimate.

@timm: issue is what is standard in the cocomo world. and we don't know the languages here, so FP counts are hard
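for the record, the productivity-average method the reviewer describes is trivial to state in code; this sketch (names hypothetical) mostly shows why it needs the FP counts we don't have:

```python
import numpy as np

def fp_effort_estimate(past_effort, past_fp, new_fp):
    # Average past productivity (effort per function point), then
    # multiply by the new project's FP estimate.
    productivity = np.mean(np.asarray(past_effort) / np.asarray(past_fp))
    return productivity * new_fp

# e.g. three past projects, then a 400-FP project to estimate:
print(fp_effort_estimate([500, 900, 300], [250, 400, 200], 400))  # ~766.7
```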

This paper uses two proprietary data sets which means the results cannot be independently validated. This is not very satisfactory. We have too many data mining studies that cannot be reproduced.

@timm: msoft does that all the time. this work is more reproducible than most

Figure 2 . The Coc81 data set has 63 projects (or did you remove one of them?)

@bigfatnoob. please check

Section 2.3 Jorgensen didn't talk about "Delphi-based" methods; he talked about "expert-based" estimation. Also, at the time most researchers emphasized that automated methods were superior to expert opinion, so his result might be described as "negative".

@timm: edit

Section 3 Equation (3). You should not be using the biased MRE measure. The poor behavior of MRE has been well documented in leading journals (see refs. [1], [3] & [4]). It doesn't matter whether you use the median or mean MRE; the metric is still biased. Furthermore, if you use it as a goodness-of-fit measure as well as a performance measure you compound the error [5]. If you really want a measure between 0 and 1 to assess performance, use an R-squared equivalent such as the ratio of the squared difference between the estimate and actual divided by the squared difference between the actual and the average effort. Use a robust version if you prefer, that is, the absolute difference between actual and estimate divided by the absolute difference between the actual and the median.

@bigfatnoob: can you redo ALL our results charts (not the later stuff) using the squared difference between the estimate and actual divided by the squared difference between the actual and the average effort.
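a minimal sketch of the two measures the reviewer asks for (function names are ours); note the first is 1 - R^2, so lower is better:

```python
import numpy as np

def squared_error_ratio(actual, predicted):
    # Squared (actual - estimate) error over squared (actual - mean)
    # error: 0 is perfect, 1 is no better than always guessing the mean.
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.sum((a - p) ** 2) / np.sum((a - a.mean()) ** 2)

def robust_error_ratio(actual, predicted):
    # Robust variant: absolute errors against guessing the median.
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.sum(np.abs(a - p)) / np.sum(np.abs(a - np.median(a)))
```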

Shepperd and MacDonell's statistic is more of a test as to whether any prediction is taking place at all. (It's similar to the mean in a regression model being the default best estimate if the attributes do not affect the outcome measure.)

@timm edit
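note to self on what [5] actually proposes: their standardized accuracy SA compares a model's mean absolute residual against that of random guessing. a rough sketch (the permutation approximation and the function name are ours):

```python
import numpy as np

def standardized_accuracy(actual, predicted, runs=1000, seed=1):
    # SA = 1 - MAR/MAR_p0 [5]: MAR_p0 is the mean absolute residual of
    # random guessing, approximated here by predicting a random
    # permutation of the actuals. SA near 0 = no real prediction.
    a = np.asarray(actual, float)
    mar = np.mean(np.abs(a - np.asarray(predicted, float)))
    rng = np.random.default_rng(seed)
    mar_p0 = np.mean([np.mean(np.abs(a - rng.permutation(a))) for _ in range(runs)])
    return 1.0 - mar / mar_p0
```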

Whigham et al.'s default model seems more appropriate as a useful strawman (see [6]) and is certainly better than lines of code.

@bigfatnoob: please read the Whigham paper. any technical issues that prevent us using it?
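as a starting point for that read: the core of a Whigham-style baseline is just least squares on log-transformed size and effort. this sketch follows [6] only loosely (no bias correction, no guard for zero-size projects):

```python
import numpy as np

def loglog_baseline(size, effort, new_size):
    # Fit effort = a * size^b by OLS in log-log space,
    # then back-transform to predict.
    b, log_a = np.polyfit(np.log(size), np.log(effort), 1)
    return np.exp(log_a) * np.asarray(new_size, float) ** b

# e.g. fit on past (KLOC, person-month) pairs, predict a 50-KLOC project:
print(loglog_baseline([10, 32, 128], [24, 96, 500], [50]))
```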

From your description it appears that TEAK removes projects; how then do you do a fair comparison with other methods? If you remove hard-to-predict projects it's easy to improve any accuracy measure compared with a method that predicts all projects. You imply that PEEKING2 is more aggressive than TEAK; does that mean it rejects projects as well?

@timm: don't show TEAK

Section 4.2 The COC81 results must be partly due to the way in which the COCOMO-II model was originally developed. The model was developed to fit the dataset which included the COCOMO81 data. It's not surprising that the model fits the COC81 data well.

@timm: add this comment

The results for NASA93 are more interesting since the data is public domain. However, given that there is no description in COCOMO-II of the origin of the 161 projects, it is possible that the datasets include duplicate projects. In case you think this is impossible, please note that there are two data sets in the PROMISE repository that contain duplicate projects (Cocomo NASA & Cocomo NASA 2, which incidentally has 93 projects). Since Barry Boehm is a co-author, it should be simple to confirm whether or not any of the NASA93 projects were used in the construction of COCOMO-II.

@timm: ask jairus

The results for NASA 10 and COC05 are hearsay.

@timm: i guess that is the public domain stuff. check with jairus and ye about making that data public.

4.3.1 This seems like a good idea. There are too many levels (which should be treated as dummy variables) for a data-based calibration process to work well without a very large database.

@timm: acknowledge
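to make the "too many levels" point concrete in the revision, a tiny hypothetical illustration of the dummy-variable blow-up:

```python
import pandas as pd

# Hypothetical slice of a COCOMO-style data set: each attribute takes
# an ordinal level such as vl/l/n/h/vh/xh.
df = pd.DataFrame({"rely": ["n", "h", "vh"], "cplx": ["l", "n", "xh"]})

dummies = pd.get_dummies(df)  # one 0/1 column per (attribute, level) pair
print(dummies.columns.tolist())
# With ~17 attributes x up to 6 levels each, that is ~100 coefficients
# to calibrate -- hence the need for a very large database.
```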

4.4 It would be useful to plot the actual size against the recalculated size. If the revised size is still correlated with the actual size you may only have demonstrated that the exponential term in the COCOMO model is close to 1.

@bigfatnoob: please do
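a minimal sketch of the requested plot (variable names hypothetical, matching the Section 4.4 rerun):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_size_vs_noisy_size(kloc, noisy_kloc):
    # If original and recalculated sizes stay strongly correlated, the
    # +/-50% noise may mostly wash out through COCOMO's near-1 exponent,
    # which is the reviewer's point.
    r = np.corrcoef(np.log(kloc), np.log(noisy_kloc))[0, 1]
    plt.scatter(kloc, noisy_kloc)
    plt.xscale("log"); plt.yscale("log")
    plt.xlabel("actual KLOC"); plt.ylabel("recalculated KLOC")
    plt.title("log-log correlation r = %.2f" % r)
    plt.show()
```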

Section 5 I am not clear why the authors felt it necessary to include Figures 13 & 14 other than to have something positive to report. However, there does not appear to be much difference between PEEKING2 and Knear(3) at least not really sufficient to claim that PEEKING2 performed best.

@timm: remove peeking

Several times you talk about the problem of large variances in effort estimation and suggest working on variance reduction; but surely that is what cost estimation models are all about. Do you have anything more specific to say?

@timm: add

[1] Foss, T., Stensrud, E., Kitchenham, B., and Myrtveit, I. A simulation study of the model evaluation criterion MMRE. IEEE Trans. on Softw. Eng. 29, 11 (2003), 985-995.

[2] Kitchenham, B., and Mendes, E. Why comparative effort prediction studies may be invalid. In PROMISE 2009 (2009), ACM.

[3] Myrtveit, I., and Stensrud, E. Validity and reliability of evaluation procedures in comparative studies of effort prediction models. Empirical Software Engineering 17, 1-2 (2012), 23-33.

[4] Myrtveit, I., Stensrud, E., and Shepperd, M. Reliability and validity in comparative studies of software prediction models. IEEE Trans. on Softw. Eng. 31, 5 (2005), 380-391.

[5] Shepperd, M., and MacDonell, S. Evaluating prediction systems in software project estimation. Info. & Softw. Technol. 54, 8 (2012), 820-827.

[6] Whigham, P., Owen, C., and Macdonell, S. A baseline model for software effort estimation. ACM Trans. on Softw. Eng. & Methodol. 24, 3 (2015).

bigfatnoob commented 8 years ago

Is m.\mu the median of m (as suggested by the preceding discussion) or the mean of m (as seems to follow from the paper by Mittas & Angelis [52])? Are the means reliable? Why wouldn't one use a one-shot multiple comparisons procedure such as T~ of Konietschke et al.?

We are using the median since means are unreliable. But yes, Mittas and Angelis suggest using the mean. We should probably state that we have adopted their technique but replaced the mean with the median, due to the mean's unreliability in the presence of outliers.
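the one-liner we'd state in the revision, assuming Equation (3)'s MRE (function names hypothetical):

```python
import numpy as np

def mre(actual, predicted):
    # Magnitude of relative error per project (Equation 3).
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.abs(a - p) / a

def mdmre(actual, predicted):
    # We summarize with the median (not Mittas & Angelis's mean)
    # because the median is robust to effort-data outliers.
    return float(np.median(mre(actual, predicted)))
```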

timm commented 8 years ago

just a thought, not a "do" just yet, but if we ran peeking2 against Whigham and did better we could offer a better baseline... fully coded in python for all to use.

timm commented 8 years ago

From jairus:

We cannot do FP. We do not have the info. And we have found that, for this older NASA data set, language is insignificant.

bigfatnoob commented 8 years ago

@timm

Figure 2 . The Coc81 data set has 63 projects (or did you remove one of them?)

Our dataset has 63 projects. It's a mistake in the figure we've used.