2nd IST review - Githubissues

amritbhanu commented 7 years ago

Deadline: Nov 20, 2017 Reviews:

Sorry for the delay in the response to your revision. Both reviewers acknowledge good progress, but have some outstanding issue you need to look into:

[x] Why categorizing StackExchange data into relevant and non relevant categories useful (Motivation of the SENG related example)
[x] Use LDA-GA for the classification task in Section 5.3 and demonstrate whether or not LDADE outperforms LDA-GA
[x] Provide a deeper discussion on the very low number of evaluations of the proposed algorithm.

-Reviewer 1

Thank you for the addressing my comments. I'm happy with most of the changes made.

Let me clear one misunderstanding.

A.2. The experiments need to be improved in the following ways/ First, LDA has been used to help many realistic software engineering tasks (for example, tasks considered by papers included in Table 2. There is a need to expand the experiments to compare LDA and LDADE on those realistic software engineering tasks. It is unclear if the task considered in the experiments (Section 5.3) is realistic (why categorizing StackExchange data into relevant and non relevant categories useful?). More than one tasks should have been considered (similar like Panichella et al.’s work).

Our goal in this revision was to be as responsive as possible to your suggestions (as as you can see below, in A.4, A.5 and A.7, we were able to implement much of your advice.) But as to applying this to “more realistic SE task”, we might have a different perspective on what is a “valid” SE task. We say this since it sounds like you are saying the unsupervised tasks conducted by 23 of 28 recent highly cited LDA papers are not “realistic”? And that we should evaluate this paper only via the supervised tasks seen in 4 of the 28 papers? That is not our view, please see the notes above in A.1 on the need for stability in unsupervised SE tasks.

From my original comment "LDA has been used to help many realistic software engineering tasks (for example, tasks considered by papers included in Table 2)",

I'm not saying that the unsupervised tasks considered prior work listed in Table 2 are not realistic. The point that I would like to convey is: the paper does not evaluate LDADE and LDA directly for any of the supervised or unsupervised tasks listed in Table 2. Considering several of the unsupervised or supervised tasks in Table 2 will be good. For example, consider paper [7] listed in Table 2, can using LDADE give additional/differing conclusion about topics that developers are talking about in Stack Overflow? I leave it to the authors to either perform this analysis or not, but at least the limitation should be acknowledged in the paper and maybe considered as a future work.

[x] 1.1 add in stabilizied topics from ahmed

Also, there is a need to better motivate the supervised task considered in Section 5.3 (to answer the question: why categorizing StackExchange data into relevant and non relevant categories useful?).

[x] 1.2 Talk about tar and the cost of reading legal cods

There is also a need to use LDA-GA for the classification task in Section 5.3 and demonstrate whether or not LDADE outperforms LDA-GA. Even if LDADE does not outperform LDA-GA, it is still okay. The paper can inform researchers to use LDADE for unsupervised tasks or tasks requiring short parameter tuning time or tasks requiring better stability, and LDA-GA for supervised tasks or tasks where parameter tuning can be done overnight. However, without the comparison, researchers may not be able to make such a decision in a future work.

[x] 1.3 great odea. see section ee e.wejwekwe wehre tid that at your suggestion. note that

-Reviewer 2

First of all, I would like to thank the authors for the effort they put to address my comments. The paper is improved a lot.

thank you for ath comment

But, I have still a couple of comments.

I was really surprised that with just 30 evaluations the algorithm is able to achieve a stable configuration. This is - in my opinion - a result that in some way belittles the issue raised in the paper. If by exploring only 30 different configurations I’m able to obtain a stable solution, probably LDA is in general stable and I don’t need a sophisticated approach to find a much stable solution.

[x] 2.1 note that DEs ar actually smarter than "do 30 evals" . . (1) samplig for a spalce that is contiually improving. invariant after 1 gen:every inte in the frontier is better than at least one other thing . so in gen2 when we pick 3 things, we are picking 3 good things. (2) DEs supprot vetor levelmutations that retain the assocaitions ebtween variables in a space. (which btw, other schemes like GAs, SA dont). text into the paper. reference it ere

I just need to perform some trials and pick the configuration that provides the best results. Indeed, which are the benefits of DE as compared with a Random Search? I would like to see in the paper a deeper discussion on the very low number of evaluations of the proposed algorithm and – if possible – a comparison with Random Search.

[x] 2.2 DO IT!

In addition, I concur with Reviewer 1 that the proposed algorithm should be experimented in a scenario more specific for the software engineering community, such as traceability link recovery, feature location, or software artefact labelling.

[x] 2.3 redo ahmed you got it

timm commented 7 years ago

[x] in fig15, caption can you expla F3CF7P30? on make the legend more comprehensibe
[x] btw why are figure 3,6,7,13,14,15 shittier than the rest? excel low res?
[x] for all figs, try to push the to the page whe they are mentioned in text. ie fig 16 17 18
[x] and watch out for table 8. before i fixed it is was 2 pages away from where it is discussed

timm commented 7 years ago

this is balls "owledge that for now we only validated the im-provement of LDADE over LDA in an unsupervised task (seeTable 3 and 8). The gain from unsupervised task may not beprominent when tested on an supervised task. We would like totest LDADE’s advantage in an supervised task"

u did do a supervised task fig 8,9,20

so why do you say different?

[ ] I wanted to refer to the SE supervised task which reviewers mentioned. I wanted to show our limitations. We only performed supervised task of categorizing StackExh data into relavant and non relevant which reviewers do not believe to be a supervised task

ai-se / Tuning-LDA

2nd IST review #7