ai-se / Tuning-LDA

IST Journal Tuning LDA - LDADE
https://www.sciencedirect.com/science/article/pii/S0950584917300861

reply to ist reviewers #6

Open timm opened 7 years ago

timm commented 7 years ago

Comments from the editors and reviewers:

Both reviewers consider the research of great interest. LDA is a widely used technique, and studying and improving its performance is a valuable research topic.

Both reviewers raised a number of substantial concerns that need to be looked into. Mainly they are related to

Reviewer 1

  1. Topic modeling has been used in many software engineering papers. The paper shows that topic modeling, especially LDA, suffers from order effects -- which is new and interesting. Depending on the order data is presented to LDA, it produces a different model. Search-based software engineering can be used to tune LDA parameters so that order effect is minimized. LDADE (LDA mixed with Differential Evolution) is presented. It is shown to improve standard LDA for various settings including classification of StackExchange websites into relevant and non-relevant documents -- which is good.

Thank you for those kind words.
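As a concrete illustration of the order effect summarized above, a minimal sketch on toy random counts (this assumes scikit-learn's online variational LDA, where minibatch order matters; it is not the paper's experimental setup):

```python
# Sketch: LDA order effects on a hypothetical toy corpus.
# With online variational Bayes, document order changes the minibatch
# updates, so shuffled input can yield different topic-word distributions
# even with an identical random seed.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(40, 30))      # 40 docs, 30-term vocabulary

def fit_topics(counts):
    lda = LatentDirichletAllocation(
        n_components=5, learning_method="online", batch_size=8,
        random_state=1)                    # same seed for both runs
    lda.fit(counts)
    # normalize rows to topic-word probability distributions
    return lda.components_ / lda.components_.sum(axis=1, keepdims=True)

topics_a = fit_topics(X)
topics_b = fit_topics(X[rng.permutation(len(X))])  # same docs, new order
# With identical seeds, any difference is due purely to document order.
print(np.abs(topics_a - topics_b).max())
```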

There are a number of major concerns though:

1a. Very closely related work exists:

A prior work has demonstrated suboptimal performance of topic modeling, in particular LDA, when default parameters are used, and proposed a solution that addresses the problem using search-based software engineering [7]. Thus, the novelty of the work seems limited. The paper does not describe why LDADE is better than Panichella et al.'s work. Panichella et al.'s work (LDAGA) should have been prominently highlighted in the introduction of the paper and a short discussion should be given as to why another search-based solution is needed. [7] A. Panichella, B. Dit, R. Oliveto, M. Di Penta, D. Poshyvanyk, A. De Lucia, How to effectively use topic models for software engineering tasks? an approach based on genetic algorithms, in: Proceedings of the 2013 International Conference on Software Engineering, IEEE Press, 2013, pp. 522–531.

1b. The experiments need to be improved in the following ways. First, LDA has been used to help many realistic software engineering tasks (for example, the tasks considered by papers included in Table 1). There is a need to expand the experiments to compare LDA and LDADE on those realistic software engineering tasks. It is unclear if the task considered in the experiments (Section 5.3) is realistic (why is categorizing StackExchange data into relevant and non-relevant categories useful?). More than one task should have been considered (similar to Panichella et al.'s work).

1c. Second, there is a need to compare Panichella's work (LDAGA) with LDADE on some realistic software engineering tasks. It is unclear if LDADE is better than LDAGA.

1d: The results reported in Section 5.3 are questionable. Why not use LDADE to tune all 3 parameters of LDA (k, alpha, beta)? If k is fixed and alpha and beta are fixed too, what parameter(s) are tuned then?

1e: Why is untuned (with k = 10) compared with tuned (with k = 20, 40, 80, and 200)? Shouldn't they be compared under the same setting of k?

1f: Is k-fold cross-validation used? Is the training data kept separate from the test data (including during the tuning phase)?
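To make the intended tuning protocol explicit in the reply, here is a minimal sketch of the train/test separation the reviewer asks about. The `tune` and `score` helpers are hypothetical stand-ins, not the paper's code:

```python
# Sketch of a leakage-free tuning protocol: parameters are tuned on the
# training fold only, and the held-out test fold is scored exactly once
# with the chosen setting.
from sklearn.model_selection import KFold

def tune(train_docs):              # hypothetical stand-in for LDADE's search
    return {"k": 10, "alpha": 0.1, "beta": 0.01}

def score(docs, params):           # hypothetical stand-in for the metric
    return len(docs) * params["k"] # placeholder arithmetic only

docs = list(range(100))            # toy "corpus"
fold_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(docs):
    params = tune([docs[i] for i in train_idx])  # test fold never seen here
    fold_scores.append(score([docs[i] for i in test_idx], params))
print(fold_scores)
```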

1g: Why F2 rather than F1 or F1/2? Is the difference statistically significant?
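For the reply, the F-beta definition makes the F2-versus-F1 trade-off concrete (the precision/recall values below are illustrative only):

```python
# F-beta generalizes F1: beta > 1 weights recall more heavily,
# beta < 1 weights precision more heavily.
def f_beta(precision, recall, beta):
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.5, 0.8                      # illustrative values
print(f_beta(p, r, 1.0))             # F1: balances precision and recall
print(f_beta(p, r, 2.0))             # F2: pulled toward recall
```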


1h: Minor issues:

- Why is there a need to evaluate across different platforms (Linux, Macintosh)? Wouldn't the platform be irrelevant?
- "Related work is revised" ~> "is reviewed"
- "S2" ~> "Section 2" (similarly for S1, S3, and so on)
- The period of time considered for drawing Table 1 needs to be mentioned.
- Should table headings be at the top of tables? Confirm with the IST guidelines.
- Reference 69 is incomplete.

Reviewer 2

SUMMARY: The authors empirically analyze the instability of LDA and the importance of tuning its parameters. Then the authors present LDADE, a search-based software engineering tool that tunes LDA parameters using DE (Differential Evolution). An empirical study indicates that, by using LDADE, the topic modeling technique is much more stable and reliable.

EVALUATION: This is a very interesting paper. It is in general well-written and easy to follow. I really like the goal of the paper. Having worked with LDA, I totally agree that using the technique as a "black box" is not recommended. So, the idea of using LDADE to (i) improve the stability of LDA and (ii) reduce the threats that could affect the validity of the results achieved due to an incorrect configuration of the technique is really important. Thus, I think that the paper has great potential.

Thank you for those kind words

2a: The first issue is related to the description of LDADE. I have to admit that Section 4.3 of the paper is quite difficult to follow. First of all, I strongly suggest that the authors provide a bird's-eye view of the approach before providing the details. What is the output of LDADE? How can I use such an output? If I understood correctly, the output of LDADE is just a set of topics (similar to the output of LDA). Is this correct? If so, the authors should explicitly mention this. Also, a usage scenario of LDADE could be worthwhile.

2b: In addition, I did not understand at all when and how LDADE varies the parameters of LDA (k, alpha and beta). LDA in LDADE is invoked through the function ldascore. However, in such a function k, alpha and beta are set to default values.

2c: I appreciated that the authors reported the pseudo-code of LDADE. However, I think that a much deeper description of each step is required. The authors should explain each step of the algorithm and, more importantly, should define each function/variable of the pseudo-code. For instance, Data is never instantiated in Algorithm 2 (probably because it is an input?). And what is l[0] in Algorithm 1? Another imprecision is related to the function ldascore. From Algorithm 1, ldascore takes as input n and Data. In Algorithm 2, ldascore is called at line 11 passing as parameter Si (a population?) and at line 12 Si, n and Data. Also, Cur_Gen is used as a matrix. However, when calling the method "add" on Cur_Gen, four parameters are passed. I understand that it is just pseudo-code. However, a much higher level of formality is required.

2d: I still have some doubts about how DE is used. Specifically, I would like to see many more details on the technique (to make the paper self-contained) and, more importantly, on the design of the approach. For instance, what is the encoding of each solution? What are the operators used to evolve the solutions? What is the quality/fitness function? Looking at the pseudo-code, it seems that the quality function is represented by ldascore. If so, why are scores encoded in the solution?

2e: The authors provide empirical evidence that LDA used with default configuration parameters is not stable. What if LDA is configured properly, for instance by using LDA-GA? In other words, what results are achieved if, in Algorithm 1, instead of using LDA, the authors use LDA-GA?

2f: Could the instability be due to an incorrect configuration of the LDA parameters? This is a critical point that should be addressed in the paper.

2g: Turning to the empirical evaluation, the main problem here is related to the lack of a deeper analysis of the results achieved. For instance, looking at Figure 6, it seems that the results achieved on "PitsC" are quite stable. Why? Do the authors have any clues as to why, on this particular dataset, the untuned LDA provides quite acceptable results? Here, a qualitative analysis of the results achieved could be worthwhile.

2h: A qualitative analysis could also be useful to better explain why LDADE provides many more benefits on VEM than on Gibbs. Note that this result also confirms the results achieved in the literature about the higher stability of Gibbs as compared to other implementations of LDA.

2i: The analysis of the Delta score is very interesting. However, in some cases the improvement in terms of stability is not so high. Here too, it could be worthwhile to discuss such cases in detail. In addition, is there any statistically significant difference between the achieved scores? An analysis of the results supported by statistical tests could strengthen the findings.
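One way to address the statistical-significance request is a two-sided permutation test on the difference of mean stability scores between tuned and untuned runs. The score lists below are illustrative placeholders, not the paper's results:

```python
# Permutation test sketch: is the mean difference between two groups of
# scores larger than chance relabeling would produce?
import random

tuned   = [0.82, 0.85, 0.80, 0.87, 0.84, 0.83, 0.86, 0.81]  # illustrative
untuned = [0.61, 0.70, 0.58, 0.66, 0.63, 0.69, 0.60, 0.64]  # illustrative

def mean(xs):
    return sum(xs) / len(xs)

random.seed(0)
observed = abs(mean(tuned) - mean(untuned))
pooled = tuned + untuned
trials, hits = 10_000, 0
for _ in range(trials):
    random.shuffle(pooled)                       # random relabeling
    diff = abs(mean(pooled[:8]) - mean(pooled[8:]))
    if diff >= observed:
        hits += 1
p_value = hits / trials
print(p_value)
```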

timm commented 7 years ago

24 out of 28 papers from Table 2 use unsupervised LDA for further analysis.

timm commented 7 years ago

1e

timm commented 7 years ago

1g

timm commented 7 years ago

21

timm commented 7 years ago

contributions of this

timm commented 7 years ago


timm commented 7 years ago
amritbhanu commented 7 years ago

reply pdf

updated_paper