Andrian Marcus, SE+text

Data type:

Motivation of Text Mining in SE:

SE Data: In many software projects the amount of unstructured information exceeds the size of the source code by an order of magnitude. Software artifacts written in natural language (e.g., requirements, design documents, user manuals, use case scenarios, bug reports, developers' messages), together with source code comments and identifiers, encode to a large degree the domain knowledge and the developers' knowledge; they capture the design, the application domain, developers' decisions, stakeholders' requirements, and the overall evolution of the software.
Retrieving and analyzing the textual information present in software is extremely important for supporting program comprehension and a variety of software evolution tasks.
Textual information can also be mined from internet-based sources such as Stack Overflow and app markets, and analyzed to gain new insights, build recommendation systems, or simply extract knowledge. The gathered information is then used to support processes and development activities.

20 different SE tasks

Message: We argue that the use of TR and NLP in software is one of the fastest growing areas of research in SE.

Tools:

Text Retrieval
- Vector Space Model (e.g. term frequency matrix)
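- Latent Semantic Analysis (applies SVD to the term-document matrix to find correlations between terms and documents)
- Latent Dirichlet Allocation (assumes each document is a mixture of topics and each topic is a distribution over words)
- Language Models (probabilistic models of text)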
Natural Language Processing
- part-of-speech tagging (marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph.)
- stemming (reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.)
- stopword elimination (stop words usually refer to the most common words in a language)
- semantic analysis (relating syntactic structures, from the level of phrases, clauses, sentences and paragraphs up to the level of the writing as a whole, to their language-independent meanings.)
- sentiment analysis (identify and extract subjective information in source materials, determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document)
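
The stopword elimination and stemming steps listed above can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the suffix rules are made-up assumptions, far cruder than a real stemmer such as Porter's, and the function names are hypothetical.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "when"}

# Suffixes stripped by the naive stemmer, checked in this order
# (a hypothetical rule set, not a real stemming algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The parser crashes when parsing the nested comments"))
# prints ['parser', 'crash', 'pars', 'nest', 'comment']
```

Note that "parser" and "parsing" end up with different stems here, which is the kind of mismatch a proper stemmer is designed to reduce.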

Task: Detect link between code and document

Method: VSM (vector space model), LSI, SVD
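
As a rough sketch of the VSM part of this method, the snippet below represents a source file and each document as a term-frequency vector and ranks the documents by cosine similarity. It covers only the plain vector space model, not LSI/SVD, and the token lists and document names are invented for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(code_tokens, docs):
    """Rank documents (name -> token list) by similarity to the code tokens."""
    query = Counter(code_tokens)
    scored = [(name, cosine(query, Counter(tokens))) for name, tokens in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical tokens taken from the identifiers and comments of one source file.
code_tokens = ["graph", "node", "edge", "shortest", "path"]

# Hypothetical, already preprocessed documentation sections.
docs = {
    "sorting_manual": ["sort", "array", "compare"],
    "graph_manual": ["graph", "node", "edge", "graph"],
}

ranked = rank_documents(code_tokens, docs)
print(ranked)
```

A candidate link would then be reported for every pair whose score exceeds some chosen threshold; the threshold is a tuning decision that these notes do not specify.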

Data: LEDA (Library of Efficient Data types and Algorithms)

Feature location: search+impact analysis

Decision making: LSI and scenario-based probabilistic ranking (SPR), decision fusion
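
One simple way to picture decision fusion is a weighted combination of the scores that two techniques assign to each method. The sketch below assumes min-max normalization and a single `weight` parameter; both are illustrative assumptions, as are the scores, and this is not necessarily the fusion scheme used in the reviewed work.

```python
def min_max(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


def fuse(lsi_scores, spr_scores, weight=0.5):
    """Combine two score dicts; `weight` is the share given to the LSI scores."""
    a, b = min_max(lsi_scores), min_max(spr_scores)
    methods = set(a) | set(b)
    fused = {
        m: weight * a.get(m, 0.0) + (1 - weight) * b.get(m, 0.0)
        for m in methods
    }
    # Sort by descending fused score, breaking ties by method name.
    return sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical per-method scores from the two techniques.
lsi = {"parse": 0.9, "render": 0.4, "save": 0.1}
spr = {"parse": 0.2, "render": 0.8, "save": 0.1}

for method, score in fuse(lsi, spr):
    print(method, round(score, 3))
```

With equal weights, "render" ranks first because it scores reasonably under both techniques, while "parse" is strong only under LSI.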

Data: Mozilla, Eclipse, find bugs...

Task: Code labeling (generate doc from source code)

Method: a) stereotype identification, b) filtering of methods, c) text generation
