unstructured information (e.g., natural language text).
Motivation of Text Mining in SE:
SE Data: In many software projects, the amount of unstructured information exceeds the size of the source code by an order of magnitude. Software artifacts written in natural language (e.g., requirements, design documents, user manuals, use case scenarios, bug reports, and developers' messages), together with source code comments and identifiers, encode to a large degree the domain knowledge and the developers' knowledge; they capture the design, the application domain, developers' decisions, stakeholders' requirements, and the overall progress of the software.
Retrieving and analyzing the textual information present in software is extremely important for supporting program comprehension and a variety of software evolution tasks.
Textual information can also be mined from internet-based sources, such as Stack Overflow and app markets, and analyzed to gain new insights, build recommendation systems, or simply extract knowledge. The gathered information is then used to support processes and development activities.
20 different SE tasks, including:
traceability link recovery
concern/concept/feature/bug location
software search
change impact analysis
requirements analysis
bug triage
refactoring
defect prediction
software redocumentation
Message: We argue that the use of TR and NLP in software is one of the fastest growing areas of research in SE.
Tools:
Text Retrieval
Vector Space Model (e.g., term frequency matrix)
Latent Semantic Analysis (uses SVD to find correlations between terms and documents)
Latent Dirichlet Allocation (assumes documents are generated from a probability distribution over topics)
Language Models (probabilistic models)
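To make the Vector Space Model entry concrete, here is a minimal sketch that builds a term frequency matrix and ranks artifacts against a query by cosine similarity. The corpus, the whitespace tokenizer, and the function names are illustrative assumptions, not taken from the papers summarized in these notes.

```python
import math
from collections import Counter


def term_frequency_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical mini-corpus: one query followed by two artifacts.
corpus = [
    "open file dialog",             # query
    "file dialog open save file",   # artifact A
    "parse network packet",         # artifact B
]
vocabulary, tf = term_frequency_matrix(corpus)
query_vector = tf[0]
scores = [cosine_similarity(query_vector, row) for row in tf[1:]]
```

Artifact A shares three terms with the query and so scores higher than artifact B, which shares none. Real systems usually weight the raw counts (for example with tf-idf) before comparing vectors.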
Natural Language Processing
part-of-speech tagging (marking each word in a text (corpus) with its part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph)
stemming (reducing inflected (or sometimes derived) words to their word stem, base, or root form, generally a written word form)
stopword elimination (removing stop words, which are usually the most common words in a language)
semantic analysis (relating syntactic structures, from the level of phrases, clauses, sentences, and paragraphs up to the level of the writing as a whole, to their language-independent meanings)
sentiment analysis (identifying and extracting subjective information in source materials to determine the attitude of a speaker or writer toward some topic, or the overall contextual polarity of a document)
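Two of the preprocessing steps above, stopword elimination and stemming, can be sketched in a few lines. The stopword list and the suffix-stripping rule are deliberately tiny assumptions made for illustration; a real pipeline would use a full stopword list and an established stemmer such as Porter's.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "is", "of", "to", "in", "and", "when"}


def naive_stem(word):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on letters, drop stopwords, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOPWORDS]


terms = preprocess("The parser crashes when loading the saved files")
```

The resulting stems (such as "sav" and "fil") need not be dictionary words; what matters for retrieval is that different inflections of the same word map to the same term.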
The use of text retrieval and natural language processing in software engineering
Recovering Documentation-to-Source-Code Traceability Links using Latent Semantic Indexing
Task: Detect links between source code and documentation
Method: VSM (vector space model), LSI, SVD
Data: LEDA (Library of Efficient Data types and Algorithms)
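Once document-to-code similarities are available (for example, cosine similarities in the reduced LSI space), candidate traceability links can be selected with a similarity threshold. The sketch below assumes the similarities are already computed; the artifact names, scores, and threshold are hypothetical and do not come from the LEDA study.

```python
def recover_links(similarity, threshold):
    """Return (document, source, score) triples with score >= threshold, best first."""
    links = [
        (doc, src, score)
        for doc, row in similarity.items()
        for src, score in row.items()
        if score >= threshold
    ]
    return sorted(links, key=lambda link: link[2], reverse=True)


# Hypothetical similarities between manual sections and source files.
similarity = {
    "manual_graphs": {"graph_src": 0.82, "list_src": 0.15},
    "manual_lists": {"graph_src": 0.10, "list_src": 0.74},
}
links = recover_links(similarity, threshold=0.7)
```

Raising the threshold tends to trade recall for precision, so in practice the cut-off is tuned per project or replaced with a top-k rule.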
Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval
Feature location: search + impact analysis
Decision making: LSI and scenario-based probabilistic ranking (SPR), decision fusion
Data: Mozilla, Eclipse, find bugs...
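One simple way to picture decision fusion is a weighted sum of two normalized score lists, one from LSI and one from SPR. This is only a sketch under that assumption: the notes do not give the paper's exact combination formula, and the method names, scores, and weight below are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; all zeros if every score is identical."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {name: 0.0 for name in scores}
    return {name: (value - low) / (high - low) for name, value in scores.items()}


def fuse_rankings(lsi_scores, spr_scores, lsi_weight=0.5):
    """Rank methods by a weighted sum of normalized LSI and SPR scores.

    Assumes both dictionaries score the same set of methods.
    """
    lsi = min_max_normalize(lsi_scores)
    spr = min_max_normalize(spr_scores)
    combined = {
        method: lsi_weight * lsi[method] + (1 - lsi_weight) * spr[method]
        for method in lsi
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical scores for three methods from each technique.
lsi_scores = {"openFile": 0.9, "saveFile": 0.6, "render": 0.1}
spr_scores = {"openFile": 0.2, "saveFile": 0.8, "render": 0.1}
ranking = fuse_rankings(lsi_scores, spr_scores)
```

With equal weights, saveFile ranks first because it scores well under both techniques, even though openFile has the highest LSI score alone.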
Automatic generation of natural language summaries for Java classes
Task: Code labeling (generate documentation from source code)
Method: a) stereotype identification, b) method filtering, c) text generation