azhe825 opened this issue 8 years ago
DPC: which categories document i belongs to; new documents may come in. CPC: which documents category j contains; new categories may be added.
Output: hard vs. ranking. Ranking is especially useful when a human expert makes the final decision.
Application:
DIA: more features, e.g., tf-idf and the locations of term j in document i.
Dimensionality Reduction:
Document similarity: cosine distance
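A minimal sketch of cosine similarity, assuming sparse tf-idf vectors stored as dicts (toy weights, not from any paper); cosine distance is 1 minus this value.

```python
import math

def cosine_similarity(a, b):
    # a, b: sparse term-weight vectors as dicts {term: tf-idf weight}
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

d1 = {"text": 0.5, "mining": 0.8}
d2 = {"text": 0.5, "classification": 0.7}
sim = cosine_similarity(d1, d2)       # shared "text" term gives partial overlap
dist = 1.0 - sim                      # cosine distance
```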
Online: perceptron, WINNOW
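A sketch of the online perceptron over bag-of-words features (toy stream, illustrative data): it is mistake-driven and updates additively; WINNOW differs in using multiplicative updates.

```python
def perceptron_update(w, x, y, lr=1.0):
    """One online step. w: {term: weight}, x: {term: count}, y in {-1, +1}."""
    score = sum(w.get(t, 0.0) * v for t, v in x.items())
    pred = 1 if score >= 0 else -1
    if pred != y:                          # mistake-driven: update only on errors
        for t, v in x.items():
            w[t] = w.get(t, 0.0) + lr * y * v
    return pred

w = {}
stream = [({"good": 1, "movie": 1}, 1), ({"bad": 1, "movie": 1}, -1)]
for epoch in range(3):                     # replay the tiny stream a few times
    for x, y in stream:
        perceptron_update(w, x, y)
# After training, "good" carries positive weight and "bad" negative weight.
```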
Evaluation: micro-averaging (μ) vs. macro-averaging (M), for precision and recall
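A sketch of the difference using precision (toy counts): macro-averaging averages per-category precision so every category counts equally, while micro-averaging pools the contingency counts first, so large categories dominate on skewed data.

```python
def micro_macro_precision(per_class):
    # per_class: list of (tp, fp) counts, one pair per category
    precisions = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp in per_class]
    macro = sum(precisions) / len(per_class)   # average of per-class precision
    tp_sum = sum(tp for tp, _ in per_class)
    fp_sum = sum(fp for _, fp in per_class)
    micro = tp_sum / (tp_sum + fp_sum)         # pool counts, then compute once
    return micro, macro

# Skewed example: one large easy class, one small hard class.
micro, macro = micro_macro_precision([(90, 10), (1, 9)])
# micro is pulled up by the large class; macro is dragged down by the small one.
```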
Feature Selection:
DF, IG, and CHI are strongly correlated; DF is the cheapest to compute and can be used in place of CHI and IG.
Notice that DF is unsupervised while IG and CHI are supervised. Is there really no better supervised feature selection?
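A sketch of the two extremes being compared (toy inputs): DF is just a document count and ignores labels, while the χ² statistic for a term/category pair is computed from the standard 2×2 contingency table.

```python
def chi_square(a, b, c, d):
    """chi^2(t, c) from a 2x2 contingency table:
    a: docs in category containing term,  b: docs outside category containing term,
    c: docs in category without term,     d: docs outside category without term."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def doc_frequency(term, docs):
    # DF: number of documents containing the term; unsupervised and cheap
    return sum(term in doc for doc in docs)
```

Independence between term and category gives χ² = 0; stronger association gives a larger value, up to N for a perfect split.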
Performance Metrics: Precision and Recall
Classifiers compared: SVM, k-NN, LLSF (linear least squares fit, F = arg min_F ||FA − B||²), ANN, and NB.
SVM, k-NN, and LLSF work better on small data sets.
Macro-averaged and micro-averaged F scores are both reported in support of this.
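A minimal sketch of LLSF on synthetic data (shapes and variable names are illustrative): fit F = arg min_F ||FA − B||² by ordinary least squares, where A holds term features per training document and B the category indicators, then score an unseen document.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 20))     # term-by-document feature matrix (5 terms, 20 docs)
B = rng.random((3, 20))     # category-by-document target matrix (3 categories)

# min_F ||F A - B||^2  is equivalent to  min_X ||A^T X - B^T||^2  with X = F^T,
# which np.linalg.lstsq solves directly.
F = np.linalg.lstsq(A.T, B.T, rcond=None)[0].T

x_new = rng.random(5)       # feature vector of an unseen document
scores = F @ x_new          # one score per category; rank or threshold these
```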
Skewed data sets (Imbalance)
Significance tests
Key idea: embedding of text regions (a word embedding is the special case where the region is a single word).
In its simplest form, one-hot CNN works as follows. A document is represented as a sequence of one-hot vectors (each of which indicates a word by the position of a 1); a convolution layer converts small regions of the document (e.g., “I love it”) to low-dimensional vectors at every location (embedding of text regions); a pooling layer aggregates the region embedding results to a document vector by taking component-wise maximum or average; and the top layer classifies a document vector with a linear model (Figure 1). The one-hot CNN and its semi-supervised extension were shown to be superior to a number of previous methods
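A toy forward pass for the pipeline described above, with untrained random weights and hypothetical sizes: because the region input is a concatenation of one-hot vectors, the convolution reduces to summing selected rows of the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, n_classes, region = 10, 4, 2, 3   # toy sizes, not from the paper

W = rng.standard_normal((region * vocab, dim)) * 0.1  # convolution weights
b = np.zeros(dim)
U = rng.standard_normal((dim, n_classes)) * 0.1       # top linear classifier
c = np.zeros(n_classes)

def forward(doc):
    """doc: list of word ids. One-hot CNN forward pass (untrained sketch)."""
    regions = []
    for i in range(len(doc) - region + 1):
        h = b.copy()
        for p in range(region):
            # multiplying W by a one-hot block = picking one row of W
            h += W[p * vocab + doc[i + p]]
        regions.append(np.maximum(h, 0.0))    # ReLU region embedding
    v = np.max(regions, axis=0)               # max-pooling -> document vector
    return v @ U + c                          # linear model -> class scores

scores = forward([1, 4, 2, 7, 3])             # 5 words -> 3 regions -> 2 scores
```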
Shortcoming: the region size is fixed.
LSTM is quite new in text mining (2014)
Data sets: IMDB, Elec, RCV1 (second-level topics), and 20-newsgroup (20NG)
Part of the model can be trained on unlabeled data.
Feature selection: genetic algorithm (GA)
More than 80% of information is stored as text (Korde & Mahender, 2012); therefore, text categorization is an important task in machine learning and data mining for organizing a massive amount of information (Yun, Jing, Yu, & Huang, 2012).
Classifier: Associative Classification
Feature selection: filter approach vs. wrapper approach (treats feature selection as an optimization problem)
Evaluation: NB performance vs. feature dimensionality
multi-label text classification
Extreme learning machine (ELM)
group sparsity, L2,1-norm
ANNs are more active in text mining these days.