azhe825 / Literature-Review


Text Categorization #1

Open azhe825 opened 8 years ago

azhe825 commented 8 years ago

DPC (document-pivoted categorization): for document i, find which categories it belongs to; natural when new documents keep arriving. CPC (category-pivoted categorization): for category j, find which documents it contains; natural when new categories may be added.

Output: hard vs. ranking. Ranking is especially useful when a human expert makes the final decision.

Application:

  1. Boolean systems (document indexing: assign keywords from a fixed controlled vocabulary to each document). Multi-label, multi-class classification.
  2. Document organization. Single-label, multi-class classification.
  3. Text filtering. Binary classification.
  4. Word sense disambiguation. Context-dependent.
  5. Hierarchical categorization of web pages. Ranking, CPC; hyperlinks are useful features.

DIA (Darmstadt Indexing Approach): richer features, e.g. tf-idf plus the locations of term j in document i.

Dimensionality Reduction:

  1. Document frequency (DF)
  2. Supervised, probabilistic: P(c|t) = P(t|c)·P(c) / (P(t|c)·P(c) + P(t|c-)·P(c-)) (see the sketch after this list)
  3. Term clustering
  4. LSI (latent semantic indexing); not interpretable
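A tiny numpy sketch of the supervised score in item 2, estimating P(c|t) from per-class document frequencies via Bayes' rule; the counts in the example are made-up assumptions.

```python
import numpy as np

def p_c_given_t(df_pos, df_neg, n_pos, n_neg):
    """P(c|t) = P(t|c)P(c) / (P(t|c)P(c) + P(t|c-)P(c-))."""
    p_c, p_cbar = n_pos / (n_pos + n_neg), n_neg / (n_pos + n_neg)
    p_t_c = df_pos / n_pos        # fraction of class-c documents containing t
    p_t_cbar = df_neg / n_neg     # fraction of other documents containing t
    return (p_t_c * p_c) / (p_t_c * p_c + p_t_cbar * p_cbar)

# toy counts: term appears in 40 of 100 positive docs and 5 of 900 negative docs
print(p_c_given_t(40, 5, 100, 900))   # high value => term is indicative of c
```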

Document similarity: cosine distance
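For the cosine-distance similarity above, a minimal scikit-learn sketch on tf-idf document vectors; the three toy documents are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "the cat lay on the rug",
        "stock markets fell sharply today"]

X = TfidfVectorizer().fit_transform(docs)   # documents as tf-idf vectors
print(cosine_similarity(X))                 # pairwise cosine similarities
```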

Online: perceptron, WINNOW
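A minimal sketch of an online (one-document-at-a-time) perceptron of the kind referenced above, on a toy bag-of-words stream; the data and learning rate are assumptions.

```python
import numpy as np

def perceptron_online(stream, n_features, epochs=1, lr=1.0):
    """stream: iterable of (x, y) with x a feature vector and y in {-1, +1}."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y in stream:
            if y * (w @ x) <= 0:       # mistake-driven update
                w += lr * y * x
    return w

# toy usage: two "documents" over a 4-term vocabulary
docs = [(np.array([1., 0., 2., 0.]), +1),
        (np.array([0., 3., 0., 1.]), -1)]
w = perceptron_online(docs, n_features=4, epochs=5)
print(np.sign(docs[0][0] @ w), np.sign(docs[1][0] @ w))
```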

Evaluation: micro-averaging (μ) vs. macro-averaging (M), for precision and recall.
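A quick sketch contrasting micro- and macro-averaged precision/recall (and F1) with scikit-learn; the toy label vectors are assumptions.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 2]   # three categories with very different sizes
y_pred = [0, 0, 0, 1, 1, 2, 2]

for avg in ("micro", "macro"):
    p = precision_score(y_true, y_pred, average=avg)
    r = recall_score(y_true, y_pred, average=avg)
    f = f1_score(y_true, y_pred, average=avg)
    print(f"{avg}: P={p:.2f} R={r:.2f} F1={f:.2f}")
# Micro-averaging pools all decisions (dominated by frequent categories);
# macro-averaging averages per-category scores (rare categories count equally).
```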

azhe825 commented 8 years ago

Feature Selection:

  1. Document frequency (DF), unsupervised
  2. Information gain (IG) +, supervised: G(t) = -Σ_c P(c)·log P(c) + P(t)·Σ_c P(c|t)·log P(c|t) + P(t-)·Σ_c P(c|t-)·log P(c|t-)
  3. Mutual information (MI), supervised: I(t,c) = log( P(t,c) / (P(t)·P(c)) )
  4. χ² test (CHI) +
  5. Term strength (TS)

DF, IG, and CHI are strongly correlated; DF is the cheapest to compute and can be used in place of CHI and IG.
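To illustrate the DF-vs-CHI point, a small scikit-learn sketch computing document frequency (unsupervised) and the chi-squared statistic (supervised) per term; the toy spam corpus and labels are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = ["cheap meds online", "cheap deals online now",
        "meeting moved to friday", "project status meeting notes"]
labels = [1, 1, 0, 0]                      # 1 = spam, 0 = not spam

vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)

df = np.asarray(X.sum(axis=0)).ravel()     # document frequency per term
chi2_scores, _ = chi2(X, labels)           # chi-squared score per term

for term, d, c in zip(vec.get_feature_names_out(), df, chi2_scores):
    print(f"{term:10s} DF={d}  CHI={c:.2f}")
```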

Notice that DF is unsupervised while IG and CHI are supervised. Is supervised feature selection not actually better here?

Performance Metrics: Precision and Recall

azhe825 commented 8 years ago

Classifiers compared: SVM, k-NN, LLSF (linear least squares fit, F = arg min ||FA - B||^2), ANN, NB.
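A small numpy sketch of the LLSF objective above, F = arg min ||FA - B||^2, where the columns of A are training document vectors and the columns of B their category indicator vectors; the shapes and random toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms, n_docs, n_cats = 50, 30, 4
A = rng.random((n_terms, n_docs))                         # term-by-document matrix (e.g. tf-idf)
B = (rng.random((n_cats, n_docs)) > 0.7).astype(float)    # category-by-document labels

# F A ~= B  <=>  A^T F^T ~= B^T, solved column-by-column by least squares
F = np.linalg.lstsq(A.T, B.T, rcond=None)[0].T            # shape (n_cats, n_terms)

new_doc = rng.random((n_terms, 1))
scores = F @ new_doc          # ranking scores over categories for an unseen document
print(scores.ravel())
```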

SVM, k-NN, and LLSF work better for small data sets.

Both macro-averaged and micro-averaged F scores support this finding.

Skewed data sets (Imbalance)

Significance tests

azhe825 commented 8 years ago

Key idea: embedding of text regions (as opposed to word embedding, where the vector represents a single word).

In its simplest form, one-hot CNN works as follows. A document is represented as a sequence of one-hot vectors (each of which indicates a word by the position of a 1); a convolution layer converts small regions of the document (e.g., “I love it”) to low-dimensional vectors at every location (embedding of text regions); a pooling layer aggregates the region embedding results to a document vector by taking component-wise maximum or average; and the top layer classifies a document vector with a linear model (Figure 1). The one-hot CNN and its semi-supervised extension were shown to be superior to a number of previous methods
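A minimal PyTorch sketch of the one-hot CNN pipeline just described (one-hot word vectors, convolution as region embedding, max pooling, linear classifier); the vocabulary size, region size, embedding dimension, and all-zero toy input are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class OneHotCNN(nn.Module):
    def __init__(self, vocab_size=10000, region_size=3, embed_dim=100, num_classes=2):
        super().__init__()
        # Convolution over the one-hot sequence = "embedding of text regions":
        # each window of `region_size` words maps to an embed_dim vector.
        self.conv = nn.Conv1d(vocab_size, embed_dim, kernel_size=region_size)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, one_hot_docs):
        # one_hot_docs: (batch, vocab_size, doc_length), each column a one-hot word vector
        regions = torch.relu(self.conv(one_hot_docs))   # (batch, embed_dim, positions)
        doc_vec = regions.max(dim=2).values             # component-wise max pooling
        return self.fc(doc_vec)                         # linear classifier on the document vector

# toy forward pass (all-zero "documents", just to check shapes)
model = OneHotCNN()
docs = torch.zeros(4, 10000, 50)
print(model(docs).shape)   # torch.Size([4, 2])
```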

Shortcoming: the region size is fixed.

LSTM is still quite new in text mining (around 2014).


Data sets: IMDB, Elec, RCV1 (second-level topics), and 20-newsgroup (20NG)

Part of the model can be trained on unlabeled data.

azhe825 commented 8 years ago

Feature selection with a genetic algorithm (GA)

More than 80% of information is stored as text (Korde & Mahender, 2012); therefore, text categorization is an important task in machine learning and data mining for organizing a massive amount of information (Yun, Jing, Yu, & Huang, 2012).

Classifier: Associative Classification

Feature selection: filter approach vs. wrapper approach (feature selection treated as an optimization problem)

Evaluation: NB performance as a function of feature dimension
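A filter-approach sketch matching this evaluation: select the top-k terms by chi-squared and report naive Bayes cross-validated accuracy as the feature dimension k varies, using scikit-learn's 20 Newsgroups loader; the category subset and k values are assumptions.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# downloads the corpus on first use
data = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "rec.autos", "comp.graphics"])

for k in (100, 500, 2000, 5000):
    model = make_pipeline(CountVectorizer(),
                          SelectKBest(chi2, k=k),   # keep only the top-k terms
                          MultinomialNB())
    acc = cross_val_score(model, data.data, data.target, cv=3).mean()
    print(f"k={k:5d}  accuracy={acc:.3f}")
```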

azhe825 commented 8 years ago

multi-label text classification

Extreme learning machine (ELM)

Group sparsity, L2,1-norm
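A tiny numpy sketch of the L2,1 (group-sparsity) norm: the sum of the row-wise L2 norms of a weight matrix, which as a regularizer drives whole rows (feature groups) to zero; the toy matrix is an assumption.

```python
import numpy as np

W = np.array([[0.0, 0.0, 0.0],    # an all-zero row contributes nothing
              [1.0, 2.0, 2.0],
              [0.5, 0.0, 0.0]])

l21 = np.sum(np.linalg.norm(W, axis=1))   # sum of row-wise L2 norms
print(l21)                                # 0 + 3 + 0.5 = 3.5
```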

ANNs are more active in text mining these days.