ai-se / ZheYu

Zhe's works
Literature Text Categorization #6

Closed by azhe825 8 years ago

azhe825 commented 8 years ago

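DPC (document-pivoted categorization): given document i, find the categories it belongs to; fits settings where new documents keep coming in. CPC (category-pivoted categorization): given category j, find the documents it contains; fits settings where new categories may be added.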
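A minimal sketch of the two views, assuming category assignments are stored as a plain dict from document id to a set of labels. The names `assignments`, `categories_of` and `documents_in` are made up for illustration.

```python
# Hypothetical toy data: document id -> set of assigned categories.
assignments = {
    "d1": {"sports"},
    "d2": {"sports", "politics"},
    "d3": set(),
}

def categories_of(doc_id):
    """DPC view: fix a document, list its categories."""
    return assignments.get(doc_id, set())

def documents_in(category):
    """CPC view: fix a category, list the documents it contains."""
    return {d for d, cats in assignments.items() if category in cats}

print(sorted(categories_of("d2")))   # ['politics', 'sports']
print(sorted(documents_in("sports")))  # ['d1', 'd2']
```

Both functions read the same data; the difference is only which axis is held fixed.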

Output: hard decision vs. ranking. Ranking is especially useful when a human expert makes the final decision.
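A small sketch of the two output styles, assuming a classifier has already produced one score per category. The scores and the 0.5 threshold are made-up values for illustration.

```python
def hard_decision(scores, threshold=0.5):
    """Hard output: the set of categories whose score meets the threshold."""
    return {c for c, s in scores.items() if s >= threshold}

def ranking(scores):
    """Ranked output: categories from most to least likely, left for a human to cut off."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical classifier scores for one document.
scores = {"sports": 0.82, "politics": 0.41, "finance": 0.07}
print(hard_decision(scores))  # {'sports'}
print(ranking(scores))        # ['sports', 'politics', 'finance']
```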

Application:

  1. Boolean systems (document indexing: assigning keywords from a fixed vocabulary to a document). Multi-label, multi-class.
  2. Document organization. Single-label, multi-class.
  3. Text filtering. Binary classification.
  4. Word sense disambiguation. Depends on context.
  5. Hierarchical categorization of web pages. Ranking, CPC; links are useful.

DIA (Darmstadt Indexing Approach): uses more features, e.g. tf and idf, and the locations of term j in document i.
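As a rough illustration of the tf and idf features mentioned above, here is one common tf-idf variant (raw term count times log(N / df)). The exact weighting formula is an assumption; the notes do not say which variant is meant.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["ball", "goal", "ball"], ["goal", "vote"]]
w = tfidf(docs)
# "goal" appears in every document, so its weight is 0 under this variant.
```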

Dimensionality Reduction:

  1. df (document frequency): drop terms that appear in too few documents.
  2. supervised. probability. $P(C \mid t) = \frac{P(t \mid C)\,P(C)}{P(t \mid C)\,P(C) + P(t \mid \bar{C})\,P(\bar{C})}$
  3. term clustering
  4. LSI (latent semantic indexing). The resulting dimensions are not interpretable.
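A minimal sketch of the first two reduction ideas in the list above: a document-frequency cutoff, and the Bayes-rule posterior of a category given a term. The function names and the `min_df` cutoff are illustrative assumptions, and the probabilities are passed in directly rather than estimated from data.

```python
from collections import Counter

def df_filter(docs, min_df=2):
    """Keep only terms that occur in at least min_df documents."""
    df = Counter(t for doc in docs for t in set(doc))
    return {t for t, n in df.items() if n >= min_df}

def posterior(p_t_given_c, p_c, p_t_given_not_c):
    """P(C|t) by Bayes' rule, with P(not C) = 1 - P(C)."""
    num = p_t_given_c * p_c
    den = num + p_t_given_not_c * (1.0 - p_c)
    return num / den if den > 0 else 0.0

docs = [["a", "b"], ["a", "c"], ["a", "b", "b"]]
kept = df_filter(docs, min_df=2)   # "c" occurs in only one document and is dropped
score = posterior(0.8, 0.5, 0.2)   # 0.4 / (0.4 + 0.1)
```

Terms whose posterior is close to the prior P(C) say little about the category and are candidates for removal.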

Document similarity: cosine similarity
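A short sketch of cosine similarity between two documents, assuming each document is a sparse {term: weight} dict. The corresponding cosine distance is 1 minus this value.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)

d1 = {"ball": 2.0, "goal": 1.0}
d2 = {"vote": 1.0, "party": 3.0}
print(cosine_similarity(d1, d2))  # 0.0, no shared terms
```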

Online: perceptron, WINNOW
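A rough sketch of one update step for each online learner, assuming binary (0/1) term features and labels in {+1, -1}. The learning rate, the promotion factor `alpha`, and the default threshold of half the feature count are illustrative choices, not values from the notes.

```python
def perceptron_update(w, x, y, lr=1.0):
    """Perceptron: additive correction, applied only when the prediction is wrong."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    pred = 1 if score > 0 else -1
    if pred != y:
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

def winnow_update(w, x, y, alpha=2.0, theta=None):
    """Winnow: multiplicative correction on the active features only."""
    if theta is None:
        theta = len(x) / 2  # assumed default threshold
    score = sum(wi * xi for wi, xi in zip(w, x))
    pred = 1 if score >= theta else -1
    if pred == -1 and y == 1:      # missed a positive: promote active weights
        w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
    elif pred == 1 and y == -1:    # false alarm: demote active weights
        w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
    return w

w_p = perceptron_update([0.0, 0.0], [1, 0], 1)
w_w = winnow_update([1.0, 1.0, 1.0, 1.0], [1, 0, 0, 0], 1)
```

Both process one example at a time, so the classifier can keep learning as new documents arrive.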

Evaluation: microaveraging (μ) and macroaveraging (M), for precision and recall.
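A small sketch of the two averaging schemes, assuming per-category counts of true positives, false positives and false negatives are already available. The counts below are made up to show how a rare category pulls the macroaverage down.

```python
def ratio(a, b):
    """a / b, or 0.0 when the denominator is zero."""
    return a / b if b else 0.0

def micro_macro(counts):
    """counts: {category: (tp, fp, fn)}. Returns micro/macro precision and recall."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    n = len(counts)
    return {
        # Micro: pool all decisions, then compute one ratio.
        "micro_precision": ratio(tp, tp + fp),
        "micro_recall": ratio(tp, tp + fn),
        # Macro: compute one ratio per category, then average them.
        "macro_precision": ratio(sum(ratio(t, t + p) for t, p, _ in counts.values()), n),
        "macro_recall": ratio(sum(ratio(t, t + f) for t, _, f in counts.values()), n),
    }

# Hypothetical counts: one frequent category, one rare one.
counts = {"frequent": (90, 10, 10), "rare": (1, 1, 9)}
result = micro_macro(counts)
```

Microaveraging is dominated by the frequent category, while macroaveraging gives every category equal weight.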