Identification of Labeled Data Sources

mrpozzi commented 7 years ago

This entails both a literature review to understand how existing datasets have been labeled and some data mining to find already-labeled text that can serve as a basis for building a sentiment index.

mkao006 commented 7 years ago

Some Reuters corpora have already been labelled. Maybe we can build a model based on these data.

mkao006 commented 7 years ago

Here is a list of datasets for me to investigate.

[x] WordNet and related projects
[x] eXtended WordNet
[x] WordNet 2.0
[ ] Movie review data
[ ] NLP-related datasets on data hub
[ ] The Ontologies of Linguistic Annotations

Notes:

eXtended WordNet extends WordNet and focuses more on concepts and topics for knowledge extraction.
WordNet 2.0 is just a conversion of WordNet to RDF/OWL.

Also conduct literature reviews on these datasets to see how they are labeled.

Potential papers:

[ ] An Effective and Robust Method for Short Text Classification (Reuters)
[x] Experiments with multi-label text classifier on the Reuters collection
- If I understood the paper correctly, the algorithm is similar to a nearest-neighbor method in which each topic's term-frequency vector is scaled towards the centroid of the documents that have been classified under that topic. That is, the topic vector is matched to the centroid of its articles.
[ ] Performance Measurement Framework for Hierarchical Text Classification (Reuters)
[x] A systematic analysis of performance measures for classification tasks
- The paper discusses different performance measures, such as accuracy and precision, and their invariance properties. Which measure is preferable will depend on how we define the problem and the goal.
[ ] Machine Learning in Automated Text Categorization (50 pages)
[ ] Combining NLP and probabilistic categorisation for document and term selection for Swiss-Prot medical annotation
[ ] Sentiment Analyzer: Extracting Sentiments about a Given Topic using Natural Language Processing Techniques
[ ] Thumbs up? Sentiment Classification using Machine Learning Techniques
[ ] Information Extraction as a Basis for High-Precision Text Classification
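
To make the two paper notes above more concrete, here is a minimal sketch of a plain nearest-centroid text classifier, followed by accuracy and per-topic precision on a toy test set. This is not the algorithm from the Reuters paper (which is multi-label and scales the topic vectors); it only illustrates the general idea of matching a document to a topic centroid. The documents, topic names, whitespace tokenizer, and all function names are made up for illustration.

```python
from collections import Counter
import math


def tf_vector(text):
    """Term-frequency vector over lowercase whitespace tokens (an assumed, naive tokenizer)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average the term-frequency vectors of the documents in one topic."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(labeled_docs):
    """Build one centroid per topic from (text, topic) pairs."""
    by_topic = {}
    for text, topic in labeled_docs:
        by_topic.setdefault(topic, []).append(tf_vector(text))
    return {topic: centroid(vectors) for topic, vectors in by_topic.items()}


def predict(centroids, text):
    """Assign the single topic whose centroid is most similar to the document."""
    vec = tf_vector(text)
    return max(centroids, key=lambda topic: cosine(vec, centroids[topic]))


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted topic equals the true topic."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, label):
    """Of the documents predicted as `label`, the fraction that truly are `label`."""
    true_of_predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    if not true_of_predicted:
        return 0.0
    return sum(t == label for t in true_of_predicted) / len(true_of_predicted)


# Toy labeled data (invented for this sketch, not taken from any Reuters corpus).
train_docs = [
    ("oil prices rise as crude supply falls", "energy"),
    ("crude oil output cut by producers", "energy"),
    ("wheat harvest larger than expected", "grain"),
    ("grain exports and wheat prices fall", "grain"),
]
test_docs = [
    ("crude oil supply tightens", "energy"),
    ("wheat and grain shipments delayed", "grain"),
    ("prices fall", "energy"),  # ambiguous on purpose
]

centroids = train(train_docs)
y_true = [topic for _, topic in test_docs]
y_pred = [predict(centroids, text) for text, _ in test_docs]

print("predicted:", y_pred)
print("accuracy:", accuracy(y_true, y_pred))
for topic in sorted(centroids):
    print("precision for", topic, "=", precision(y_true, y_pred, topic))
```

On this toy set the overall accuracy and the per-topic precision values differ from each other, which is the point of the second note: which number we report should follow from how we define the problem.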

EST-Team-Adam / TheReadingMachine

Identification of Labeled Data Sources #10