Paper Review: Classifying Domain-Specific Terms Using a Dictionary

Publisher

ACL Anthology

Link to The Paper

https://aclanthology.org/U11-1009.pdf

Name of The Authors

Su Nam Kim, Lawrence Cavedon

Year of Publication

2011

Summary

The paper focuses on assigning domain concepts to domain-specific terms, a task similar to building a taxonomy from dictionaries or to semantic class labeling. Such labels are widely used in NLP tasks such as word sense disambiguation (WSD), named entity recognition (NER), and query expansion. The content needs regular updates, and the work involves two main tasks: extracting domain-specific terms and assigning domain concepts to them. The approach is closest to corpus-based WSD, which uses the co-occurrence of terms between two corpora.

The data comes mainly from FOLDOC: 14826 unique terms, some with multiple senses, giving 16450 terms in total (13072 direct entries, of which 8621 were manually assigned a domain concept, and 3378 redirects). These yield 188 domain concepts, which are grouped by coarse-grained word senses into 9 super-label domain concepts. N-gram-based bag-of-words (BoW) features measure the semantic similarity between terms and texts. The domain concepts of dictionary terms were also used as features in extended definitions. LDA was used for topic modeling. The authors hypothesized that one topic is associated with one domain concept, so a topic ID was assigned to each dictionary term using topic modeling software and used as an additional feature alongside the BoW features.
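The n-gram bag-of-words features described above can be illustrated with a minimal sketch. This is not the authors' code: the function name `ngram_bow`, the `min_freq` parameter, and the sample definition text are assumptions made for illustration, and the text is assumed to be already tokenized and lemmatized.

```python
from collections import Counter


def ngram_bow(tokens, n_values=(1, 2), min_freq=1):
    """Count word n-grams in a token list.

    Keeps only n-grams that occur at least `min_freq` times.
    `n_values` and `min_freq` are illustrative parameters, not the paper's API.
    """
    counts = Counter()
    for n in n_values:
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}


# Hypothetical definition text (not taken from FOLDOC).
definition = "high level programming language for high level tasks".split()

# 1-grams and 2-grams with frequency >= 1.
features = ngram_bow(definition)
print(features)
```

A semantic feature such as a domain concept label or a topic ID could be added to the same dictionary under its own key, so that a classifier sees it next to the n-gram counts.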

In the experiments, text was replaced with its categories, and then POS tagging and lemmatization were performed. An SVM was used for supervised learning, with 10-fold cross-validation over the 8621 FOLDOC terms that have manually assigned labels. The best performance came from 1- and 2-grams with frequency ≥ 1. Adding rich semantic features (domain concepts and topics) improved performance by approximately 8.6%. Extended semantic features performed poorly, as they tend to introduce more erroneous instances. SVMlin was used for semi-supervised learning, with the same dataset for training and testing and the 4451 unlabeled FOLDOC terms as unlabeled data. Increasing the size of the training data in this way did not significantly improve performance. With simple BoW features, performance improved but did not exceed the best result obtained with BoW + Domain Concept. Adding Domain Concept features even decreased performance once 1000 or more unlabeled terms were added. Performance was compared using the micro-averaged F-score.
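For reference, the micro-averaged F-score used for the comparison pools true positives, false positives, and false negatives over all classes before computing F1. The sketch below assumes one gold label and one predicted label per term; the function name and the label names are hypothetical and are not taken from the paper.

```python
def micro_f_score(gold, predicted):
    """Micro-averaged F1: pool TP, FP, FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in set(gold) | set(predicted):
        for g, p in zip(gold, predicted):
            if p == label and g == label:
                tp += 1  # predicted this class, and it was correct
            elif p == label:
                fp += 1  # predicted this class, but gold differs
            elif g == label:
                fn += 1  # gold is this class, but it was missed
    denominator = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when nothing is counted.
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical domain-concept labels for five terms.
gold = ["hardware", "programming", "networking", "programming", "hardware"]
predicted = ["hardware", "programming", "programming", "programming", "networking"]

score = micro_f_score(gold, predicted)
print(score)
```

When every term receives exactly one predicted label, as assumed here, each error counts once as a false positive and once as a false negative, so the micro-averaged F-score equals plain accuracy.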

Contributions of The Paper

Proposed an automatic method to assign domain concepts to terms in FOLDOC using various contextual features as well as two semantic features: Domain Concept and Topic.

Demonstrated that the system performed best when using rich semantic features directly derived from dictionary terms.

Showed that for the target task, semi-supervised learning did not significantly improve performance, unlike for other tasks.

Comments

The paper currently uses an SVM model. As a possible improvement, a deep learning model such as a CNN could be tried.
