RAISEDAL / RAISEReadingList

This repository contains a reading list of Software Engineering papers and articles!

Paper Review: Automatic detection of new domain-specific words, using document classification and frequency profiling #77

Open Md-Nizamuddin opened 3 months ago

Publisher

University of Birmingham

Link to The Paper

https://www.birmingham.ac.uk/Documents/college-artslaw/corpus/conference-archives/2005-journal/Lexiconodf/asmussenpaper.pdf

Name of The Authors

Jørg Asmussen

Year of Publication

2005

Summary

This research proposes an approach for automatically detecting new domain-specific words, to facilitate updating the Danish Dictionary (DDO), using 'known' domain-specific vocabularies derived from DDOC. DDOC comprises about 43k text samples (about 40M words). Every domain-specific text carries a domain label, with 66 domains in total, which yields 66 different vocabularies. New domain-specific vocabulary is then detected by statistically determining salient words in new text material that are not yet registered in the dictionary.

The approach uses a primitive tokenizer without any preprocessing. For the comparison it uses a statistical significance test based on log-likelihood, as this yields acceptable results within the corpus comparison framework. There are two types of uncertainty here: (a) The significance level: a threshold of p >= 0.99 determines which words are important enough to be included in each domain's vocabulary. A higher threshold means fewer but more significant words. (b) Common words: these are included because of their role in phrases, but their high frequency is managed to avoid absurd results.

For classification, the words in a text are compared with each domain's vocabulary, and the domain with the most words in common is chosen as the classification for the text. A word with a higher frequency matters more than one with fewer occurrences. Vocabulary size is balanced by a 1/sqrt(D) factor in the domain score. The more relevant words are given a higher rank in a domain. Finally, the text is assigned to the domain with the highest score.
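To make the pipeline summarized above more concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes the two-term log-likelihood (G2) statistic commonly used for corpus comparison, reads the p >= 0.99 threshold as the chi-square critical value 6.63 (1 degree of freedom), takes D to be the number of words in a domain's vocabulary, and weights each shared word by its frequency in the text. All function names are hypothetical, and the handling of common words and of word rank is left out.

```python
import math
from collections import Counter

# Assumption: p >= 0.99 read as the chi-square critical value for 1 degree of freedom.
CRITICAL_G2 = 6.63


def log_likelihood(a, b, c, d):
    """Two-term G2 for a word seen a times in a domain corpus of c tokens
    and b times in a reference corpus of d tokens."""
    total = a + b
    expected_domain = c * total / (c + d)
    expected_reference = d * total / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_domain)
    if b > 0:
        g2 += b * math.log(b / expected_reference)
    return 2.0 * g2


def domain_vocabulary(domain_counts, reference_counts):
    """Keep words that are significantly over-represented in the domain.
    Both count mappings are assumed to be non-empty."""
    c = sum(domain_counts.values())
    d = sum(reference_counts.values())
    vocabulary = {}
    for word, a in domain_counts.items():
        b = reference_counts.get(word, 0)
        if a / c <= b / d:
            continue  # not more frequent in the domain than in the reference
        g2 = log_likelihood(a, b, c, d)
        if g2 >= CRITICAL_G2:
            vocabulary[word] = g2
    return vocabulary


def new_word_candidates(vocabulary, dictionary_words):
    """Salient words not yet registered in the dictionary, most salient first."""
    unknown = [w for w in vocabulary if w not in dictionary_words]
    return sorted(unknown, key=lambda w: vocabulary[w], reverse=True)


def classify(tokens, vocabularies):
    """Score each domain by frequency-weighted overlap, scaled by 1/sqrt(D),
    where D is assumed to be the size of the domain vocabulary."""
    counts = Counter(tokens)
    scores = {}
    for domain, vocabulary in vocabularies.items():
        if not vocabulary:
            scores[domain] = 0.0
            continue
        overlap = sum(freq for word, freq in counts.items() if word in vocabulary)
        scores[domain] = overlap / math.sqrt(len(vocabulary))
    best_domain = max(scores, key=scores.get)
    return best_domain, scores
```

In this sketch a domain vocabulary is a mapping from word to G2 score, so the same structure serves both steps: `new_word_candidates` filters it against the dictionary, and `classify` uses only its keys and its size.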

Contributions of The Paper

The paper automates a process that traditionally requires a lot of manual labor. It creates domain-specific vocabularies from an extensive corpus of text samples. The author uses a statistical test (the log-likelihood method) to build an effective text classification method.

Comments

Mathematical Approach