This research proposes an approach for automatically detecting new domain-specific words to facilitate updating the Danish Dictionary (DDO), using 'known' domain-specific vocabularies derived from the DDO Corpus (DDOC). Every domain-specific text in the corpus carries one of 66 domain labels. New domain-specific vocabulary is then detected by statistically identifying salient words in new text material that are not yet registered in the dictionary. DDOC comprises roughly 43,000 text samples (about 40 million words), yielding 66 different domain vocabularies. Tokenization is primitive, with no further preprocessing. Salience is determined with a log-likelihood significance test, as it yields acceptable results within the corpus-comparison framework.
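The paper does not spell out what the "primitive tokenizer" does; a minimal sketch of unpreprocessed tokenization might look like this (the regex and function name are assumptions, not the paper's code):

```python
import re

def tokenize(text: str) -> list[str]:
    # Primitive tokenization: lowercase the text and split on anything
    # that is not a letter (Danish letters included). No lemmatization,
    # stemming, or other preprocessing is applied.
    return [t for t in re.split(r"[^a-zæøå]+", text.lower()) if t]
```

For example, `tokenize("Det nye ord, fx 'weblog'.")` yields `["det", "nye", "ord", "fx", "weblog"]`.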
Two sources of uncertainty arise here: (a) the significance level, where a threshold of p >= 0.99 determines which words are salient enough to be included in each domain's vocabulary; a high threshold means fewer but more significant words. (b) Common words, which are included because of their role in phrases, but whose high frequency must be managed to avoid absurd results.
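The log-likelihood comparison can be sketched with the standard two-corpus contingency formulation. The function names, the overrepresentation check, and the exact critical value (6.63, the chi-square cutoff at one degree of freedom for 99% confidence) are assumptions layered on the paper's description, not its actual code:

```python
import math

def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Log-likelihood statistic for two-corpus comparison.

    a: frequency of the word in the domain subcorpus
    b: frequency of the word in the reference corpus
    c: total tokens in the domain subcorpus
    d: total tokens in the reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the domain
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

# Critical value for 99% confidence (chi-square, 1 degree of freedom).
CRITICAL_99 = 6.63

def is_salient(a: int, b: int, c: int, d: int) -> bool:
    # Keep only words overrepresented in the domain that clear the threshold.
    return a / c > b / d and log_likelihood(a, b, c, d) > CRITICAL_99
```

A word occurring 50 times in a 10,000-token domain sample but only 10 times in a 100,000-token reference corpus is clearly salient; a word occurring at the same relative rate in both gives a statistic of zero.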
For classification: the words of a text are compared with each domain's vocabulary, and the domain sharing the most words with the text is chosen as the text's classification. A word with higher frequency in the text matters more than one with fewer occurrences. Differences in vocabulary size are balanced by scaling each domain's score by 1/sqrt(D), where D is the vocabulary size. The more important relevant words are given a high rank within a domain. Finally, the text is assigned to the domain with the highest score.
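The scoring step above can be sketched as follows; the data structures and function signature are illustrative assumptions, with only the frequency weighting and the 1/sqrt(D) normalization taken from the description:

```python
import math
from collections import Counter

def classify(text_tokens: list[str], domain_vocabs: dict[str, set[str]]) -> str:
    """Assign a text to the highest-scoring domain.

    domain_vocabs maps each domain name to its vocabulary (a set of
    words). Each vocabulary word found in the text contributes its
    text frequency to the domain's score, so frequent words matter
    more; the score is then divided by sqrt(D), with D the vocabulary
    size, to balance unequal vocabulary sizes.
    """
    freqs = Counter(text_tokens)
    scores = {}
    for domain, vocab in domain_vocabs.items():
        overlap = sum(f for word, f in freqs.items() if word in vocab)
        scores[domain] = overlap / math.sqrt(len(vocab))
    return max(scores, key=scores.get)
```

The square-root normalization is a compromise: dividing by D outright would punish large vocabularies too harshly, while no normalization would let them dominate.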
Contributions of The Paper
Automates a process that traditionally requires extensive manual labor.
Creates domain-specific vocabularies from an extensive corpus of text samples.
Applies a statistical log-likelihood test to provide an effective text-classification method.
Publisher
University of Birmingham
Link to The Paper
https://www.birmingham.ac.uk/Documents/college-artslaw/corpus/conference-archives/2005-journal/Lexiconodf/asmussenpaper.pdf
Name of The Author
Jørg Asmussen
Year of Publication
2005
Comments
Mathematical Approach