This repository contains a reading list of Software Engineering papers and articles!

Paper Review: Detection of Domain Specific Terminology Using Corpora Comparison #75

Open Md-Nizamuddin opened 3 months ago

Md-Nizamuddin commented 3 months ago


University of Montreal

Link to The Paper


Name of The Authors

Patrick Drouin

Year of Publication



This paper evaluates the usefulness of the corpora comparison approach for identifying uniterms (single-word terms) in the field of communication. It describes a technique that lets term extractors discriminate between lexical items that are potential uniterms. The technique compares the lexicon of an Analysis Corpus (AC) with that of a larger Reference Corpus (RC). The AC is a subset of the technical corpus, and the goal is to identify the terms specific to it by statistically comparing word frequencies in the AC and the RC. TermoStat, a program previously developed by the author, produces the list of specific words: it quantifies how far each word's frequency deviates from a normal distribution, and a threshold is set so that the frequency observed in the AC is unlikely to be incidental. The RC consists of 13,746 articles from The Gazette newspaper, totalling 7.4M tokens and 82.7K word forms. All corpora were tokenized and tagged with Brill's rule-based part-of-speech tagger. TermoStat reduces the nouns of the corpus to their root forms so that it works with lemmas.
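To make the frequency comparison concrete, here is a minimal sketch of one way such a specificity score could be computed. It is an assumption for illustration, not TermoStat's actual implementation: the z-score uses a normal approximation to the binomial, and the function names, the `threshold` default of 3.09 (roughly a one-sided p of 0.001), and the `min_freq` parameter are all hypothetical.

```python
import math
from collections import Counter


def specificity_scores(ac_tokens, rc_tokens, min_freq=1):
    """Return a z-score per AC word: how far its AC frequency deviates
    from the frequency expected if AC and RC shared one distribution.

    Illustrative only; the paper's exact statistic may differ.
    """
    ac_counts = Counter(ac_tokens)
    rc_counts = Counter(rc_tokens)
    n_ac = len(ac_tokens)
    n_total = n_ac + len(rc_tokens)

    scores = {}
    for word, observed in ac_counts.items():
        if observed < min_freq:
            continue
        # Relative frequency of the word in AC and RC taken together.
        p = (observed + rc_counts[word]) / n_total
        expected = n_ac * p
        variance = n_ac * p * (1 - p)
        if variance == 0:
            continue  # degenerate case: every token is the same word
        scores[word] = (observed - expected) / math.sqrt(variance)
    return scores


def specific_words(ac_tokens, rc_tokens, threshold=3.09, min_freq=1):
    """Words whose AC frequency is significantly higher than expected,
    sorted from most to least specific."""
    scores = specificity_scores(ac_tokens, rc_tokens, min_freq)
    kept = [w for w, z in scores.items() if z >= threshold]
    return sorted(kept, key=lambda w: -scores[w])
```

In this sketch a word that is frequent in the AC but rare in the RC gets a large positive score, while a word with the same relative frequency in both corpora scores close to zero. Raising `min_freq` is one way to exclude very rare words, whose scores rest on little evidence.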

Validation consists of two steps:

  1. Automatic validation: the identified subset of the lexicon is compared with a list of terms found in a multilingual terminology database (61K English terms). Here the author wants to evaluate how valuable the comparisons are within the field of terminology studies, on the assumption that the information in specialized terminology databases gives a good, though not perfect, indication of what terminologists need.
  2. Human validation: a list from one of the ACs was submitted to three terminologists specializing in the telecommunication industry, who considered an entry valid if it represents the domain or main topic of the corpus.
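The automatic step can be read as a precision-style measurement: what share of the extracted words also appears in the terminology database. The sketch below is an assumption about how such a check might look, not the paper's procedure. The function name, the case-insensitive matching, and the sample words are hypothetical.

```python
def precision_against_termbank(candidates, termbank):
    """Fraction of candidate words that also occur in a reference term list.

    Matching is case-insensitive exact match, which is a simplification.
    """
    if not candidates:
        return 0.0
    known = {term.lower() for term in termbank}
    hits = [word for word in candidates if word.lower() in known]
    return len(hits) / len(candidates)


# Hypothetical data, for illustration only.
extracted = ["modem", "router", "table", "signal"]
reference_terms = {"Modem", "router", "signal"}
print(precision_against_termbank(extracted, reference_terms))
```

A word missing from the database is not necessarily a bad candidate, which is why the review describes the database as a good but imperfect indicator and pairs it with human validation.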

The validation process showed that 84.1% of the words identified in AC1, 86.1% in AC2, and 73.0% in AC3 were considered relevant. The technique performed poorly on words with a frequency below 5.

Contributions of The Paper

The paper describes the TermoStat software, which employs statistical measures to identify specific words in a technical corpus. It introduces a two-stage term extraction technique that uses corpus comparison to isolate domain-specific terminology. The method compares word frequencies by opposing technical and non-technical corpora, and the high level of precision obtained indicates that the corpus-specific words are useful for day-to-day terminology work.


There is a need for considering semantic aspects in addition to statistical measures.