clips / pattern

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
https://github.com/clips/pattern/wiki
BSD 3-Clause "New" or "Revised" License

Improved the singularize method in inflect.py #220

Open TanyaaCJain opened 6 years ago

TanyaaCJain commented 6 years ago

The singularize method previously reached 95% accuracy when measured against CELEX English morphology word forms. The following changes raise that accuracy to 99%:

  1. Added more words to the set singular_uninflected

  2. In the singularize method, changed the if-condition for the set singular_uninflected from if x.endswith(w): return word to if x == w or w == x + "s": return x. The former condition treated the words in the set as word endings, so it also affected words formed by adding a prefix to a word in the set. The new condition checks whether the word passed as the argument appears in the set either as is or followed by an "s". It then returns the singular form from the set rather than the word itself, which may have been passed in plural form.

  3. Added more words to the list singular_uncountable, grouped under comments (abstract ideas and expressions, natural phenomena, general, etc.) for ease of reading and understanding

  4. Added more words to the list singular_ie and the dictionary singular_irregular

  5. Words that could be grouped by a pattern were written as regular expressions (regex) in singular_rules instead of being added one by one to the above-mentioned lists and dictionaries.

  6. In the singularize method, changed the if-condition for the dictionary singular_irregular from if w.endswith(x): to if x == w:. The former condition treated each key x in the dictionary as an ending of the word passed to the singularize method. The new condition checks whether the word w passed as the argument is itself a key in the dictionary by comparing it to x. If it is, the method returns the singularized form of w, that is, singular_irregular[x]

  7. Added more regular expressions to the list singular_rules to cover more singularization rules and improve the accuracy of the singularize method.

  8. As a result, this commit resolves the following currently open issues:

    Issue - singularized word - earlier result - current result

    #141, #175 - flour - flmy - flour

    #141 - colour - colmy - colour

    #141 - your - ymy - your

    #141 - olives - olife - olive

    #176 - hummus - hummu - hummus

  9. The words added to the sets singular_uninflected and singular_uncountable were also added to the lists plural_categories["uninflected"] and plural_categories["uncountable"] for consistency.
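The difference between the suffix checks and the exact-match checks described in items 2 and 6 can be sketched roughly as follows. This is a simplified illustration, not the code in inflect.py: the names IRREGULAR and UNINFLECTED and their entries are toy stand-ins assumed for the example, chosen so that the suffix version reproduces the results reported in the table above.

```python
import re

# Toy stand-ins for singular_irregular and singular_uninflected.
# The entries are assumptions for illustration, not the real contents.
IRREGULAR = {"our": "my", "lives": "life"}
UNINFLECTED = {"hummus", "bison"}


def irregular_by_suffix(word):
    """Old-style check: every dictionary key is treated as a word ending."""
    w = word.lower()
    for plural, singular in IRREGULAR.items():
        if w.endswith(plural):
            return re.sub("(?i)" + re.escape(plural) + "$", singular, word)
    return word


def irregular_by_exact_match(word):
    """New-style check: only a whole-word match triggers the replacement."""
    w = word.lower()
    for plural, singular in IRREGULAR.items():
        if plural == w:
            return singular
    return word


def uninflected_lookup(word):
    """New-style check from item 2: match the entry itself or entry + 's'."""
    w = word.lower()
    for x in UNINFLECTED:
        if x == w or w == x + "s":
            return x  # return the stored singular, not the input word
    return None  # no match; the real method would continue with later rules


print(irregular_by_suffix("flour"))       # flmy
print(irregular_by_exact_match("flour"))  # flour
print(uninflected_lookup("bisons"))       # bison
```

In this sketch a word that matches nothing is returned unchanged (or as None), whereas the real method goes on to the regex rules in singular_rules, which is where a word such as olives would be handled.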

Note that the 99% accuracy was measured by running corpora/test_en.py and applies only to the dataset of CELEX English morphology word forms.
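For readers unfamiliar with how such a figure is obtained, an accuracy of this kind is typically the share of (singular, plural) pairs for which singularizing the plural gives back the singular. The sketch below shows that calculation under stated assumptions: the helper accuracy, the naive singularizer, and the four word pairs are hypothetical and do not reproduce test_en.py or the CELEX data.

```python
def accuracy(pairs, singularize):
    """Fraction of (singular, plural) pairs that singularize() gets right."""
    if not pairs:
        raise ValueError("need at least one word pair")
    correct = sum(1 for singular, plural in pairs
                  if singularize(plural) == singular)
    return correct / len(pairs)


def naive_singularize(word):
    """Hypothetical baseline: strip one trailing 's' and nothing else."""
    return word[:-1] if word.endswith("s") else word


# Made-up sample pairs; a real evaluation would use a full word list.
sample_pairs = [
    ("cat", "cats"),
    ("olive", "olives"),
    ("hummus", "hummus"),
    ("flour", "flour"),
]

print(accuracy(sample_pairs, naive_singularize))  # 0.75
```

The baseline misses hummus (it returns hummu), which is why the score on this tiny sample is 0.75; a figure computed this way says nothing about words outside the evaluated list.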