The singularize method previously reached 95% accuracy when measured against the CELEX English morphology word forms. The following changes raise the accuracy to 99%:
Added more words to the set singular_uninflected.
In the singularize method, changed the if-condition for the set singular_uninflected from
if x.endswith(w): return word
to
if x == w or w == x + "s": return x
because the former was a suffix comparison rather than an exact match, so it also matched words that only share an ending with an entry in the set.
The new condition checks whether the word passed as argument equals an entry in the set, either as is or with a trailing "s". It then returns the entry x, which is the singular form, rather than the input word, which may have been passed in plural form.
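The difference between the two conditions can be sketched in isolation. This is a minimal illustration, not the library code: the two-entry set and the helper names old_check and new_check are hypothetical, and only the if-conditions are taken from the description above.

```python
# Hypothetical two-entry stand-in for the real singular_uninflected set.
singular_uninflected = {"bison", "salmon"}


def old_check(word):
    """Previous condition: suffix comparison between set entry and word."""
    w = word.lower()
    for x in singular_uninflected:
        if x.endswith(w):
            return word
    return None  # no match; the real method would continue with other rules


def new_check(word):
    """New condition: exact match, with or without a trailing "s"."""
    w = word.lower()
    for x in singular_uninflected:
        if x == w or w == x + "s":
            return x  # the set entry, i.e. the singular form
    return None  # no match; the real method would continue with other rules


print(old_check("on"))      # matched only because "bison" ends with "on"
print(new_check("on"))      # no exact match
print(new_check("bisons"))  # plural input returns the set entry
```

In this sketch the new check returns the lowercase set entry, so the capitalization of the input is not preserved.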
Added more words to the list singular_uncountable, grouped under comments by category (abstract ideas and expressions, natural phenomena, general, etc.) for ease of reading.
Added more words to the list singular_ie and the dictionary singular_irregular.
Words that follow a common pattern were covered by regular expressions in singular_rules instead of being added one by one to the lists and dictionaries above.
In the singularize method, changed the if-condition for the dictionary singular_irregular from
if w.endswith(x):
to
if x == w:
because the former treated each key x in the dictionary as a possible ending of the word passed to singularize, so any word that merely ended in a key was rewritten. The new condition checks whether the word w passed as argument equals the key x exactly. If it does, the method returns the singular form of w, that is, singular_irregular[x].
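A small sketch of why the suffix test misfires on the words from issue #141. The dictionary here is a hypothetical two-entry stand-in; the "our" entry is an assumption chosen because it reproduces the "flmy" output reported for "flour". The helper names and the re.sub replacement step are likewise illustrative, not a copy of the library code.

```python
import re

# Hypothetical stand-in for singular_irregular; entries are assumptions.
singular_irregular = {"our": "my", "children": "child"}


def old_irregular(word):
    """Previous condition: any word ending in a key is rewritten."""
    w = word.lower()
    for x in singular_irregular:
        if w.endswith(x):
            # Replace the matched ending with the singular form.
            return re.sub("(?i)" + re.escape(x) + "$", singular_irregular[x], word)
    return None  # no match; later rules would apply


def new_irregular(word):
    """New condition: only an exact key match is rewritten."""
    w = word.lower()
    for x in singular_irregular:
        if x == w:
            return singular_irregular[x]
    return None  # no match; later rules would apply


print(old_irregular("flour"))     # rewritten because "flour" ends with "our"
print(new_irregular("flour"))     # left for the later rules
print(new_irregular("children"))  # exact key, returns the singular form
```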
Added more regular expressions to the list singular_rules to cover additional singularization patterns and improve the accuracy of the singularize method.
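For readers unfamiliar with rule lists of this kind, the sketch below shows one way a list of (pattern, replacement) pairs can be applied, with the first matching rule winning. The three rules and the helper apply_rules are hypothetical examples, not the entries added in this commit, and the actual matching loop in singularize may differ.

```python
import re

# Hypothetical rules for illustration only; order matters, first match wins.
singular_rules = [
    (re.compile(r"(?i)([^aeiou])ies$"), r"\1y"),  # ponies -> pony
    (re.compile(r"(?i)(x|ch|ss|sh)es$"), r"\1"),  # boxes -> box
    (re.compile(r"(?i)([^s])s$"), r"\1"),         # cats -> cat, "glass" untouched
]


def apply_rules(word):
    """Return the word rewritten by the first rule whose pattern matches."""
    for pattern, replacement in singular_rules:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word  # no rule matched


for plural in ("ponies", "boxes", "cats", "glass"):
    print(plural, "->", apply_rules(plural))
```

A single pattern such as the second rule covers a whole family of words, which is the reason for preferring a rule over individual list entries where a pattern exists.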
This commit therefore resolves the following open issues:
Issue - singularized word - earlier result - current result
#141, #175 - flour - flmy - flour
#141 - colour - colmy - colour
#141 - your - ymy - your
#141 - olives - olife - olive
#176 - hummus - hummu - hummus
The words added to the sets singular_uninflected and singular_uncountable were also added to the lists plural_categories["uninflected"] and plural_categories["uncountable"] for consistency.
Note that the 99% accuracy was measured with corpora/test_en.py and applies only to the CELEX English morphology word forms dataset.