Closed martinpopel closed 9 years ago
Thanks for the information. You are right that multiword entities are not nouns. Do you see any other problems? Does it imply that there are errors in Wiktionary (which is manually created)?
I actually used these lists to check whether a noun is (or can be) in plural form, and I did not see any problems.
I had not inspected the lists properly; I had only seen the first few lines of each. There are also numbers etc. which could be better handled by rules (regex) if we decided that e.g. "1870s" fits the pluralia tantum category or that the plural of "303" is "303s". On the other hand, I agree that simple CSV files are easy to use in any programming language and that a few extra MiB of memory are not a problem today.
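To illustrate what I mean by rules, here is a minimal Python sketch. The function names and the two patterns are only my assumptions for the sake of the example; they are not taken from this repository or from Treex, and whether "1870s" should count as plural at all is still the open question above.

```python
import re

# Hypothetical rules for illustration only; not part of either project.
BARE_NUMBER_RE = re.compile(r"\d+")     # e.g. "303", "1870"
PLURAL_NUMBER_RE = re.compile(r"\d+s")  # e.g. "303s", "1870s"


def pluralize_number(token):
    """Return the plural of a bare number ("303" -> "303s"), else None."""
    if BARE_NUMBER_RE.fullmatch(token):
        return token + "s"
    return None


def is_plural_number(token):
    """Return True if the token looks like a pluralized number ("1870s")."""
    return PLURAL_NUMBER_RE.fullmatch(token) is not None
```

With something like this, the numeric entries would not need to be stored in the lists at all, and the CSV files would only hold real lexical exceptions.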
This repository seems to contain huge lists with many false positives (e.g. multiword entities). Similar lists (shorter, but manually created) can be found here: https://github.com/ufal/treex/tree/master/lib/Treex/Tool/EnglishMorpho/exceptions
It is a part of a bigger Perl-based project with modules for English morphology ("What are the possible parts of speech of 'can'?"), part-of-speech tagging (disambiguating the possible tags to the most probable variant within a sentence context) and lemmatization.
This issue does not report any error, so feel free to close it. It is just a reference to a similar project in case someone needs it.