clips / pattern

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
https://github.com/clips/pattern/wiki
BSD 3-Clause "New" or "Revised" License
8.75k stars 1.58k forks

[BUG] The Singularize function is extremely bad TBH #254

Open tim5go opened 5 years ago

tim5go commented 5 years ago

The built-in singularize function yields lots of false positives. Here are some examples:

1) business
2) virginia
3) tour
4) loss

I ended up having to maintain my own exception dictionary, which is really inconvenient. I know it's hard to cover all cases, but some of these false positives are really trivial. I am quite disappointed, given how many stars this repo has.
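For readers hitting the same problem, here is a minimal sketch of that kind of exception dictionary, written as a wrapper around any singularizer. The names `make_safe_singularize` and `naive_singularize` are hypothetical, and `naive_singularize` is only a stand-in so the snippet runs without pattern installed; it is not pattern's algorithm. With pattern available you would presumably pass `pattern.en.singularize` as `base` instead.

```python
from typing import Callable, Iterable, Mapping, Optional

# Words from this report that should be returned unchanged.
PROTECTED_WORDS = frozenset({"business", "virginia", "tour", "loss"})


def naive_singularize(word: str) -> str:
    """Stand-in singularizer (assumption, NOT pattern's algorithm):
    strips one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


def make_safe_singularize(
    base: Callable[[str], str],
    protected: Iterable[str] = PROTECTED_WORDS,
    overrides: Optional[Mapping[str, str]] = None,
) -> Callable[[str], str]:
    """Wrap `base` so that known-bad inputs never reach it."""
    protected_set = {w.lower() for w in protected}
    override_map = {k.lower(): v for k, v in (overrides or {}).items()}

    def safe(word: str) -> str:
        key = word.lower()
        if key in override_map:      # explicit plural -> singular mapping
            return override_map[key]
        if key in protected_set:     # already singular, leave untouched
            return word
        return base(word)            # everything else goes to the library

    return safe


safe_singularize = make_safe_singularize(
    naive_singularize, overrides={"geese": "goose"}
)
print(safe_singularize("loss"))   # protected, returned unchanged
print(safe_singularize("cats"))   # delegated to the base singularizer
print(safe_singularize("geese"))  # resolved through the override map
```

Keeping the exceptions outside the library call means the list can grow without patching pattern itself. pattern's own `singularize` also appears to take a `custom` dictionary argument for the same purpose, but treat that as an assumption to check against your installed version.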

fuzheado commented 5 years ago

Interesting. Those are indeed problematic. I've generally been happy with the vast majority of "singularized" words, but I'll add a few that were problems for me:

cross->cros
goddess->goddes
sadness->sadnes
sarcophagus->sarcophagu
putti->puttus (should be putto)
world war ii->world war ius

Also adding the errors as described above:

business->business
virginia->virginium
tour->tmy
loss->los
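The cases collected so far could be turned into a small regression check. This is only a sketch: `REPORTED_CASES` and `find_failures` are hypothetical names, the expected values are the ones implied in this thread, and the singularizer is passed in as a parameter so the snippet does not depend on any particular pattern version.

```python
from typing import Callable, Dict, Mapping, Tuple

# Input word -> singular form the thread expects.
REPORTED_CASES: Dict[str, str] = {
    "cross": "cross",
    "goddess": "goddess",
    "sadness": "sadness",
    "sarcophagus": "sarcophagus",
    "putti": "putto",
    "world war ii": "world war ii",
    "business": "business",
    "virginia": "virginia",
    "tour": "tour",
    "loss": "loss",
}


def find_failures(
    singularize: Callable[[str], str],
    cases: Mapping[str, str] = REPORTED_CASES,
) -> Dict[str, Tuple[str, str]]:
    """Return {word: (expected, actual)} for every case that fails."""
    failures = {}
    for word, expected in cases.items():
        actual = singularize(word)
        if actual != expected:
            failures[word] = (expected, actual)
    return failures


if __name__ == "__main__":
    # Swap in the real function here, e.g. pattern.en.singularize,
    # if pattern is installed (assumption: it takes a single word).
    def identity(word: str) -> str:
        return word

    for word, (expected, actual) in find_failures(identity).items():
        print(f"{word}: expected {expected!r}, got {actual!r}")
```

Running this against a candidate fix gives a quick list of which reported words still break, and new reports can be appended to the dictionary as they come in.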

AdLucem commented 5 years ago

Hi, I'm a new contributor to this repo and I'd like to try my hand at solving this.

The documentation states that the singularize function in pattern.text.en.inflect, which seems to be the problem here, was adapted from this repo: https://github.com/bermi/Python-Inflector. Was it taken directly, or were changes made? I'm wondering whether I should go over to the inflector repo to see how the singularize algorithm works, or just look at the one here.

ndvbd commented 5 years ago

I also get viruses->viruse. Is there an updated model file or something? Is this solved in 3.6?