EFord36 / normalise

A module for normalising text.
GNU General Public License v3.0
173 stars 33 forks source link

User abbreviation not working properly #109

Closed javidalkaruzi closed 5 years ago

javidalkaruzi commented 7 years ago

Good day,

I am trying your library but it is not working as expected. I am trying to use custom abbreviation but it does not replace the abbreviation with the equivalent word. I have even tried your example that is listed in the read me and I got the following output.

['four bdrm', 'house', 'for', 'sale', ',', 'four hundred and fifty nine thousand pounds', 'ONO']

pikaliov commented 5 years ago

I have same problem

derekhsu commented 5 years ago

Me, too. Was this fixed already?

EFord36 commented 5 years ago

Hi,

Short version

can you try downloading nltk's averaged_perceptron_tagger and universal_tagset in your environment?

import nltk
nltk.download("averaged_perceptron_tagger")
nltk.download("universal_tagset")

Long version

Apologies for slowness with the issue

Sorry for the very slow reply to the original issue - I had been meaning to look into it at the time but never got around to it and forgot about it completely. Unfortunately, maintaining this repo hasn't been a top priority for me. One of the reasons I didn't put more effort into tracking down the issue at the time was that it was something I couldn't reproduce on my laptop and I feared it would be tricky to debug.

Root cause

However, I've just looked into it and I believe I've found the cause of the problem. I found I was able to reproduce the issue in a fresh environment. For me, this bare except was causing issues - it's a last-ditch effort to just ensure that we still get some output even if something's gone wrong, and just return the original 'non-standard word' before we start trying to normalise it. However, this except clause was covering up the fact that nltk resources that the repo expects might not be present.

Follow up actions

If I were writing this program again, I wouldn't rely on this bare except, which can cover up a multitude of sins (as it did here). Unfortunately, I don't have time at the moment for a substantial rewrite, but I should be able to find time to add an extra except clause for a LookupError, which is what is raised by nltk when it doesn't have the expected resources. That should make the required download transparent to other users. Does that sound reasonable as a mitigation?

Thanks everyone on the ticket for pointing out the issue, and sorry again for the slow responce!

EFord36 commented 5 years ago

Hi,

There is a fix for this in release 0.1.8 which is on both pypi and Github. The fix reaises the LookupErrors so that it is clear to the user what needs to be done. There doesn't appear to be a standard pattern for including nltk data dependencies in python packages, but I'll be sure to update this ticket if I find a solution which install the relevant dependencies automatically, which would obviously be a better experience for users than having to run the package repeatedly and deal with the errors.

Thanks for pointing out the issue!

Best,

Elliot