EFord36 / normalise

A module for normalising text.
GNU General Public License v3.0

wrong normalization #117

Closed cmatosve closed 5 years ago

cmatosve commented 6 years ago

Hello, I am trying to use your tool, but I am getting the wrong output for the example you provide in your paper.

The input from the paper: On the 13 Feb. 2007, Rt. Hon. Theresa May MP announced on ITV News that the rate of childhod obesity had risen from 7.3-9.6% in just 3 years, costing the Gov. £20m #politics.

The output from the tool is this: On the thirteenth of Feb. two thousand and seven , Rt. Hon. Theresa May M P announced on I T V News that the rate of childhood obesity had risen from seven point three to nine point six percent in just three years , costing the Gov. twenty million pounds hashtag politics

It looks like the class EXPN is not working. I have a warning when running the tool:

UserWarning: Trying to unpickle estimator LabelPropagation from version 0.18 when using version 0.19.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)

Do you have any suggestions for solving this problem?

Thanks!

EFord36 commented 5 years ago

Hi,

Sorry for the slow reply. This looks like a duplicate of issue #109 - see this comment:

Hi,

Short version

Can you try downloading nltk's averaged_perceptron_tagger and universal_tagset in your environment?

import nltk
nltk.download("averaged_perceptron_tagger")
nltk.download("universal_tagset")

Long version

Apologies for slowness with the issue

Sorry for the very slow reply to the original issue - I had been meaning to look into it at the time, but never got around to it and forgot about it completely. Unfortunately, maintaining this repo hasn't been a top priority for me. One of the reasons I didn't put more effort into tracking down the issue at the time was that I couldn't reproduce it on my laptop, and I feared it would be tricky to debug.

Root cause

However, I've just looked into it, and I believe I've found the cause of the problem: I was able to reproduce the issue in a fresh environment. For me, this bare except was the culprit - it's a last-ditch effort to ensure that we still get some output even if something has gone wrong, by just returning the original 'non-standard word' as it was before we started trying to normalise it. However, that except clause was also covering up the fact that nltk resources the repo expects might not be present.
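To make the failure mode concrete, here's a minimal sketch of the pattern (illustrative only, not the repo's actual code - expand_nsw and the tag-based expansion step are stand-ins):

import nltk

def expand_nsw(word):
    """Hypothetical stand-in for the expansion step."""
    try:
        # nltk.pos_tag raises a LookupError if 'averaged_perceptron_tagger'
        # or 'universal_tagset' hasn't been downloaded
        tag = nltk.pos_tag([word], tagset="universal")[0][1]
        return word + " (" + tag + ")"  # real code would expand the word here
    except:
        # bare except: the LookupError is swallowed and the original
        # non-standard word comes back unexpanded, with no hint of why
        return word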

Follow-up actions

If I were writing this program again, I wouldn't rely on this bare except, which can cover up a multitude of sins (as it did here). Unfortunately, I don't have time at the moment for a substantial rewrite, but I should be able to find time to add an extra except clause for LookupError, which is what nltk raises when it doesn't have the expected resources. That should make the required download transparent to other users. Does that sound reasonable as a mitigation?
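A rough sketch of that mitigation (again illustrative, not the repo's actual code) would add an explicit except LookupError before the catch-all and re-raise it, so nltk's own "please use nltk.download()" message reaches the user:

import nltk

def expand_nsw(word):
    try:
        tag = nltk.pos_tag([word], tagset="universal")[0][1]
        return word + " (" + tag + ")"  # placeholder for the real expansion
    except LookupError:
        # raised by nltk when the tagger data is missing; re-raise so the
        # user sees nltk's instructions for downloading the resource
        raise
    except:
        # fallback for anything else: return the word unchanged
        return word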

Thanks everyone on the ticket for pointing out the issue, and sorry again for the slow response!

EFord36 commented 5 years ago

Hi,

There is a fix for this in release 0.1.8, which is on both PyPI and GitHub. The fix re-raises the LookupError so that it is clear to the user what needs to be done. There doesn't appear to be a standard pattern for including nltk data dependencies in Python packages, but I'll be sure to update this ticket if I find a solution that installs the relevant dependencies automatically, which would obviously be a better experience for users than having to run the package repeatedly and deal with the errors.
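In the meantime, one common (though unofficial) workaround seen in other projects - and just a sketch, not something the package currently does - is to check for the resources at import time and download them if they're missing:

import nltk

# fetch the tagger data on first use if it isn't already present
for resource, path in [("averaged_perceptron_tagger", "taggers/averaged_perceptron_tagger"),
                       ("universal_tagset", "taggers/universal_tagset")]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(resource, quiet=True)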

Thanks for pointing out the issue!

Best,

Elliot