NIHOPA / NLPre

Python library for Natural Language Preprocessing (NLPre)

Problem with titlecaps and dedash #105

Closed grantdjones closed 6 years ago

grantdjones commented 6 years ago

Interesting bug

If a sentence is in all caps and contains a dash, dedash doesn't recognize it. I see that this is intentional in the code, so that's fine.

    >>> x = (dedash(),)
    >>> test = "THIS IS A TEST OF TREAT- MENT"
    >>> for y in x:
    ...     print y(test)
    ...
    THIS IS A TEST OF TREAT- MENT

However, if you decapitalize the sentence first, titlecaps inserts a space after the decapitalized word, which prevents dedash from recognizing it.

    >>> x = (titlecaps(), dedash(),)
    >>> for y in x:
    ...     print y(test)
    ...
    this is a test of treat - ment
    THIS IS A TEST OF TREAT- MENT

This doesn't appear to be a problem as long as the sentence containing the word to be dedashed isn't in ALL CAPS.

    >>> test = "This is a test of treat- ment. AND WHEN IT IS IN ALL CAPS"
    >>> for y in x:
    ...     test = y(test)
    ...     print test
    ...
    This is a test of treatment. AND WHEN IT IS IN ALL CAPS
    This is a test of treatment . and when it is in all caps


titlecaps.py

    sents = sentence_tokenizer(text)

    doc2 = []
    for sent in sents:
        if not is_any_lowercase(sent):

            if len(sent) > self.min_length:
                self.logger.info("DECAPING: '{}'".format(' '.join(sent)))
                sent = [x.lower() for x in sent]

        doc2.append(' '.join(sent))  # THIS IS ONE POSSIBLE PROBLEM

    doc2 = ' '.join(doc2)
    return doc2

I think the step above adds a space when it joins the individual tokens back together.
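The suspected mechanism can be reproduced without NLPre at all. The sketch below is only an illustration: `toy_tokenize`, `toy_decap`, and `toy_dedash` are hypothetical stand-ins, not the library's actual tokenizer, titlecaps, or dedash, and the regular expressions are assumptions chosen to mimic the reported behavior. It shows that once a tokenizer emits the dash as its own token, rejoining with spaces produces `treat - ment`, which a pattern expecting `treat- ment` no longer matches.

```python
import re


def toy_tokenize(text):
    # Hypothetical tokenizer: every run of word characters is one token,
    # and every punctuation mark (including "-") is a separate token.
    return re.findall(r"\w+|[^\w\s]", text)


def toy_decap(text):
    # Stand-in for titlecaps: lowercase the text only if it has no
    # lowercase letters, then rejoin the tokens with single spaces.
    tokens = toy_tokenize(text)
    if not any(c.islower() for c in text):
        tokens = [t.lower() for t in tokens]
    return " ".join(tokens)


def toy_dedash(text):
    # Stand-in for dedash: merge "word- word" into "wordword".
    # It expects the dash to be attached to the first fragment.
    return re.sub(r"(\w+)- (\w+)", r"\1\2", text)


test = "THIS IS A TEST OF TREAT- MENT"
decapped = toy_decap(test)
print(decapped)              # the dash is now surrounded by spaces
print(toy_dedash(decapped))  # unchanged, the pattern no longer matches
```

In this toy version the join itself is harmless; the extra space appears because the dash was already split off as a token before the join.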

thoppe commented 6 years ago

:+1: This is a very good bug report! I'll look into it.

thoppe commented 6 years ago

Fixed. Unit test added. The problem was how sentences were being split: pattern.en considered the dash to be part of the splitting punctuation.
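One way to picture this kind of fix, as a minimal sketch and not the change actually committed to NLPre: keep a trailing dash attached to its word during tokenization, so that rejoining the tokens preserves `treat- ment`. The function names and regular expressions below are hypothetical and exist only for this illustration.

```python
import re


def tokenize_keep_hyphen(text):
    # Hypothetical tokenizer: a word may contain internal hyphens
    # ("well-known") and may carry one trailing hyphen ("treat-").
    # Any other punctuation mark is still emitted as its own token.
    return re.findall(r"\w+(?:-\w+)*-?|[^\w\s]", text)


def decap_keep_hyphen(text):
    # Lowercase only when the text contains no lowercase letters,
    # then rejoin the tokens with single spaces.
    tokens = tokenize_keep_hyphen(text)
    if not any(c.islower() for c in text):
        tokens = [t.lower() for t in tokens]
    return " ".join(tokens)


def join_split_word(text):
    # Simplified dedash stand-in: merge "word- word" into "wordword".
    return re.sub(r"(\w+)- (\w+)", r"\1\2", text)


fixed = decap_keep_hyphen("THIS IS A TEST OF TREAT- MENT")
print(fixed)
print(join_split_word(fixed))
```

A regression test in the same spirit would assert that the decapitalized text still contains `treat- ment` and that the dedash step then yields `treatment`.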

thoppe commented 6 years ago

@grantdjones you can install the latest version with the command pip install nlpre --upgrade