data-dart / bookend


clean and tokenize issues #22

emapple opened this issue 4 years ago (status: open)

emapple commented 4 years ago

There are a few issues with the clean() and tokenize() methods:

clean: 1) It replaces - with an empty string, which has the effect of concatenating words (e.g., hello--world becomes helloworld). We should instead replace - with a space.

tokenize: 1) This should exclude sentences that are just chapter headings (including roman numerals). A rough sketch of both fixes is below.
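Something like this could work (the function names and heading pattern here are just illustrative, not our actual API):

```python
import re

def clean(text):
    # Replace runs of hyphens/dashes with a space, so "hello--world"
    # becomes "hello world" instead of "helloworld".
    return re.sub(r'-+', ' ', text)

# A standalone chapter heading: optional "Chapter", then a roman or
# arabic numeral, and nothing else on the line.
HEADING_RE = re.compile(r'^\s*(chapter\s+)?([ivxlcdm]+|\d+)\.?\s*$',
                        re.IGNORECASE)

def tokenize(sentences):
    # Drop "sentences" that are really just chapter headings.
    return [s for s in sentences if not HEADING_RE.match(s)]
```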

kdettman commented 4 years ago

Additionally, because the cleaning functions rely on word_tokenize, extra spaces are being added around all punctuation. This is because, to reform the tokens into a single string, I just used " ".join(lemmatized). We should try to recreate the full string in a more clever way.
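One possibility, if we stick with NLTK, is its Treebank detokenizer, which reattaches punctuation to the neighboring word instead of joining everything on spaces (a sketch, not wired into our pipeline):

```python
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

tokens = word_tokenize("It was a dark, stormy night.")
# word_tokenize splits punctuation into separate tokens:
# ['It', 'was', 'a', 'dark', ',', 'stormy', 'night', '.']

# Detokenizing (instead of " ".join(tokens)) puts punctuation back
# without the extra surrounding spaces.
text = TreebankWordDetokenizer().detokenize(tokens)
# 'It was a dark, stormy night.'
```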

emapple commented 4 years ago

I don't think extra spaces really matter - any analysis we do ignores whitespace anyway, right?

kdettman commented 4 years ago

If you're making n-grams, spaces count as characters. Though I suppose you could probably get the same results if you just removed them.
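e.g., with character 4-grams, the space after the period is a character like any other:

```python
def char_ngrams(text, n):
    # Every window of n consecutive characters, spaces included.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

char_ngrams("end. Next", 4)
# ['end.', 'nd. ', 'd. N', '. Ne', ' Nex', 'Next']
```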

emapple commented 4 years ago

In that case, something along the lines of re.sub(r'\s+', ' ', self._text) will collapse runs of whitespace to a single space (\s+ matches any whitespace, including newlines).
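For example (with a plain string standing in for self._text):

```python
import re

text = "Hello   world,\n\nit is  me."
# Collapse every run of whitespace (spaces, tabs, newlines) to one space.
re.sub(r'\s+', ' ', text)
# 'Hello world, it is me.'
```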

kdettman commented 4 years ago

The issue isn't double spaces, it's single spaces being inserted where they shouldn't be. For example, the sentence "Hello world, it is me." gets turned into "Hello world , it is me ."

emapple commented 4 years ago

How do n-grams address the end of a sentence?

We could also do something like text.replace(' .', '.')
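Or, generalizing that to the other punctuation marks word_tokenize splits off (just a sketch):

```python
import re

tokenized = "Hello world , it is me ."
# Remove the space inserted before each closing punctuation mark.
re.sub(r'\s+([.,;:!?)\]])', r'\1', tokenized)
# 'Hello world, it is me.'
```

Opening quotes and brackets would need the opposite fix (dropping the space after them), but this covers the common cases.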

kdettman commented 4 years ago

n-grams are agnostic to the ends of sentences as long as another sentence follows. But that is a good fix idea.