Open emapple opened 4 years ago
Additionally, because the cleaning functions rely on `word_tokenize`, extra spaces are being added around all punctuation. This is because, to reform the tokens into a single string, I just used `(" ").join(lemmatized)`. We should try to recreate the full string in a cleverer way.
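One less naive way to rebuild the string is to join the tokens and then strip the space the tokenizer inserted before punctuation; a minimal sketch (NLTK's `TreebankWordDetokenizer` is another option, though it pulls in more machinery):

```python
import re

def detokenize(tokens):
    # Join tokens with spaces, then remove the space that
    # word_tokenize leaves before closing punctuation marks.
    text = " ".join(tokens)
    return re.sub(r"\s+([.,!?;:])", r"\1", text)

detokenize(["Hello", "world", ",", "it", "is", "me", "."])
# "Hello world, it is me."
```

This only handles trailing punctuation; quotes and contractions would need extra rules, which is where a real detokenizer earns its keep.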
I don't think extra spaces really matter - any analysis we do ignores whitespace anyway, right?
If you're making n-grams, spaces count as characters. Though I suppose you could probably get the same results if you just removed them.
In that case, something along the lines of `re.sub(r'\s+', ' ', self._text)` will limit whitespace to one space in a row (`\s+` matches any whitespace, including newlines).
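For reference, a minimal sketch of that whitespace collapse:

```python
import re

def collapse_whitespace(text):
    # \s matches spaces, tabs, and newlines; the + quantifier
    # collapses each run of whitespace into a single space.
    return re.sub(r"\s+", " ", text)

collapse_whitespace("Hello   world,\nit  is me.")
# "Hello world, it is me."
```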
The issue isn't double spaces, it's single spaces being inserted where they shouldn't be. For example, the sentence:
Hello world, it is me.
Gets turned into:
Hello world , it is me .
How do n-grams address the end of a sentence?
We could also do something like `text.replace(' .', '.')`
n-grams are agnostic to the ends of sentences as long as there's a following one. But that is a good fix idea.
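Generalizing that `replace` fix to all common sentence-final marks, a quick sketch:

```python
def fix_spacing(text):
    # Chain replacements to drop the stray space word_tokenize
    # inserts before each punctuation mark.
    for mark in [".", ",", "!", "?", ";", ":"]:
        text = text.replace(" " + mark, mark)
    return text

fix_spacing("Hello world , it is me .")
# "Hello world, it is me."
```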
There are a few issues with the `clean()` and `tokenize()` methods:

`clean`:
1) It replaces `-` with an empty string, which has the effect of concatenating words (e.g., `hello--world` becomes `helloworld`). We should replace `-` with a space instead.

`tokenize`:
1) This should exclude sentences that are just chapter headings (including roman numerals)
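Both fixes can be sketched briefly. This is an illustration only: the function names and the roman-numeral heuristic are mine, not from the actual `clean()`/`tokenize()` code, and the heading check will also flag stray words like "mix" that happen to consist of roman-numeral letters:

```python
import re

def clean_hyphens(text):
    # Replace runs of hyphens with a space so "hello--world"
    # becomes "hello world" rather than "helloworld".
    return re.sub(r"-+", " ", text)

# A "sentence" that is only a roman numeral, optionally
# preceded by "Chapter", is treated as a heading.
ROMAN_HEADING = re.compile(r"^(chapter\s+)?[ivxlcdm]+\.?$", re.IGNORECASE)

def is_chapter_heading(sentence):
    return bool(ROMAN_HEADING.match(sentence.strip()))
```

`tokenize()` could then simply skip any sentence for which `is_chapter_heading` returns `True` before further processing.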