Closed by grofte 5 years ago
Unfortunately, spellchecker on PyPI is not my project, so I cannot delete it or anything. If the documentation can be made clearer on installation, I am happy to update it.
No, I just thought it might have been an errant creation of yours.
There is something weird with pyspellchecker though. If I do something like this:
import re
import contractions as con
from spellchecker import SpellChecker
from nltk import TweetTokenizer


def make_spellchecker():
    """
    Initialise spellchecker object with a dictionary based on the words in the pre-trained embeddings
    """
    spell = SpellChecker(language=None, local_dictionary='../../../data/external/custom_spell.json')
    return spell


def make_tokenizer():
    """
    Initialise tokenizer
    """
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    return tokenizer


def tokenize1(string, spellcheck, tokenizer, spell):
    """
    takes string input and tokenizes into a list of strings.
    Examples
    >>> tokenize1('MC Hammer (He is credited as Hammer) portrays a druglord...so you do a math. Bruce Payne', True, make_tokenizer(), make_spellchecker())
    ['mc', 'hammer', '(', 'he', 'is', 'credited', 'as', 'hammer', ')', 'portrays', 'a', 'druglord', '...', 'so', 'you', 'do', 'a', 'math', '.', 'bruce', 'payne']
    """
    # drop double quotes
    string = re.sub('"', '', string)
    # protect possessive 's so it survives tokenization
    string = re.sub("'s ", " _possessivetag_ ", string)
    # insert a space after '.' or ',' when a word character follows directly
    string = re.sub(r'(?<=[.,])(?=\w)', ' ', string)
    string = re.sub("--", ' -- ', string)
    #string = re.sub(r'(?P<rep>.)(?P=rep){3,}', r'\g<rep>\g<rep>\g<rep>', string)
    # pad numbers and '+' with spaces so they become separate tokens
    string = re.sub(r"\d+|\+", r" \g<0> ", string)
    string = con.fix(string)
    tokens = tokenizer.tokenize(string)
    if spellcheck:
        # only tokens longer than 5 characters are spell-corrected
        tokens = [spell.correction(token) if len(token) > 5 else token for token in tokens]
    tokens = ["'s" if token == "_possessivetag_" else token for token in tokens]
    return tokens
and call it on a few thousand strings of maybe 500 characters each:
# create objects for processing strings
tokenizer = make_tokenizer()
spell = make_spellchecker()

# tokenize
df['review_tokens'] = ''
df['summary_tokens'] = ''
for i in range(len(df)):
    df.at[i, "review_tokens"] = tokenize1(df.at[i, "reviewText"], True, tokenizer, spell)
    df.at[i, 'summary_tokens'] = tokenize1(df.at[i, "summary"], True, tokenizer, spell)
print("Done tokenizing.")
Then there's something that balloons in memory and crashes everything. It's fine with spellcheck set to False. Is there any reason that the spell object might grow in size when it gets called?
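One way to narrow this down would be to measure the peak allocation of each call and see which inputs are the expensive ones. The following is only a minimal sketch using the standard library's tracemalloc; the helper names peak_bytes and worst_inputs are made up for illustration, and fake_correction is a stand-in for the real spell.correction call, not part of pyspellchecker.

```python
import tracemalloc


def peak_bytes(func, *args, **kwargs):
    """Run func once and return (result, peak bytes allocated during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def worst_inputs(func, inputs, top=3):
    """Return (peak, input) pairs for the `top` most expensive inputs, largest first."""
    measured = [(peak_bytes(func, item)[1], item) for item in inputs]
    measured.sort(key=lambda pair: pair[0], reverse=True)
    return measured[:top]


if __name__ == "__main__":
    # Hypothetical stand-in for spell.correction: allocation grows with token length.
    def fake_correction(token):
        return [token] * (len(token) ** 2)

    for peak, token in worst_inputs(fake_correction, ["short", "a" * 200, "medium-ish"]):
        print(f"{peak:>10} bytes  {token[:20]!r}")
```

Swapping fake_correction for the real correction call on a sample of tokens should show whether a handful of tokens account for most of the memory.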
Nope, unfortunately it isn't just an errant package on my part.
As for your other issue here, there is nothing that I can figure that would increase the memory of the spell object based on usage.
Okay. On each iteration of the loop the memory usage goes up and down. In the beginning it rises by maybe 3-7 GB, but in the end it eats up 14 GB and I run out.
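Memory that rises and falls on every call would be consistent with large temporary candidate sets rather than an object that keeps growing. As a rough illustration only, assuming a Norvig-style corrector that enumerates every string within two edits of a token (I have not checked this against pyspellchecker's internals), the number of candidates grows quickly with token length. The functions edits1 and edits2_count below are written for this sketch and are not the library's API.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All distinct strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def edits2_count(word, alphabet=string.ascii_lowercase):
    """Number of distinct strings reachable by applying edits1 twice."""
    seen = set()
    for first in edits1(word, alphabet):
        seen.update(edits1(first, alphabet))
    return len(seen)


if __name__ == "__main__":
    # Longer tokens produce many more candidates, all held in memory at once.
    for token in ("spel", "speling", "spelingg"):
        print(token, len(edits1(token)), edits2_count(token))
```

In this sketch a larger alphabet also multiplies the counts, so a dictionary containing many distinct characters would make each call more expensive still. If that is what is happening, capping the token length passed to the corrector or using an edit distance of 1 may be worth trying.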
If you accidentally install spellchecker instead of pyspellchecker you get something installed and it's version 0.4.0 but it just doesn't work. Is it yours? Can you remove it so people don't accidentally install the wrong library? It's an easy mistake to make when the import command is for spellchecker.
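Since the two distributions are easy to confuse, a quick check of which one is actually installed can help. This is a minimal sketch using importlib.metadata from the standard library (Python 3.8+); installed_version is a helper name invented here, and the two distribution names are the ones mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report each distribution name separately so the two are not confused.
    for name in ("pyspellchecker", "spellchecker"):
        print(name, "->", installed_version(name) or "not installed")
```

If spellchecker shows up and pyspellchecker does not, uninstalling the former and installing the latter should be the fix.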