barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

spellchecker in pypi #42

Closed grofte closed 5 years ago

grofte commented 5 years ago

If you accidentally install spellchecker instead of pyspellchecker, you get a package at version 0.4.0 that just doesn't work. Is it yours? Can you remove it so people don't install the wrong library by mistake? It's an easy mistake to make, since the import statement uses spellchecker.

barrust commented 5 years ago

Unfortunately, spellchecker on PyPI is not my project, so I cannot delete it or change it in any way. If the documentation can be made clearer about installation, I am happy to update it.
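
For anyone hitting the same mix-up, here is a minimal sketch of a check for which of the two PyPI distributions is present in the current environment. It uses only the standard library (`importlib.metadata`, Python 3.8+). The helper names `installed_version` and `diagnose` are made up for this illustration and are not part of pyspellchecker.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def diagnose(pyspellchecker_version, spellchecker_version):
    """Classify the environment from the two version lookups (hypothetical helper)."""
    if pyspellchecker_version and spellchecker_version:
        return "both installed"
    if spellchecker_version:
        return "wrong package"
    if pyspellchecker_version:
        return "ok"
    return "not installed"


if __name__ == "__main__":
    status = diagnose(
        installed_version("pyspellchecker"),
        installed_version("spellchecker"),
    )
    print(status)
```

If the result is anything other than "ok", uninstalling `spellchecker` and then installing `pyspellchecker` with pip should leave only the intended package behind.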

grofte commented 5 years ago

No, I just thought it might have been an errant creation of yours.

There is something weird with pyspellchecker, though. If I do something like this:

import re
import contractions as con
from spellchecker import SpellChecker
from nltk import TweetTokenizer

def make_spellchecker():
    """
    Initialise spellchecker object with a dictionary based on the words in the pre-trained embeddings
    """
    spell = SpellChecker(language=None, local_dictionary='../../../data/external/custom_spell.json')
    return spell 

def make_tokenizer():
    """
    Initialise tokenizer
    """
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    return tokenizer 

def tokenize1(string, spellcheck, tokenizer, spell):
    """
    takes string input and tokenizes into a list of strings. 

    Examples
    >>> tokenize1('MC Hammer (He is credited as Hammer) portrays a druglord...so you do a math. Bruce Payne', True, make_tokenizer(), make_spellchecker())
    ['mc', 'hammer', '(', 'he', 'is', 'credited', 'as', 'hammer', ')', 'portrays', 'a', 'druglord', '...', 'so', 'you', 'do', 'a', 'math', '.', 'bruce', 'payne']
    """

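    # raw strings keep regex escapes such as \d and \g<0> from being read as string escapes
    string = re.sub('"', '', string)                        # drop double quotes
    string = re.sub(r"'s ", " _possessivetag_ ", string)    # protect possessives
    string = re.sub(r'(?<=[.,])(?=\w)', ' ', string)        # space after . or , before a word character
    string = re.sub(r"--", ' -- ', string)                  # pad double hyphens
    # string = re.sub(r'(?P<rep>.)(?P=rep){3,}', r'\g<rep>\g<rep>\g<rep>', string)
    string = re.sub(r"\d+|\+", r" \g<0> ", string)          # pad numbers and plus signs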

    string = con.fix(string)

    tokens = tokenizer.tokenize(string)

    if spellcheck:
        tokens = [spell.correction(token) if len(token) > 5 else token for token in tokens]

    tokens = ["'s" if token == "_possessivetag_" else token for token in tokens]

    return tokens

and call it on a few thousand strings of maybe 500 characters each

# create objects for processing strings
tokenizer = make_tokenizer()
spell = make_spellchecker()

# tokenize
df['review_tokens'] = ''
df['summary_tokens'] = ''
for i in range(len(df)):
    df.at[i, "review_tokens"] = tokenize1(df.at[i, "reviewText"], True, tokenizer, spell)
    df.at[i, 'summary_tokens'] = tokenize1(df.at[i, "summary"], True, tokenizer, spell)
print("Done tokenizing.")

Then something balloons in memory and crashes everything. It's fine with spellcheck set to False. Is there any reason the spell object might grow in size each time it gets called?
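
One way to narrow this down is to separate memory that stays allocated after a call (which would mean some object is growing) from memory that is only used temporarily during the call. The following is a rough sketch using the standard library's `tracemalloc`. `measure` and `transient_heavy` are hypothetical names for this illustration, and `transient_heavy` is only a stand-in for a real call such as `tokenize1`.

```python
import tracemalloc


def measure(fn, *args):
    """Run fn(*args) and return (result, retained_bytes, peak_bytes).

    retained_bytes: Python allocations still alive after the call returns.
    peak_bytes: highest allocation level reached during the call.
    Both are relative to the level just before the call.
    """
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    result = fn(*args)
    after, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, after - before, peak - before


def transient_heavy(n):
    """Stand-in workload: builds a large temporary list, returns only its length."""
    scratch = [str(i) for i in range(n)]
    return len(scratch)


if __name__ == "__main__":
    result, retained, peak = measure(transient_heavy, 100_000)
    print(f"result={result} retained={retained} B peak={peak} B")
```

Calling `measure(tokenize1, text, True, tokenizer, spell)` on a few rows would show which case applies: a small retained value with a large peak points at temporary allocations inside the call rather than at a growing spell object. Note that `tracemalloc` only sees allocations made through Python's allocator.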

barrust commented 5 years ago

Nope, unfortunately it isn't just an errant package on my part.

As for your other issue here, I can't think of anything that would increase the memory used by the spell object based on usage.

grofte commented 5 years ago

Okay. On each iteration of the loop the memory usage goes up and down. In the beginning it goes up by maybe 3-7 GB, but by the end it eats up 14 GB and I run out.
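
The thread does not settle on a cause. One possibility worth checking, on the assumption that candidates are generated in the style of Peter Norvig's spelling corrector (every string within edit distance 2 of the input), is that long tokens produce very large temporary candidate sets. That would match memory going up and down on each call. The sketch below is not pyspellchecker's code. It is a self-contained illustration of how the number of candidates grows with word length, and the function names are chosen for this example.

```python
import string

ALPHABET = string.ascii_lowercase


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings reachable by applying edits1 twice."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def count_candidates(word):
    """Return (number of distance-1 strings, number of distance-2 strings)."""
    return len(edits1(word)), len(edits2(word))


if __name__ == "__main__":
    for word in ("cat", "hammer", "druglord"):
        one, two = count_candidates(word)
        print(f"{word!r}: {one} strings at distance 1, {two} at distance 2")
```

Because the distance-2 set grows roughly with the square of the word length, a single very long token (a URL, a run of repeated characters) can dominate the memory use of a whole batch. If that turns out to be the cause here, skipping tokens above some length before calling `spell.correction` would be a cheap thing to try.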