amueller / introduction_to_ml_with_python

Notebooks and code for the book "Introduction to Machine Learning with Python"

Tokenizer attribute .tokens_from_list deprecated #152

Open fishcakebaker opened 3 years ago

fishcakebaker commented 3 years ago

The tokenizer attribute .tokens_from_list has been deprecated in spaCy.

This is used in Chapter 7, Section 7.8 "Advanced Tokenization, Stemming, and Lemmatization", in block In[39].

I'm using spaCy version 3.0.6, which I'm guessing is several versions newer than the one used in the book; I just can't find the version noted anywhere in my copy.

Any suggestions for working around this function? I'm a bit of a newbie, and my online searches have led down rabbit holes so far.

Tanvi09Garg commented 3 years ago

Instead of using old_tokenizer.tokens_from_list, you can substitute any custom tokenizer that does the correct input -> Doc conversion with the correct vocab, and assign it to nlp.tokenizer:

from typing import List, Union

from spacy.tokens import Doc
from spacy.vocab import Vocab


class _PretokenizedTokenizer:
    """Custom tokenizer to be used in spaCy when the text is already pretokenized."""

    def __init__(self, vocab: Vocab):
        """Initialize the tokenizer with a given vocab.

        :param vocab: an existing vocabulary
        """
        self.vocab = vocab

    def __call__(self, inp: Union[List[str], str]) -> Doc:
        """Call the tokenizer on input inp.

        :param inp: either a string to be split on whitespace, or a list of tokens
        :return: the created Doc object
        """
        if isinstance(inp, str):
            words = inp.split()
            spaces = [True] * (len(words) - 1) + ([True] if inp[-1].isspace() else [False])
            return Doc(self.vocab, words=words, spaces=spaces)
        elif isinstance(inp, list):
            return Doc(self.vocab, words=inp)
        else:
            raise ValueError("Unexpected input format. Expected string to be split on whitespace, or list of tokens.")
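
For completeness, a minimal usage sketch (assuming the en_core_web_sm model is installed; depending on your spaCy version, nlp() may only accept strings, so the pretokenized-list case is shown via a direct tokenizer call):

import spacy

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = _PretokenizedTokenizer(nlp.vocab)

# plain strings are split on whitespace by the custom tokenizer
doc = nlp("the quick brown foxes are jumping")
print([token.lemma_ for token in doc])

# an already-tokenized list can be turned into a Doc directly
doc = nlp.tokenizer(["the", "quick", "brown", "foxes"])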

Tanvi09Garg commented 3 years ago

List here comes from the typing annotation: the tokenizer accepts either a plain input string or a list of already-split tokens.

ypauchard commented 1 year ago

Probably similar to @Tanvi09Garg's approach, here is what works for me:

import re
import spacy
from spacy.tokens import Doc

# regexp used as CountVectorizer's default token_pattern
# (?u) sets the unicode flag, i.e. the pattern matches unicode text
# \b word boundary: the position between a word character and a non-word character
# \w word character: [0-9a-zA-Z_], plus unicode letters/digits when (?u) is set

class RegexTokenizer:
    """spaCy custom tokenizer using the CountVectorizer regexp.
    Reference: https://spacy.io/usage/linguistic-features#custom-tokenizer
    """
    def __init__(self, vocab, regex_pattern='(?u)\\b\\w\\w+\\b'):
        self.vocab = vocab
        self.regexp = re.compile(regex_pattern)

    def __call__(self, text):
        words = self.regexp.findall(text)
        spaces = [True] * len(words)
        if spaces:
            spaces[-1] = False  # no space after the last word

        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp.tokenizer = RegexTokenizer(nlp.vocab)

def custom_tokenizer(document):
    doc_spacy = nlp(document)
    return [token.lemma_ for token in doc_spacy]

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(tokenizer=custom_tokenizer)
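
A quick sanity check on a made-up two-document corpus. Note that newer scikit-learn versions warn when a custom tokenizer is combined with the default token_pattern; passing token_pattern=None silences that, and get_feature_names_out needs scikit-learn >= 1.0:

# toy corpus, purely illustrative
docs = ["The quick brown foxes are jumping over the lazy dogs.",
        "Dogs were barking while the foxes jumped."]

vect = CountVectorizer(tokenizer=custom_tokenizer, token_pattern=None)
X = vect.fit_transform(docs)
print(vect.get_feature_names_out())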

It runs a bit slowly, though. Any suggestions to speed this up?
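
My best guess at the bottleneck: CountVectorizer calls custom_tokenizer, and with it the full spaCy pipeline, once per document. An untested sketch of a batched variant using nlp.pipe (docs stands in for the list of raw texts):

# lemmatize the whole corpus in batches with nlp.pipe instead of
# running the pipeline once per document inside CountVectorizer
def lemmatize_corpus(documents, batch_size=64):
    return [[token.lemma_ for token in doc]
            for doc in nlp.pipe(documents, batch_size=batch_size)]

lemmas = lemmatize_corpus(docs)

# a callable analyzer makes CountVectorizer consume the
# pre-lemmatized token lists as-is, skipping its own tokenization
vect = CountVectorizer(analyzer=lambda tokens: tokens)
X = vect.fit_transform(lemmas)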