fishcakebaker opened this issue 3 years ago
Instead of using `old_tokenizer.tokens_from_list`, you can substitute any custom tokenizer that does the correct input -> `Doc` conversion (with the correct vocab) for `nlp.tokenizer`:
```python
from typing import List, Union

from spacy.tokens import Doc
from spacy.vocab import Vocab


class _PretokenizedTokenizer:
    """Custom tokenizer to be used in spaCy when the text is already pretokenized."""

    def __init__(self, vocab: Vocab):
        """Initialize the tokenizer with a given vocab.

        :param vocab: an existing vocabulary (e.g. nlp.vocab)
        """
        self.vocab = vocab

    def __call__(self, inp: Union[List[str], str]) -> Doc:
        """Call the tokenizer on input `inp`.

        :param inp: either a string to be split on whitespace, or a list of tokens
        :return: the created Doc object
        """
        if isinstance(inp, str):
            words = inp.split()
            spaces = [True] * (len(words) - 1) + ([True] if inp[-1].isspace() else [False])
            return Doc(self.vocab, words=words, spaces=spaces)
        elif isinstance(inp, list):
            return Doc(self.vocab, words=inp)
        else:
            raise ValueError(
                "Unexpected input format. Expected string to be split on whitespace, or list of tokens."
            )
```
Here `List` is just the type hint from the `typing` module: `inp` can be either a raw string (split on whitespace) or a list of already-tokenized strings.
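For completeness, a minimal usage sketch (not from the thread; it assumes `en_core_web_sm` is installed and the example sentences are made up). It shows both the whitespace-string path and calling the tokenizer directly on a token list:

```python
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
nlp.tokenizer = _PretokenizedTokenizer(nlp.vocab)

# String input: the custom tokenizer splits on whitespace instead of spaCy's rules,
# and the rest of the pipeline (tagger, lemmatizer, ...) still runs.
doc = nlp("This text was already tokenized somewhere else .")
print([(t.text, t.lemma_) for t in doc])

# List input: the tokenizer can also be called directly to wrap existing tokens in a Doc
# (no pipeline components are run in this case, only tokenization).
doc = nlp.tokenizer(["Token", "lists", "work", "too", "."])
print([t.text for t in doc])
```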
Probably similar to @Tanvi09Garg, here is what works for me:
```python
import re

import spacy
from spacy.tokens import Doc

# regexp used in CountVectorizer:
#   (?u) sets the unicode flag, i.e. patterns are unicode
#   \b   word boundary: the end of a word is indicated by whitespace or a non-alphanumeric character
#   \w   alphanumeric: [0-9a-zA-Z_]


class RegexTokenizer:
    """spaCy custom tokenizer.

    Reference: https://spacy.io/usage/linguistic-features#custom-tokenizer
    """

    def __init__(self, vocab, regex_pattern='(?u)\\b\\w\\w+\\b'):
        self.vocab = vocab
        self.regexp = re.compile(regex_pattern)

    def __call__(self, text):
        words = self.regexp.findall(text)
        spaces = [True] * len(words)
        if spaces:
            spaces[-1] = False  # no space after the last word
        return Doc(self.vocab, words=words, spaces=spaces)


nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp.tokenizer = RegexTokenizer(nlp.vocab)


def custom_tokenizer(document):
    """Tokenize with the regexp, run the spaCy pipeline, and return lemmas."""
    doc_spacy = nlp(document)
    return [token.lemma_ for token in doc_spacy]


from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(tokenizer=custom_tokenizer)
```
It runs a bit slow, though; any suggestions to speed this up?
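One thing that usually helps (a sketch, not from the original thread; the corpus and variable names below are made up): lemmatize the whole corpus in batches with `nlp.pipe` instead of calling `nlp()` once per document inside the tokenizer callback, then hand the pre-lemmatized token lists to `CountVectorizer` with identity tokenizer/preprocessor:

```python
# Batch-lemmatize the corpus once; nlp.pipe is much faster than per-document nlp() calls.
corpus = [
    "The quick brown foxes were jumping over the lazy dogs.",
    "Tokenizers and lemmatizers can be swapped out in spaCy pipelines.",
]

lemmatized = [
    # lowercase here, because a custom preprocessor bypasses CountVectorizer's own lowercasing
    [token.lemma_.lower() for token in doc]
    for doc in nlp.pipe(corpus, batch_size=64)
]

# CountVectorizer can consume the pre-tokenized lists directly when given
# identity functions for tokenizer and preprocessor.
vect = CountVectorizer(tokenizer=lambda x: x, preprocessor=lambda x: x, token_pattern=None)
X = vect.fit_transform(lemmatized)
print(X.shape)
print(sorted(vect.vocabulary_))
```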
The tokeniser method `.tokens_from_list` has been deprecated in spaCy and is no longer available in recent versions. It is used in Chapter 7, Section 7.8, "Advanced Tokenisation, Stemming and Lemmatization", in block In[39].
I'm using spaCy version 3.0.6, which I'm guessing is several versions newer than the one the book was written against; I just can't find the version listed in my copy.
Any suggestions for getting around this function? I'm a bit of a newbie, but my searches online have led to rabbit holes so far.
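For the book's In[39] specifically, the fix is essentially what spaCy's deprecation message recommends: build a `Doc` from the token list directly instead of calling `tokens_from_list`. A sketch, assuming the variable names used in the book (`en_nlp`, `regexp`) and that `en_core_web_sm` is installed:

```python
import re

import spacy
from spacy.tokens import Doc

# regexp used by CountVectorizer, as in the book
regexp = re.compile('(?u)\\b\\w\\w+\\b')

en_nlp = spacy.load('en_core_web_sm')

# In[39] roughly did:
#   old_tokenizer = en_nlp.tokenizer
#   en_nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(regexp.findall(string))
# In spaCy 3.x, build the Doc from the token list directly instead:
en_nlp.tokenizer = lambda string: Doc(en_nlp.vocab, words=regexp.findall(string))

doc = en_nlp("Tokenizing and lemmatizing sentences with a replaced tokenizer.")
print([token.lemma_ for token in doc])
```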