PacktPublishing / Mastering-spaCy

Mastering spaCy, published by Packt
MIT License

Lemmatization in NLU #4

Closed · zebrassimo closed this 3 years ago

zebrassimo commented 3 years ago

Hi Duygu, in the book Mastering spaCy, page 46, chapter 2, we have the following code:

import spacy
from spacy.symbols import ORTH, LEMMA
nlp=spacy.load('en')
special_case = [{ORTH: "Angeltown", LEMMA: "Los Angeles"}]
nlp.tokenizer.add_special_case(u'Angeltown',special_case)
doc=nlp(u'I am flying to Angeltown')
for token in doc:
    print(token.text, token.lemma_)

Trying to download the language model en with

python -m spacy download en

doesn't work:

__As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the full pipeline package name 'en_core_web_sm' instead.__
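For reference, the v3 way would be to use the full pipeline package name instead of the shortcut (a minimal sketch, assuming the small English pipeline en_core_web_sm is the one wanted):

python -m spacy download en_core_web_sm

import spacy

# load the pipeline by its full package name; the 'en' shortcut no longer works in v3
nlp = spacy.load('en_core_web_sm')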

Not sure if it is related to the v3 changes, but

special_case = [{ORTH: "Angeltown", LEMMA: "Los Angeles"}]
nlp.tokenizer.add_special_case(u'Angeltown', special_case)

fails with: Unable to set attribute 'LEMMA' in tokenizer exception for 'Angeltown'. Tokenizer exceptions are only allowed to specify ORTH and NORM. Because of that, the LEMMA special case from the book can no longer be used.

However, this gets the job done:

import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.load('en_core_web_md')
# in v3, tokenizer special cases may only override ORTH and NORM
special_case = [{ORTH: "Angeltown", NORM: "Los Angeles"}]
nlp.tokenizer.add_special_case('Angeltown', special_case)
doc = nlp('I am flying to Angeltown')
for token in doc:
    print(token.text, token.lemma_)
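For completeness: if the goal is to change the lemma itself (not just the norm), the v3 way appears to be the attribute_ruler component rather than a tokenizer special case. A minimal sketch, assuming en_core_web_md is installed and its pipeline contains an attribute_ruler:

import spacy

nlp = spacy.load('en_core_web_md')
# the attribute ruler maps token patterns to attribute overrides, including LEMMA
ruler = nlp.get_pipe('attribute_ruler')
ruler.add(patterns=[[{"ORTH": "Angeltown"}]], attrs={"LEMMA": "Los Angeles"})
doc = nlp('I am flying to Angeltown')
for token in doc:
    print(token.text, token.lemma_)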

Thanks for the good book!

DuyguA commented 3 years ago

I wrote some chapters before v3 and made a second pass over the code after v3 came out, but some of it might have escaped my attention. I also changed the language model shortcuts throughout the book, and some of those might have slipped through as well. Thanks for reporting!

RAravindDS commented 2 years ago


Hi, this is not working properly. We changed it to NORM, not LEMMA, so the whole topic changes: the norm gets overridden, but the lemma does not.
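To see what the NORM special case actually changes, one can print both attributes side by side (a quick sketch, assuming the same en_core_web_md pipeline as in the workaround above):

import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.load('en_core_web_md')
special_case = [{ORTH: "Angeltown", NORM: "Los Angeles"}]
nlp.tokenizer.add_special_case('Angeltown', special_case)
doc = nlp('I am flying to Angeltown')
for token in doc:
    # the special case only sets the norm; the lemma still comes from the lemmatizer
    print(token.text, token.norm_, token.lemma_)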


https://stackoverflow.com/questions/66360602/spacy-tokenizer-lemma-and-orth-exceptions-not-working