MartinoMensio / spacy-universal-sentence-encoder

Google USE (Universal Sentence Encoder) for spaCy
MIT License

KeyError when added as a pipe #14

Closed · debraj135 closed this issue 3 years ago

debraj135 commented 3 years ago

While the first and second options in the README work (sketched below for comparison), the third option

import spacy
# this is your nlp object that can be any spaCy model
nlp = spacy.load('en_core_web_sm')

# add the pipeline stage (will be mapped to the most adequate model from the table above, en_use_md)
nlp.add_pipe('universal_sentence_encoder')

throws a KeyError of the form:

File "/.../spacy_universal_sentence_encoder/language.py", line 77, in use_model_factory
    config = util.configs[model_name]
KeyError: 'en_core_web_sm'
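For comparison, the first two options, which do work, look roughly like this (a sketch only; the model names en_use_md and en_use_lg come from the README table, and load_model is my reading of the README's second option):

# Option 1 (sketch): load one of the standalone USE models installed as a package
import spacy
nlp = spacy.load('en_use_md')

# Option 2 (sketch): use the loader provided by the package itself
import spacy_universal_sentence_encoder
nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')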

Is this behavior expected?

Thank you for any help.

MartinoMensio commented 3 years ago

Hi @debraj135, thank you for opening the issue. This is not the expected behaviour: there was a small bug, which is now fixed. Please install the updated version, v0.4.1, and it should work as expected.
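As a quick sanity check after upgrading (pip install --upgrade spacy-universal-sentence-encoder), something like the following should print a 512-dimensional vector instead of raising the KeyError; this is a sketch, with en_use_md being the model the pipe maps to for English per the README table:

import spacy

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('universal_sentence_encoder')

doc = nlp('Hi there, how are you?')
print(doc.vector.shape)  # expected (512,), the USE embedding size, once the pipe maps to en_use_md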

Best, Martino

debraj135 commented 3 years ago

Thank you, this works now. I believe I have encountered another issue, outlined in the snippet below:

>>> import spacy
>>> nlp = spacy.load('en_core_web_lg')
>>> nlp.add_pipe('universal_sentence_encoder')
<spacy_universal_sentence_encoder.language.UniversalSentenceEncoder object at 0x1f65fc080>
>>> doc = nlp('Hi there, how are you?')
>>> doc.vector.shape
(512,)
>>> doc[:5].vector.shape
(512,)
>>> doc[:1].vector.shape
(300,)
>>> doc[0].similarity(doc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/token.pyx", line 212, in spacy.tokens.token.Token.similarity
  File "<__array_function__ internals>", line 6, in dot
ValueError: shapes (300,) and (512,) not aligned: 300 (dim 0) != 512 (dim 0)

Shouldn't the token vectors have the same dimensionality as the span and doc vectors?

MartinoMensio commented 3 years ago

Thanks for spotting this issue. It is a consequence of the change made for https://github.com/MartinoMensio/spacy-universal-sentence-encoder/issues/13

The underlying model has those tokens in its vocabulary, so the array of shape (300,) comes from en_core_web_lg's own word vectors rather than from the Universal Sentence Encoder. This is not the expected behaviour. I will provide a fix; in the meantime, if you switch back to v0.4.1 it should work as expected.
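To illustrate where the two sizes come from (a sketch: the (512,) vector is the USE sentence embedding, while the (300,) vector is the static word vector that en_core_web_lg stores in its vocabulary):

import spacy

nlp = spacy.load('en_core_web_lg')
nlp.add_pipe('universal_sentence_encoder')

doc = nlp('Hi there, how are you?')
print(doc.vector.shape)              # (512,) from the Universal Sentence Encoder
print(nlp.vocab['Hi'].vector.shape)  # (300,) from en_core_web_lg's static word vectors
# doc[0].vector currently falls back to that (300,) lexeme vector, hence the shape mismatch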

Thank you for your patience. Best, Martino

debraj135 commented 3 years ago

Thank you, appreciate your prompt response!