MartinoMensio / spacy-universal-sentence-encoder

Google USE (Universal Sentence Encoder) for spaCy
MIT License

sentencize & encode #4

Closed: rxjx closed this issue 4 years ago

rxjx commented 4 years ago

Hi, I'm trying to set up a pipeline so that I can feed it a chunk of text, have spaCy sentencize it in the first pipeline component, and then embed each sentence with USE in an `overwrite_vectors` component. However, I just cannot get it to work. Is there a canonical way of setting up such a pipeline? Or would it be just as efficient to sentencize into an array of sentences and then embed them one by one? USE also has a batch embed, so maybe that's a better option. What would you recommend?

MartinoMensio commented 4 years ago

Hi! For batch processing, the model I'm relying on (https://tfhub.dev/google/universal-sentence-encoder/4) has an `embed` method that works on batches. In this repository I'm currently calling it with one single sentence at a time, as you can see here: https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub/blob/b90d667b4fef8386d224534f900cc7ab2a53888a/spacy_universal_sentence_encoder/language.py#L119 That is surely a performance bottleneck. spaCy itself has the `pipe` method (https://spacy.io/api/language#pipe) that improves the performance of batch processing, but I still haven't looked at how to make the `overwrite_vectors` component batch-friendly and route the calls to the Universal Sentence Encoder in batches.
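To show what I mean by batching, here is a minimal sketch of calling the TF Hub model's `embed` on a whole list of sentences in a single call, outside of spaCy (assuming `tensorflow` and `tensorflow_hub` are installed):

import tensorflow_hub as hub

# load the same TF Hub module used by this package
embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')

sentences = ['First sentence.', 'Second one.', 'And a third.']
vectors = embed(sentences)  # a single call embeds the whole batch
print(vectors.shape)  # (3, 512): one 512-dimensional vector per sentence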

I would suggest investigating how the yielding of docs works in the spaCy `pipe` method, and whether there is any documentation on making custom components (https://spacy.io/usage/processing-pipelines#custom-components) batch-friendly. If you discover anything useful, please let me know, because I will have to face the same problem soon in my own experiments.
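If I understand the docs correctly, `nlp.pipe` uses a component's own `pipe` method when it defines one, so a batch-friendly component could look roughly like the following. This is an untested sketch under that assumption, not the current `overwrite_vectors`:

import numpy
from spacy.util import minibatch

class BatchOverwriteVectors:
    # untested sketch: embed a whole minibatch of docs with one USE call
    name = 'batch_overwrite_vectors'

    def __init__(self, embed_fn, batch_size=32):
        self.embed_fn = embed_fn  # e.g. the TF Hub model's `embed`
        self.batch_size = batch_size

    def __call__(self, doc):
        # single-doc path used by nlp(text): a batch of one
        vector = numpy.asarray(self.embed_fn([doc.text]))[0]
        doc.user_hooks['vector'] = lambda d, v=vector: v
        return doc

    def pipe(self, stream, batch_size=None):
        # nlp.pipe calls this method instead of __call__ when it exists
        for batch in minibatch(stream, size=batch_size or self.batch_size):
            batch = list(batch)
            vectors = numpy.asarray(self.embed_fn([d.text for d in batch]))
            for doc, vector in zip(batch, vectors):
                doc.user_hooks['vector'] = lambda d, v=vector: v
                yield doc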

For a non-optimised version, which embeds each sentence with a separate call to USE, you can try the following piece of code:

import spacy
import spacy_universal_sentence_encoder  # importing the package makes the 'overwrite_vectors' factory available
nlp = spacy.load('en_core_web_lg')

# for sentencization you have these two main options:
# 1. Rely on the statistical parser, which also splits sentences.
#    Consider that `en_core_web_sm`, `en_core_web_md` and `en_core_web_lg` perform differently.
# 2. Use the rule-based sentencizer (https://spacy.io/api/sentencizer), which is better for
#    articles that have no errors in punctuation (e.g. no missing periods).
#    In this case you need to load the sentencizer as the first stage of the pipeline
if 'sentencizer' not in nlp.pipe_names:
    sentencizer = nlp.create_pipe('sentencizer')
    nlp.add_pipe(sentencizer, first=True)

# get the pipeline component
overwrite_vectors = nlp.create_pipe('overwrite_vectors')
# add the stage to your nlp pipeline
nlp.add_pipe(overwrite_vectors)

# Then pick the USE model of your choice.
# This could be done by loading a packaged model directly: `nlp = spacy.load('en_use_md')`
# but in this case we want to keep the `en_core_web_lg` model,
# so we add the USE model manually with a pipeline stage

def set_tfhub_model_url(doc):
    doc._.tfhub_model_url = 'https://tfhub.dev/google/universal-sentence-encoder-large/5'
    return doc

# add this pipeline component before `overwrite_vectors`, because that stage reads the extension
nlp.add_pipe(set_tfhub_model_url, before='overwrite_vectors')

# Then use the sentencizer to split the text into sentences
doc = nlp('This is a document. It is made of multiple sentences. This last sentence is saying something.')
sents = list(doc.sents)

# and use in some way the vectors of the sentences
for s in sents:
    vector = s.vector
    print('sentence', s, 'has a vector with size', vector.shape)
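
If you then have many texts, you can still stream them through `nlp.pipe` (even though, as said above, each sentence currently triggers a separate call to USE):

texts = ['First document. It has two sentences.', 'Second document.']
for doc in nlp.pipe(texts):
    for sent in doc.sents:
        print(sent.text, '->', sent.vector.shape)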

This full code is simplified a lot if you are OK with the sentence splitting of `en_core_web_sm` and with the smaller USE model (https://tfhub.dev/google/universal-sentence-encoder/4):

import spacy
nlp = spacy.load('en_use_md')
# nlp already has the parser, so we rely on the sentence splitting it provides

# just use the language directly
doc = nlp('This is a document. It is made of multiple sentences. This last sentence is saying something.')
sents = list(doc.sents)
# and use in some way the vectors of the sentences
for s in sents:
    vector = s.vector
    print('sentence', s, 'has a vector with size', vector.shape)

rxjx commented 4 years ago

Thank you for the quick response. I had already tried using spaCy and TF separately, so I was aware that TF's embed accepts a list of sentences. Also, I do need to use en_core_web_lg, so your information helps. I'll experiment a bit more, but I suspect I might just go back to sentencizing separately, because my use case might require a bit of intermediate processing anyway.
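For reference, the separate approach I have in mind would be roughly this (a rough, untested sketch):

import spacy
import tensorflow_hub as hub

nlp = spacy.load('en_core_web_lg')
embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')

doc = nlp('This is a document. It is made of multiple sentences.')
# any intermediate processing of the sentences would go here
sentences = [sent.text.strip() for sent in doc.sents]
vectors = embed(sentences)  # one batched USE call for all sentences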