Applied-Language-Technology / notebooks

Interactive Jupyter Notebooks for learning materials

Span embeddings with HuggingFace #8

Open ogarciasierra opened 3 years ago

ogarciasierra commented 3 years ago

Hi everyone. I was wondering if it is possible to get the same "span contextual embeddings" with a HuggingFace model. I've been able to generate token contextual embeddings (https://discuss.huggingface.co/t/generate-raw-word-embeddings-using-transformer-models-like-bert-for-downstream-process/2958), but cannot do it with spans. For example, in “three days ago I ate meat”, I would like to get contextual embeddings for “three days ago”, similar to how Tuomo does it with spaCy in the ALT blog.

Thanks everyone.

thiippal commented 3 years ago

Hi @ogarciasierra!

Just to make sure: I haven't really looked at doing this directly with HuggingFace Transformers, so I assume that you would like to extract contextual word embeddings for spans using spaCy?

ogarciasierra commented 3 years ago

Hi @thiippal

I would like to extract contextual embeddings for spans using Hugging Face Transformers or PyTorch. The main thing is to use a Hugging Face model to generate those embeddings. I don't care which library we use for extracting them :)

Thanks!

thiippal commented 3 years ago

Okay @ogarciasierra, one way to do this is to follow the process here.

  1. Create the custom component for assigning Transformer features to the vector attribute of spaCy Token/Span/Doc elements.
  2. Then simply take a slice of the Doc object containing the Span of interest and access the vector attribute.

A demo, which assumes that you've created the custom component and added it to the Transformer-powered spaCy pipeline:

meat = nlp_trf("three days ago I ate meat")
left = nlp_trf("We left Finland three days ago")

meat_span = meat[0:3]    # get the Span for "three days ago" by indexing Token positions
left_span = left[3:6]    # do the same for the second Doc ("three days ago" is Tokens 3, 4 and 5)

meat_span.similarity(left_span)     # calculate cosine similarity

This outputs 0.8840232. The two Spans have similar but not identical vectors, because each vector also incorporates information about the context in which the Span occurs.
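For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The plain-Python sketch below only illustrates the measure. The function name is hypothetical, returning 0.0 for a zero-length vector is an assumption made here to avoid division by zero, and this is not spaCy's internal implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: treat a zero vector as having no similarity to anything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score 1.0, orthogonal vectors score 0.0, and opposite vectors score -1.0.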

TL;DR: Just slice spaCy Docs and access the representation using the vector attribute.

ogarciasierra commented 3 years ago

Yes, I checked your code with spaCy before! But my question is how to do it with a Hugging Face model and its own embeddings. I am afraid those trf_data attributes are only available for spaCy models. The process works great in your spaCy tutorial, so I tried to do the same with a pretrained Hugging Face model. It's easy with just one token, but I wasn't able to do it with a span. Sorry to bother you again.
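One possible approach with Hugging Face Transformers alone, sketched here under assumptions rather than taken from this thread: ask a fast tokenizer for character offsets, pick the word pieces that overlap the span, and average their last-layer hidden states. The helper names (`span_token_indices`, `mean_pool`, `span_embedding`), the default model `bert-base-uncased`, and the choice of mean pooling are all illustrative. Other pooling strategies, or other layers, may suit a given task better.

```python
from typing import List, Sequence, Tuple


def span_token_indices(offsets: Sequence[Tuple[int, int]],
                       char_start: int, char_end: int) -> List[int]:
    """Indices of tokens whose character range overlaps [char_start, char_end)."""
    return [
        i for i, (tok_start, tok_end) in enumerate(offsets)
        # Special tokens such as [CLS] and [SEP] get the empty offset (0, 0).
        if tok_end > tok_start and tok_start < char_end and tok_end > char_start
    ]


def mean_pool(vectors: Sequence[Sequence[float]], indices: Sequence[int]) -> List[float]:
    """Average the selected rows, one value per hidden dimension."""
    if not indices:
        raise ValueError("the span does not overlap any token")
    selected = [vectors[i] for i in indices]
    return [sum(column) / len(selected) for column in zip(*selected)]


def span_embedding(text: str, span: str,
                   model_name: str = "bert-base-uncased") -> List[float]:
    """Contextual embedding for the first occurrence of `span` in `text`."""
    # Imported here so the two helpers above can be tried without these libraries.
    import torch
    from transformers import AutoModel, AutoTokenizer

    char_start = text.index(span)      # raises ValueError if the span is absent
    char_end = char_start + len(span)

    # Offsets are only available from fast tokenizers (the AutoTokenizer default).
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    encoded = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
    # The model does not accept offset_mapping, so remove it before the forward pass.
    offsets = [tuple(pair) for pair in encoded.pop("offset_mapping")[0].tolist()]
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]   # (num_tokens, hidden_size)

    return mean_pool(hidden.tolist(), span_token_indices(offsets, char_start, char_end))
```

With `torch` and `transformers` installed, `span_embedding("three days ago I ate meat", "three days ago")` should return one vector for the span, computed from hidden states that have seen the whole sentence. The model is loaded on every call to keep the sketch short. For repeated use, loading the tokenizer and model once would be the obvious change.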