davidberenstein1957 / concise-concepts

This repository contains an easy and intuitive approach to few-shot NER using most similar expansion over spaCy embeddings. Now with entity scoring.
MIT License

Question: How to use (external) transformer-based embeddings? #30

Closed: repodiac closed this issue 1 year ago

repodiac commented 1 year ago

Hi,

Your idea of "concise concepts" sounds really intriguing! However, I would like to use transformer-based embeddings. As far as I can see from the source code, you rely on (word, vector) pairs in a large list, as in GloVe or Word2Vec models, right?

So, how could one implement this using Hugging Face models, perhaps via spacy-transformers' tok2vec interface? Should I use the texts to be tagged to pretrain (i.e. fine-tune) an HF transformer model, and then create this list by tokenizing all words from the texts (perhaps dropping filler words and the like beforehand)? Afterwards I'd have the same setup as with the current models, I guess.

Or maybe I am completely off track :-)

repodiac commented 1 year ago

It makes sense to me, but does it make sense to anyone else? :)

repodiac commented 1 year ago

And again, after looking more closely at the underlying code, a "drop-in" replacement with, for instance, transformer-based embeddings (i.e. vectors) seems easiest by transforming/wrapping those into the gensim KeyedVectors format!?

It would be a great help if the documentation for custom embedding models stated this (i.e. "use KeyedVectors"), because neither the text nor the example makes it apparent that you do not need a ready-made gensim model!
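Something like this minimal sketch is what I have in mind (gensim 4.x assumed; the words and vectors here are just placeholders standing in for a transformer-derived vocabulary):

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder data standing in for a transformer-derived vocabulary:
words = ["apple", "banana", "cherry"]
vectors = np.random.rand(len(words), 768).astype(np.float32)

# Wrap the raw (word, vector) pairs in a gensim KeyedVectors object.
kv = KeyedVectors(vector_size=vectors.shape[1])
kv.add_vectors(words, vectors)

print(kv.most_similar("apple", topn=2))  # behaves like a pretrained gensim model
kv.save("transformer_vectors.kv")
```

(I haven't verified how exactly concise-concepts loads this, so treat it as a sketch.)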

davidberenstein1957 commented 1 year ago

Hi @repodiac, sorry for the late reply; I have been a bit busy during the past week. I guess you could process all texts to obtain mean/max-pooled embeddings for each of the words you want to pick up on, based on the previous contexts in which you found them, but this introduces significant overhead (having to embed a set of texts as a knowledge base for the KeyedVectors set).
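Roughly, the overhead I mean looks like this (a sketch with plain Hugging Face transformers; the model name, whitespace word splitting, and function are illustrative, not part of concise-concepts):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def pooled_word_vectors(texts, target_words):
    """Mean-pool each target word's contextual embeddings over all texts."""
    dim = model.config.hidden_size
    sums = {w: torch.zeros(dim) for w in target_words}
    counts = {w: 0 for w in target_words}
    for text in texts:
        words = text.split()  # crude whitespace splitting, just for the sketch
        enc = tokenizer(words, is_split_into_words=True,
                        return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state.squeeze(0)
        word_ids = enc.word_ids()  # maps each subword piece to its word index
        for wid in set(i for i in word_ids if i is not None):
            word = words[wid].lower()
            if word in sums:
                pieces = [j for j, i in enumerate(word_ids) if i == wid]
                sums[word] += hidden[pieces].mean(dim=0)  # pool the pieces
                counts[word] += 1
    return {w: sums[w] / counts[w] for w in target_words if counts[w]}

texts = ["I love apple pie", "An apple a day keeps the doctor away"]
print(pooled_word_vectors(texts, ["apple"])["apple"].shape)  # torch.Size([768])
```

So every word you want in the KeyedVectors set requires embedding every text it occurs in first.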

> And again, after looking more closely at the underlying code, a "drop-in" replacement with, for instance, transformer-based embeddings (i.e. vectors) seems easiest by transforming/wrapping those into the gensim KeyedVectors format!?

How do you propose to do this? By default, transformers don't operate on complete words; they split text into subword pieces.
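For example (a quick illustration with a WordPiece tokenizer; the model name is just a common default, and the exact split depends on the vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("embeddings"))
# -> ['em', '##bed', '##ding', '##s']: one word, four subword pieces
```

So there is no ready-made single vector per vocabulary word to drop into a KeyedVectors list; you would first have to pool the pieces, as sketched above.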