Building the (word, vector) tuples requires processing all "sentences" from the input texts (the ones used for tagging with NER) with the transformer (like BERT), so that one can use the token embeddings as vectors. This might lead to different embeddings for the same word (due to the context-awareness of the transformer model). Then what, apply some (mean-/max-)pooling? It makes sense to me, but does it make sense to anyone else? :)
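In code, I am thinking of something along these lines (a rough sketch, assuming HuggingFace transformers with a fast tokenizer; the model name and the texts are only placeholders):

```python
import re
from collections import defaultdict

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

texts = ["The bank approved the loan.", "We sat on the river bank."]

sums, counts = {}, defaultdict(int)
for text in texts:
    words = [w.lower() for w in re.findall(r"\w+", text)]  # naive word split
    enc = tokenizer(words, is_split_into_words=True,
                    return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (num_pieces, hidden_size)
    # word_ids() maps every wordpiece back to the word it came from
    pieces_per_word = defaultdict(list)
    for piece_idx, word_idx in enumerate(enc.word_ids(batch_index=0)):
        if word_idx is not None:  # None marks special tokens like [CLS]/[SEP]
            pieces_per_word[word_idx].append(piece_idx)
    for word_idx, piece_idxs in pieces_per_word.items():
        word = words[word_idx]
        # mean-pool the wordpieces of this occurrence into one vector
        vec = hidden[piece_idxs].mean(dim=0)
        sums[word] = sums.get(word, 0) + vec
        counts[word] += 1

# one static vector per word, averaged over all of its contexts
word_vectors = {w: (sums[w] / counts[w]).numpy() for w in sums}
```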
And again, after looking more closely at the underlying code, a "drop-in" replacement with, for instance, transformer-based embeddings (i.e. vectors) seems easiest to achieve by converting those to, or wrapping them in, a gensim KeyedVectors format!?
It would be of great help if the documentation for custom embedding models stated this (i.e. "use KeyedVectors"), because neither the text nor the example makes it apparent that you don't need a ready-made gensim model!
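For the wrapping itself, something like this should do (a sketch, assuming gensim 4.x and the `word_vectors` dict from the snippet above):

```python
import numpy as np
from gensim.models import KeyedVectors

words = list(word_vectors)
vectors = np.stack([word_vectors[w] for w in words])

kv = KeyedVectors(vector_size=vectors.shape[1])
kv.add_vectors(words, vectors)

# now behaves like a pre-trained gensim model, e.g.:
print(kv.most_similar("bank", topn=3))
```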
Hi @repodiac, sorry for the late reply, I have been a bit busy during the past week. I guess you could process all texts to obtain mean-/max-pooled embeddings for each of the words you want to pick up on, based on the contexts you previously found them in, but this introduces significant overhead (having to embed a whole set of texts as a knowledge base for the KeyedVectors set).
> And again, after looking more closely at the underlying code, a "drop-in" replacement with, for instance, transformer-based embeddings (i.e. vectors) seems easiest to achieve by converting those to, or wrapping them in, a gensim KeyedVectors format!?
How do you propose to do this? By default, transformers don't use complete words (they operate on subword tokens).
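For example, with the bert-base-uncased tokenizer a single word is usually split into several wordpieces:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("embeddings"))
# ['em', '##bed', '##ding', '##s']  <- one word, four wordpieces
```

So any (word, vector) list first has to pool those pieces back into a single vector per word.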
Hi,
your idea of "concise concepts" sounds really intriguing! However, I would like to use transformer-based embeddings. As far as I can see from the source code, you rely on
(word, vector)
tuples in a large list, like for instance in GloVe or Word2Vec models, right? So, how could one implement this using HuggingFace models, maybe via spacy-transformers' tok2vec interface? Should I use the texts to be tagged for pretraining (i.e. "fine-tuning") a HF transformer model, and then create this list by tokenizing all words from the texts (maybe after getting rid of fill words and the like)? Afterwards I'd have the same setting as with the current models, I guess.
Or maybe I am completely off track here :-)
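In code, I imagine something like the following (a rough sketch, assuming spaCy with the en_core_web_sm model for tokenization and stop-word filtering; `embed_word_vectors` is a hypothetical stand-in for the context-pooling step, not an existing spacy-transformers or concise-concepts API):

```python
import spacy

# assumes `python -m spacy download en_core_web_sm` has been run
nlp = spacy.load("en_core_web_sm")

def candidate_words(texts):
    """Collect content words, dropping stop words, punctuation and numbers."""
    words = set()
    for doc in nlp.pipe(texts):
        for token in doc:
            if not (token.is_stop or token.is_punct or token.like_num):
                words.add(token.lower_)
    return words

# hypothetical: pool contextual embeddings per word, as discussed above
# word_vectors = embed_word_vectors(texts, candidate_words(texts))
# pairs = [(word, vec) for word, vec in word_vectors.items()]
```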