UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

CNN example in avg word example is overhead? #679

Open omerarshad opened 3 years ago

omerarshad commented 3 years ago

So, in the example:

```python
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('bert-base-uncased')

cnn = models.CNN(in_word_embedding_dimension=word_embedding_model.get_word_embedding_dimension(),
                 out_channels=256, kernel_sizes=[1, 3, 5])

# Apply mean pooling to get one fixed-sized sentence vector
pooling_model = models.Pooling(cnn.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, cnn, pooling_model])
```

What is the purpose of applying a CNN here? Isn't it better to just take the mean of the output of `word_embedding_model`? It looks as if the CNN is just overhead. Can anyone explain?
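For reference, the simpler setup the question is suggesting, mean pooling directly over the transformer output with no CNN in between, would look roughly like this (a sketch based on the same example, with the CNN module simply dropped):

```python
from sentence_transformers import SentenceTransformer, models

# BERT already produces contextualized token embeddings
word_embedding_model = models.Transformer('bert-base-uncased')

# Mean pooling over the token embeddings gives one fixed-sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```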

nreimers commented 3 years ago

Hi @omerarshad, it is just an example. With a transformer model as `word_embedding_model`, there is no need for a CNN layer.

However, if you use GloVe embeddings, then adding a CNN layer on top makes sense.
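A minimal sketch of that GloVe-plus-CNN setup (assuming a local `glove.6B.300d.txt.gz` file, as in the repository's avg_word_embeddings examples): the CNN adds local context over neighboring words, which static GloVe vectors lack on their own.

```python
from sentence_transformers import SentenceTransformer, models

# Static (non-contextual) GloVe vectors as the word embedding layer
word_embedding_model = models.WordEmbeddings.from_text_file('glove.6B.300d.txt.gz')

# The CNN mixes in local context from neighboring words before pooling
cnn = models.CNN(in_word_embedding_dimension=word_embedding_model.get_word_embedding_dimension(),
                 out_channels=256, kernel_sizes=[1, 3, 5])

# Mean pooling over the CNN outputs gives one fixed-sized sentence vector
pooling_model = models.Pooling(cnn.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)

model = SentenceTransformer(modules=[word_embedding_model, cnn, pooling_model])
```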

omerarshad commented 3 years ago

Exactly my point. Is there a way to use word embeddings from BERT such that no cross-encoding is applied, so that I only get individual word embeddings and then apply a CNN on them? Actually, I want to train a CNN that can take an input of 1000 words, and I just want to encode each word using BERT and pass it to the CNN.

nreimers commented 3 years ago

> Is there a way to use word embeddings from BERT such that no cross-encoding is applied, so that I only get individual word embeddings and then apply a CNN on them?

This does not make sense. Applying BERT only makes sense for getting contextualized word embeddings, not for getting embeddings of individual words in isolation.
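For illustration (this snippet is not from the thread; it uses the Hugging Face `transformers` library directly, and the sentences are made up): the same word gets a different BERT vector depending on its context, which is exactly what is lost if you embed words one at a time.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained('bert-base-uncased')
bert = AutoModel.from_pretrained('bert-base-uncased')

def word_vec(sentence, word):
    # Return the contextualized embedding of `word` within `sentence`
    enc = tok(sentence, return_tensors='pt')
    with torch.no_grad():
        out = bert(**enc).last_hidden_state[0]
    word_id = tok.convert_tokens_to_ids(word)
    idx = (enc['input_ids'][0] == word_id).nonzero()[0, 0]
    return out[idx]

v1 = word_vec("I deposited cash at the bank.", "bank")
v2 = word_vec("We had a picnic on the river bank.", "bank")
# Clearly below 1.0: the context changes the vector for the same word
print(torch.cosine_similarity(v1, v2, dim=0))
```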

Using GloVe (or word2vec, etc.) would be the much better solution.

You get this by replacing the first line like this:

```python
word_embedding_model = models.WordEmbeddings.from_text_file('glove.6B.300d.txt.gz')
```

https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/avg_word_embeddings/training_stsbenchmark_avg_word_embeddings.py

omerarshad commented 3 years ago

Yes, but this type of solution has several issues: OOV words and a large word-vector file.

timpal0l commented 3 years ago

FastText can handle OOV better than word2vec/GloVe, I think, due to its subword tokenization. But as Nils said, it makes no sense to use a contextualized language model to generate single word embeddings. If you have a domain that does not work well with a pretrained word embedding model, you could train your own.
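As a rough sketch of that last suggestion (not from this thread; it assumes gensim's FastText implementation and a toy corpus): character n-grams let the model compose a vector even for a word it never saw during training.

```python
from gensim.models import FastText

# Toy corpus; in practice this would be your own domain text
sentences = [
    ["deep", "learning", "for", "text"],
    ["sentence", "embeddings", "from", "word", "vectors"],
]

# Train a small FastText model; subword n-grams are built automatically
model = FastText(sentences, vector_size=100, window=3, min_count=1, epochs=10)

# "embeddingz" never appears in the corpus, but FastText composes a vector
# for it from character n-grams shared with "embeddings"
vec = model.wv["embeddingz"]
print(vec.shape)  # (100,)
```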