UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Custom Model Question on Dense Layer #1615

Open mzl9039 opened 2 years ago

mzl9039 commented 2 years ago

Hi, I'm new to NLP and trying to pre-train a transformer, but the default embedding dimension is high, so I added a linear layer following the demo below:

from sentence_transformers import SentenceTransformer, models
from torch import nn

word_embedding_model = models.Transformer('all-MiniLM-L6-v2', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=64, activation_function=nn.Tanh())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])

I use TripletLoss to pre-train the sentence embeddings.
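
Roughly like this, using the model defined above (the triplets and hyperparameters below are just placeholders, not my real data):

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses

# Placeholder triplets: (anchor, positive, negative)
train_examples = [
    InputExample(texts=["anchor sentence", "similar sentence", "unrelated sentence"]),
    # ... more triplets from my dataset
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)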

But after pre-training, I found that almost all the sentence embeddings look like [-0.99935, 0.99925, -0.99934, 0.99956, ...]: every entry is close to -1 or 1, which is obviously caused by nn.Tanh. If I remove the dense layer from the model, the embeddings look fine.
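
For reference, this is roughly how I inspected the values (the sentences below are just placeholders):

# Placeholder sentences; the real ones come from my dataset
sentences = ["first example sentence", "second example sentence"]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Nearly every value sits at the edges of tanh's range
print(embeddings.min().item(), embeddings.max().item())
print((embeddings.abs() > 0.99).float().mean().item())  # fraction of saturated entries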

Could you help explain this?

perceptiveshawty commented 2 years ago

Try reducing your max sequence length (128, 64, etc., depending on your task) and increasing the dimension of the dense layer.

The word embeddings from language models are usually 512- or 768-dimensional depending on the variant. 64 dimensions is too small to encode a whole sentence/paragraph, and this constraint on the representation will get worse the more homogeneous your data is.
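
Something along these lines, reusing your setup (the exact values are just a starting point, not a recommendation):

from sentence_transformers import SentenceTransformer, models
from torch import nn

word_embedding_model = models.Transformer('all-MiniLM-L6-v2', max_seq_length=128)  # shorter sequences
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(
    in_features=pooling_model.get_sentence_embedding_dimension(),
    out_features=256,  # larger output dimension than 64
    activation_function=nn.Tanh(),
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])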

good luck :)