UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Creating embeddings before training model #2930

Open jinaduuthman opened 1 month ago

jinaduuthman commented 1 month ago

@tomaarsen, Hi, I am using the SBERT Trainer, specifically with triplet pairs [query, positive, negative]. Now I need to add some text to the query ([query + some_long_text, positive, negative]), but the result would be longer than the max_seq_length and I don't want it truncated.

I read somewhere that I can create an embedding for the some_long_text and pass this to the model during training. This seems odd to me, since I would be concatenating an embedding with raw text that way. I have also read a thread here saying that creating embeddings before feeding them into the model prevents the model from adjusting the pretrained weights. Is there a better way to do this?

Note that I am using MultipleNegativesRankingLoss.
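
For reference, a minimal sketch of the setup being described (the model name, column names, and data are placeholders, not my actual setup): triplet training with SentenceTransformerTrainer and MultipleNegativesRankingLoss, where the long text appended to the query is what risks exceeding the maximum sequence length.

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# (anchor, positive, negative) triplets; the long text appended to the query
# is what can exceed model.max_seq_length and get truncated.
train_dataset = Dataset.from_dict({
    "anchor": ["original query" + " some_long_text"],
    "positive": ["a relevant passage"],
    "negative": ["an unrelated passage"],
})

loss = MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```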

jinaduuthman commented 1 month ago

@tomaarsen

tomaarsen commented 1 month ago

Hello!

Apologies for the delayed response.

I read somewhere that I can create an embedding for the some_long_text and pass this to the model during training.

I haven't heard about this yet.

I have also read a thread here saying that creating embeddings before feeding them into the model prevents the model from adjusting the pretrained weights. Is there a better way to do this?

I think this was likely referring to the case where you create all embeddings before training and then reuse those: you would no longer be iteratively updating the model weights, which is required to actually train a better model. There are a few reasons why that doesn't work, but in short, gradient descent has nothing to update.
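
As a minimal illustration of that point (plain PyTorch, with a toy linear layer standing in for the embedding model): embeddings computed up front are constants detached from the model's parameters, so the loss cannot produce gradients for them, whereas embeddings computed inside the training step do propagate gradients back into the weights.

```python
import torch

model = torch.nn.Linear(4, 4)   # toy stand-in for the embedding model
x = torch.randn(2, 4)

# Embeddings precomputed before training: detached from the graph.
with torch.no_grad():
    frozen_emb = model(x)
# frozen_emb.sum().backward() would raise an error here: nothing in it
# depends on the current weights, so there is nothing for gradient
# descent to update.

# Embeddings computed during the training step: gradients flow back.
live_emb = model(x)
live_emb.sum().backward()
print(model.weight.grad is not None)  # True -> the weights can be updated
```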

I don't think there's a convenient way to avoid the truncation, I'm afraid.
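
For completeness, the truncation in question comes from the model's max_seq_length; a quick way to inspect it is below (the model name is only an example). Raising it only helps up to the limit of the backbone's position embeddings, so very long appended text will still be cut off.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(model.max_seq_length)  # e.g. 256; tokens beyond this are dropped

# Lowering is always possible; raising only helps up to the backbone's
# position-embedding limit.
model.max_seq_length = 128
```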

jinaduuthman commented 1 month ago

Thank you for your response.