UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

MatryoshkaLoss and AdaptiveLayerLoss (related to ESE paper) - outdated paper reference or implications of superior dimensionality reduction strategy? #2682


bobox2997 commented 1 month ago

In the SBERT repository, I found that the adaptive layers method references this paper: ESE: Espresso Sentence Embeddings. However, it retains the name (2DMSE) from the preprint paper, 2D Matryoshka Sentence Embeddings.

The documentation explains that this "adaptive layers" method (AdaptiveLayerLoss) can be combined with MatryoshkaLoss (as referenced in Matryoshka Representation Learning, MRL). However, the paper that is actually cited as the reference for adaptive layers already includes its own dimensionality reduction strategy, which the paper shows to outperform MRL.

Is Matryoshka2dLoss an implementation of the ESE paper (the current reference for the adaptive layer loss), or does it only take the layer reduction from the ESE paper and the dimensionality reduction from MRL? I ask because MRL seems to introduce much more degradation, while some strategies from ESE: Espresso Sentence Embeddings seem to "increase" performance, perhaps due to the PCA on embedding dependencies and/or the weighted loss applied to different layers.

Thanks in advance!

(Perhaps I misunderstood everything, and in that case, I sincerely apologize)

tomaarsen commented 1 month ago

Hello!

I wasn't actually aware of the ESE paper! I implemented 2DMSE within the week that the v1 preprint of that paper came out, so it still uses the "original" implementation. It's indeed just dimensionality reduction using MRL.

MatryoshkaLoss implementation: https://github.com/UKPLab/sentence-transformers/blob/684b6b5736c4551c17c62830e975e78edcca0fa0/sentence_transformers/losses/MatryoshkaLoss.py#L132-L133 We somewhat hackishly wrap the model's forward method so that it truncates the embeddings to a specific dimension and normalizes them, and then we call an underlying loss function. We repeat this for multiple dimensions, with caching so that the same embeddings don't need to be recomputed for each dimension, only truncated again.
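For intuition, the per-dimension step is roughly the following (just a minimal sketch rather than the actual library code; the batch shape and dimension list are made up for illustration):

```python
import torch
import torch.nn.functional as F

def truncate_and_normalize(embeddings: torch.Tensor, dim: int) -> torch.Tensor:
    # Keep only the first `dim` components, then re-normalize:
    # the MRL-style truncation described above.
    return F.normalize(embeddings[..., :dim], p=2, dim=-1)

# Compute the full embeddings once and reuse them (the caching mentioned above):
full_embeddings = torch.randn(32, 768)  # placeholder for a batch of sentence embeddings
for dim in (768, 512, 256, 128, 64):
    reduced = truncate_and_normalize(full_embeddings, dim)
    # ... the underlying loss is then evaluated on `reduced` for this dimensionality
```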

AdaptiveLayerLoss implementation: https://github.com/UKPLab/sentence-transformers/blob/684b6b5736c4551c17c62830e975e78edcca0fa0/sentence_transformers/losses/AdaptiveLayerLoss.py#L204-L205 We first call the underlying loss normally, but cache the embeddings at every layer. Then we iterate over all non-final layers and recompute the loss using the cached embeddings at that layer. On top of that, we add a KL divergence between each intermediary layer's embeddings and the final embeddings.
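Conceptually, the per-layer loop is something like this (again only a rough sketch; `underlying_loss`, the similarity-matrix KL formulation, and the temperature are illustrative placeholders rather than the exact implementation):

```python
import torch
import torch.nn.functional as F

def adaptive_layer_loss_sketch(layer_embeddings, final_embeddings, underlying_loss, temperature=1.0):
    # `layer_embeddings`: cached embedding tensors, one per non-final layer.
    # `underlying_loss`: assumed to map a batch of embeddings to a scalar loss.
    final_sim = final_embeddings @ final_embeddings.T
    losses = []
    for emb in layer_embeddings:
        layer_loss = underlying_loss(emb)  # recompute the base loss at this layer
        layer_sim = emb @ emb.T
        # KL divergence pulling this layer's similarity distribution
        # towards that of the final layer.
        kl = F.kl_div(
            F.log_softmax(layer_sim / temperature, dim=-1),
            F.softmax(final_sim / temperature, dim=-1),
            reduction="batchmean",
        )
        losses.append(layer_loss + kl)
    return torch.stack(losses).mean()
```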

Matryoshka2dLoss implementation: https://github.com/UKPLab/sentence-transformers/blob/684b6b5736c4551c17c62830e975e78edcca0fa0/sentence_transformers/losses/Matryoshka2dLoss.py#L92-L107 It's literally just a combination of both losses. Each of the previous two losses accepts another loss as an argument, so Matryoshka2dLoss is simply:

```python
loss = MultipleNegativesRankingLoss(model)
loss = AdaptiveLayerLoss(model, MatryoshkaLoss(model, loss, ...))
```
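As a rough end-to-end sketch (the base model, toy training pairs, and dimension list below are placeholders, just to show how the pieces fit together):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder base model and toy training pairs.
model = SentenceTransformer("bert-base-uncased")
train_examples = [
    InputExample(texts=["A man is eating food.", "A man eats something."]),
    InputExample(texts=["A plane is taking off.", "An airplane departs."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MatryoshkaLoss handles the dimensionality reduction, AdaptiveLayerLoss the layer reduction.
base_loss = losses.MultipleNegativesRankingLoss(model)
loss = losses.AdaptiveLayerLoss(
    model,
    losses.MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64]),
)

model.fit(train_objectives=[(train_dataloader, loss)], epochs=1)
```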

So, this implementation does not use PCA anywhere; it's MRL only. I would be happy to accept a pull request for an ESELoss, as the results seem quite promising!

Edit: Because ESE consists of 2 steps, I'm not sure if 1 ESELoss could cover it. Perhaps they need to be implemented in 2 separate loss functions.

SeanLee97 commented 1 month ago

@bobox2997 @tomaarsen Sorry for the confusion. We updated the arXiv paper a few weeks ago (changing the name from 2DMSE to Espresso). Since the two share a similar idea, we didn't create a new arXiv submission.

cc @csroyli

tomaarsen commented 1 month ago

Thanks for clarifying! I'll try to dive into the paper when I have a bit more time (it's a bit hectic with the release now) - you always do promising work.