UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.03k stars · 2.45k forks

question on TSDAE #2976

Open yangyangyyy123 opened 3 days ago

yangyangyyy123 commented 3 days ago

I'm looking for a way to generate sentence embeddings in a completely unsupervised way, and came across your work on TSDAE (https://huggingface.co/kwang2049/TSDAE-scidocs2nli_stsb). I was hoping you could help me understand it better; thanks in advance!

The decoder sets $K$ and $V$ both to the single CLS output embedding (the context vector) from the encoder, while $Q$ is the list of embeddings from the previous layer, of length $t$. Normally in the attention calculation, the end result is a weighted sum of $V$, based on the affinity (dot product) between each element of $Q$ and all elements of $K$. But in TSDAE, because there is only one embedding in $V$, all the affinity weights are 1, so the weighted sum is always just $V$ itself. So as a result, does each layer simply put $V$ (the CLS output embedding) at every position of every layer?

From the matrix multiplication formula it's also clear: $\text{softmax}(QK^T)V$ (the $\sqrt{d}$ scaling is dropped since it does not affect the problem).

$Q$ is $H_{\text{previous}}$ and has dimension $t \times d$; $K$ is the CLS embedding and has dimension $1 \times d$. So the affinity $QK^T$ has dimension $t \times 1$, and after the softmax, since each row has only one element, every row becomes 1.0. Then $\text{softmax}(QK^T)V$ must equal $V$ repeated at every position, i.e. `V.expand(t, -1)`??
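A quick numeric check of this reasoning (a standalone PyTorch sketch with arbitrary shapes, not TSDAE's actual code):

```python
import torch

t, d = 5, 8                    # decoder sequence length, hidden size
Q = torch.randn(t, d)          # queries from the previous decoder layer
K = torch.randn(1, d)          # the single CLS embedding as the only key...
V = K.clone()                  # ...and as the only value

scores = Q @ K.T / d ** 0.5               # affinity, shape (t, 1)
weights = torch.softmax(scores, dim=-1)   # softmax over a single element -> all 1.0
out = weights @ V                         # shape (t, d)

print(torch.allclose(weights, torch.ones(t, 1)))  # True
print(torch.allclose(out, V.expand(t, -1)))       # True: every row is V
```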

Am I missing something?

ScottishFold007 commented 1 day ago

Your understanding of the mechanism in the TSDAE (Transformer-based Sequential Denoising Auto-Encoder) model seems mostly accurate, but let's clarify a few points to make sure there's no confusion, especially regarding the attention mechanism and how it's applied in this context.

In a typical Transformer model, the attention mechanism is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices respectively, and $d_k$ is the dimensionality of the keys.

In TSDAE, as you've described, the decoder uses the CLS token embedding from the encoder as both the key ($K$) and the value ($V$) in its cross-attention, while the queries ($Q$) come from the previous layer's output. This setup is indeed different from standard self-attention, where $Q$, $K$, and $V$ are all derived from the same input sequence, and from standard encoder-decoder attention, where $Q$ comes from the decoder but $K$ and $V$ are the full sequence of encoder outputs rather than a single pooled vector.

Given that $K$ (and thus $V$) is a single vector (the CLS embedding), the dot product $QK^T$ results in a matrix of shape $t \times 1$, where $t$ is the sequence length of the decoder input. After applying the softmax, each element in this matrix will indeed be 1, because there's only one element in each "row" to normalize over. This means the attention weights are uniform and equal to 1.

However, the interpretation that "each layer simply puts $V$ (the CLS output embedding) on each position of each layer" is misleading if taken to mean the positions become indistinguishable. It's true that the cross-attention output at every position is exactly the $V$ vector (the CLS embedding), but that output is not the end of the story. Each decoder layer also contains a self-attention sub-layer over the decoder's own position-specific inputs, residual connections that add those position-specific inputs back onto every sub-layer's output, and a feed-forward network (FFN). So while the cross-attention contributes the same CLS-derived vector to every position, the residual stream and the self-attention keep the positions distinct, and the FFN can further process the combined information.
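To make the residual point concrete, here is a minimal sketch of the cross-attention step of one decoder layer (illustrative names and shapes, not TSDAE's actual implementation):

```python
import torch

t, d = 5, 8
H = torch.randn(t, d)       # position-specific decoder states (the queries)
cls = torch.randn(1, d)     # encoder's CLS sentence embedding (key = value)

# Cross-attention: the softmax over a single key gives weight 1.0 everywhere,
# so the attention output is the same CLS-derived vector at every position.
attn_out = torch.softmax(H @ cls.T / d ** 0.5, dim=-1) @ cls  # (t, d), identical rows

# Residual connection: the position-specific states are added back in,
# so the sub-layer output differs from position to position.
out = H + attn_out

print(torch.allclose(attn_out[0], attn_out[1]))  # True: identical rows
print(torch.allclose(out[0], out[1]))            # False: positions stay distinct
```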

In essence, the TSDAE model leverages the CLS token embedding as a compact representation of the input sequence, and through the decoder's attention mechanism and FFNs, it attempts to reconstruct the original input from this compact representation. This approach is part of what enables TSDAE to learn useful sentence embeddings in an unsupervised manner, as it forces the model to capture the most salient information in the CLS token embedding to successfully perform the denoising task.
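For reference, this is roughly how the denoising objective is set up with this library's TSDAE utilities, following the documented example (`train_sentences` is a placeholder for your own unlabeled corpus):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

model_name = "bert-base-uncased"
word_embedding_model = models.Transformer(model_name)
# TSDAE pools with the CLS token; this vector becomes the decoder's only key/value.
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_sentences = ["..."]  # your unlabeled sentences
# Applies deletion noise on the fly, yielding (noisy sentence, original sentence) pairs.
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# The decoder must reconstruct the original sentence from the pooled CLS embedding.
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path=model_name, tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
)
```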

So, while your mathematical interpretation of the attention mechanism's output in this specific setup is correct, remember that the overall context of the model's architecture allows for more nuanced processing and learning from this seemingly simple operation.

yangyangyyy123 commented 1 day ago

Thanks!

> the cross-attention output at every position is exactly the $V$ vector (the CLS embedding)

To make sure: is it true that in the decoder, at every layer and every position, the output of the cross-attention sub-layer itself is the same $V$?

If so, isn't it true that no matter what the FFN does, its output (and hence everything downstream, all the way to the decoder output) would also be the same across all positions, since the FFN is applied position-wise to identical inputs?
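To illustrate the concern: a position-wise FFN applied to identical rows necessarily produces identical rows (a standalone sketch, not TSDAE's code):

```python
import torch
import torch.nn as nn

d = 8
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

v = torch.randn(1, d)
x = v.expand(5, -1)   # the same vector at all 5 positions

y = ffn(x)            # the FFN acts on each position independently
print(torch.allclose(y[0], y[1]))  # True: identical inputs -> identical outputs
```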