UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Using `LayerNorm` before PCA while performing embedding dimensionality reduction. #2657

Open Adversarian opened 2 months ago

Adversarian commented 2 months ago

Hi, I'd like to begin by thanking you for your tremendous work on this library.

I had a question regarding your dimensionality_reduction.py example.

This is where you fit a PCA on top of the embeddings obtained by your model:

# To determine the PCA matrix, we need some example sentence embeddings.
# Here, we compute the embeddings for 20k random sentences from the AllNLI dataset
pca_train_sentences = nli_sentences[0:20000]
train_embeddings = model.encode(pca_train_sentences, convert_to_numpy=True)

# Compute PCA on the train embeddings matrix
pca = PCA(n_components=new_dimension)
pca.fit(train_embeddings)
pca_comp = np.asarray(pca.components_)

I was wondering if it would be better to first append a LayerNorm with elementwise_affine=False to the model, so that PCA receives standardized inputs. I've extended sentence-transformers' models.LayerNorm so that it accepts additional args and kwargs for self.norm, and ran this experiment on my own dataset (which, unfortunately, I'm not at liberty to share); it seems to perform better than plain PCA with no LayerNorm.
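For reference, this is roughly what I mean by the extension (a minimal sketch; ConfigurableLayerNorm is just an illustrative name, and the usage lines are commented out because the exact way you append it will depend on your pipeline):

import torch.nn as nn
from sentence_transformers.models import LayerNorm

class ConfigurableLayerNorm(LayerNorm):
    """models.LayerNorm that forwards extra args/kwargs to the underlying torch.nn.LayerNorm."""
    def __init__(self, dimension: int, *args, **kwargs):
        super().__init__(dimension)
        # Rebuild self.norm so options such as elementwise_affine=False are honored
        self.norm = nn.LayerNorm(dimension, *args, **kwargs)

# Appended to the model before fitting PCA, so PCA sees standardized embeddings:
# norm = ConfigurableLayerNorm(model.get_sentence_embedding_dimension(), elementwise_affine=False)
# model.add_module("norm", norm)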

I was just wondering if somehow I got lucky with my particular data or if it's something to actually consider when performing dimensionality reduction.

Thanks in advance!

Jakobhenningjensen commented 2 months ago

You can just use sklearn's normalize on the un-normalized embeddings:

from sklearn.preprocessing import normalize
pca_train_sentences = nli_sentences[0:20000]
train_embeddings = model.encode(pca_train_sentences, convert_to_numpy=True)
normalized_embeddings = normalize(train_embeddings)

# Compute PCA on the train embeddings matrix
pca = PCA(n_components=new_dimension)
pca.fit(normalized_embeddings)
pca_comp = np.asarray(pca.components_)

Or just pass normalize_embeddings=True to model.encode:

pca_train_sentences = nli_sentences[0:20000]
train_embeddings = model.encode(pca_train_sentences, convert_to_numpy=True, normalize_embeddings=True)

# Compute PCA on the train embeddings matrix
pca = PCA(n_components=new_dimension)
pca.fit(train_embeddings)
pca_comp = np.asarray(pca.components_)

Adversarian commented 1 month ago

@Jakobhenningjensen Thanks for your response!

The idea is to avoid any external preprocessing and have the model perform end-to-end forward passes natively. That's why the example loads the PCA components into a Dense layer instead of calling pca.transform on new inputs every time, and it's also why your first suggestion isn't desirable: it would require running sklearn's normalize at every inference call.
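For context, this is roughly what the example does after fitting the PCA (a sketch from memory of dimensionality_reduction.py, so the exact arguments may differ slightly):

import torch
from sentence_transformers import models

# Load the PCA components into a Dense layer so the projection becomes part of the model
dense = models.Dense(
    in_features=model.get_sentence_embedding_dimension(),
    out_features=new_dimension,
    bias=False,
    activation_function=torch.nn.Identity(),
)
dense.linear.weight = torch.nn.Parameter(torch.tensor(pca_comp))
model.add_module("dense", dense)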

Secondly, LayerNorm and torch.nn.functional.normalize (which is what normalize_embeddings=True applies) do very different things. Since PCA is sensitive to the scale of the data, it's good practice to standardize it before fitting a PCA, and that is what LayerNorm with elementwise_affine=False does to each embedding: it shifts and rescales it to zero mean and unit variance across its dimensions (leaving elementwise_affine=True but never training the affine parameters has the same effect). torch.nn.functional.normalize, on the other hand, simply divides each tensor by its $L_p$ norm so that all tensors have unit length in $L_p$ space. I'm not sure whether these two scenarios are mathematically equivalent from the point of view of PCA; I'm just pointing out the difference.
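A tiny illustration of the difference, using the functional forms (the shape and numbers are just for demonstration):

import torch
import torch.nn.functional as F

x = torch.randn(4, 384) * 3 + 5                 # a batch of 4 fake "embeddings"

standardized = F.layer_norm(x, (x.shape[-1],))  # LayerNorm without affine parameters
unit_length = F.normalize(x, p=2, dim=-1)       # what normalize_embeddings=True does

print(standardized.mean(dim=-1), standardized.std(dim=-1))  # ~0 and ~1 per embedding
print(unit_length.norm(p=2, dim=-1))                        # exactly 1 per embedding
print(standardized.norm(p=2, dim=-1))                       # ~sqrt(384), not 1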