UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Questions about PCA #833

Open minyoung90 opened 3 years ago

minyoung90 commented 3 years ago

I have about 6 million sentences, and my embedding vector size is 768 using SBERT. The problem is that the embedding data is too large: 6 million sentences produce over 200 GB. I never realized float data could get this large! The SBERT documentation suggests PCA as one of the best solutions, but what I am wondering is:

Q1) Do I need all 6 million embedding vectors to compute the PCA components?

From what I learned, PCA needs all the data because it has to compute the covariance matrix.

Q2) If I can't use PCA, do you have any other suggestions?

nreimers commented 3 years ago

6 million sentences should take about 18 GB (6,000,000 × 768 dims × 4 bytes). If you store them as FP16, you only need about 9 GB.

There must be some other issue in your data structure that adds a lot of overhead. You can save the embeddings as a single NumPy matrix of shape 6 million × 768, which is quite storage-efficient (see the sketch below).
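For reference, a minimal sketch of what this could look like (not from the thread): the model name, batch size, and file name are placeholders, and the sentence list stands in for your ~6 million sentences.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder input; replace with your ~6 million sentences
sentences = ["example sentence one", "example sentence two"]

# Any 768-dim SBERT model works; "all-mpnet-base-v2" is just an example
model = SentenceTransformer("all-mpnet-base-v2")

# encode() returns an (n_sentences, 768) float32 NumPy array
embeddings = model.encode(sentences, batch_size=256, show_progress_bar=True)

# float32: 6e6 * 768 * 4 bytes ≈ 18 GB; casting to float16 halves that to ≈ 9 GB
embeddings = embeddings.astype(np.float16)

# Store everything as one contiguous matrix on disk
np.save("embeddings_fp16.npy", embeddings)
```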

Regarding PCA: No, you don't need all embeddings. A small (representative) subset is sufficient, e.g. sample 50k embeddings and compute the PCA from that.
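As an illustration of that last point, here is a sketch (not from the thread) that fits PCA on a random 50k sample with scikit-learn and then projects all embeddings; the file names and the target dimensionality of 128 are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Load the FP16 matrix from the previous step; cast up for the PCA fit
embeddings = np.load("embeddings_fp16.npy").astype(np.float32)

# Fit PCA on a small representative sample instead of all 6 million vectors
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(embeddings), size=min(50_000, len(embeddings)), replace=False)
pca = PCA(n_components=128)
pca.fit(embeddings[sample_idx])

# Project the full set down to 128 dims and store it as FP16 again
reduced = pca.transform(embeddings).astype(np.float16)
np.save("embeddings_pca128.npy", reduced)
```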