UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Questions about PCA #833

Open minyoung90 opened 3 years ago

minyoung90 commented 3 years ago

I have about 6 million sentences, and my embedding vector size is 768 using SBERT. The problem is that the embedding data is too large: 6 million sentences produce over 200 GB. I never realized float data could get this large! The SBERT documentation suggests PCA as one of the best solutions, but what I am wondering is:

Q1) Do I need all 6 million embedding vectors to compute the PCA components?

From what I learned, PCA needs all the data because it has to compute the covariance matrix.

Q2) If I can't use PCA, do you have any other suggestions?

nreimers commented 3 years ago

6 million sentences should take about 18 GB (6,000,000 × 768 dims × 4 bytes). If you store them as FP16, you only need about 9 GB.

There must be some other issue in your data structure that adds a lot of overhead. You can save the embeddings as a single NumPy matrix of shape 6 million × 768, which is quite storage-efficient (see the sketch below).
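For reference, a minimal sketch of what this could look like (not from the thread): the model name, batch size, and file name are placeholders, and the sentence list stands in for your ~6 million sentences.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder input; replace with your ~6 million sentences
sentences = ["example sentence one", "example sentence two"]

# Any 768-dim SBERT model works; "all-mpnet-base-v2" is just an example
model = SentenceTransformer("all-mpnet-base-v2")

# encode() returns an (n_sentences, 768) float32 NumPy array
embeddings = model.encode(sentences, batch_size=256, show_progress_bar=True)

# float32: 6e6 * 768 * 4 bytes ≈ 18 GB; casting to float16 halves that to ≈ 9 GB
embeddings = embeddings.astype(np.float16)

# Store everything as one contiguous matrix on disk
np.save("embeddings_fp16.npy", embeddings)
```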

Regarding PCA: No, you don't need all embeddings. A small (representative) subset is sufficient, e.g. sample 50k embeddings and compute the PCA from that.
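As an illustration of that last point, here is a sketch (not from the thread) that fits PCA on a random 50k sample with scikit-learn and then projects all embeddings; the file names and the target dimensionality of 128 are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Load the FP16 matrix from the previous step; cast up for the PCA fit
embeddings = np.load("embeddings_fp16.npy").astype(np.float32)

# Fit PCA on a small representative sample instead of all 6 million vectors
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(embeddings), size=min(50_000, len(embeddings)), replace=False)
pca = PCA(n_components=128)
pca.fit(embeddings[sample_idx])

# Project the full set down to 128 dims and store it as FP16 again
reduced = pca.transform(embeddings).astype(np.float16)
np.save("embeddings_pca128.npy", reduced)
```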