Muennighoff / sgpt

SGPT: GPT Sentence Embeddings for Semantic Search
https://arxiv.org/abs/2202.08904
MIT License

Possible to quantize into 4-bit and 8-bit and still use the models #24

Open regstuff opened 1 year ago

regstuff commented 1 year ago

Hi, I was wondering whether it's possible to apply something like GPTQ quantization to 8 or 4 bits and still use the embeddings from the models. GPTQ 4-bit models perform quite well compared to fp16 and fp32 in text generation. I wasn't sure whether this would work for embeddings. Any suggestions?

Muennighoff commented 1 year ago

I haven't looked into that. It would likely reduce the expressivity of the embeddings, so I would expect worse results, but it may still be good enough to make the saved compute worth it.

In standard language modeling, the final output vectors are reduced to discrete tokens, so being off by e.g. 0.0001 due to reduced precision may not change the generated token, and the performance impact is small. In embeddings, however, the continuous output vectors are used directly for comparison with other vectors, e.g. via cosine similarity, so being off by 0.0001 will generally change the resulting similarity score.
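
To make the argument above concrete, here is a minimal toy sketch in NumPy. It is not GPTQ and does not use the SGPT models; the vectors, the `eps` value, and the `cosine` helper are hypothetical and chosen by hand only to illustrate the point. An error of 0.0001 leaves the argmax token unchanged, while the same error shifts a cosine similarity score.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


eps = 1e-4  # size of the precision error discussed above (illustrative)

# Generation: logits are reduced to a discrete token via argmax.
# Hand-picked toy logits; the perturbation pushes against the top token.
logits = np.array([2.0, 1.0, 0.5, -1.0])
noisy_logits = logits + eps * np.array([-1.0, 1.0, 1.0, 1.0])
same_token = int(np.argmax(logits)) == int(np.argmax(noisy_logits))

# Embeddings: the continuous vectors are compared directly.
# Hand-picked toy 2-D "embeddings"; only one coordinate is perturbed.
query = np.array([1.0, 0.0])
doc = np.array([1.0, 1.0])
noisy_doc = doc + np.array([0.0, eps])

exact = cosine(query, doc)
approx = cosine(query, noisy_doc)
score_shift = abs(exact - approx)

print("argmax token unchanged:", same_token)
print("cosine similarity shift:", score_shift)
```

Whether a shift of this size matters in practice depends on how close the scores of competing documents are, so the effect of a quantized model would have to be measured on an actual retrieval task.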