Possible to quantize into 4-bit and 8-bit and still use the models

Muennighoff / sgpt

SGPT: GPT Sentence Embeddings for Semantic Search

MIT License

841 stars 51 forks source link

I havn't looked into that. It would likely reduce the expressivity of the embeddings, so I would expect worse results, but it may still be good enough to make the saved compute worth it.

In usual language model modelling the final output vectors are reduced to discrete tokens, so being off by e.g. 0.0001 due to precision may not change the generated token, hence performance impacts are small. In embeddings, however, the continuous output vectors are directly used to compare with other vectors e.g. via cosine similarity. Being off by 0.0001 is guaranteed to change the resulting similarity score.

Muennighoff / sgpt

Possible to quantize into 4-bit and 8-bit and still use the models #24