Open · lejinvarghese opened 5 days ago

Is it possible to perform Quantization Aware Training on Sentence Transformers, beyond the fp16 and bf16 that are directly supported in the transformers training_args? Are there other options for doing binary quantization during training, other than using the Intel Neural Compressor INCTrainer or OpenVINO OVTrainer?
Hello!
I'm afraid the INCTrainer/OVTrainer/ORTTrainer aren't directly compatible with Sentence Transformers. Beyond that, you can load models in a specific quantization configuration with bitsandbytes via model_kwargs, and you can use PEFT as well (though that doesn't do quantization per se).
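In case it helps, here is a minimal sketch of that model_kwargs route (not from this thread; the model name and options are just examples, and the exact behavior depends on your sentence-transformers, transformers, and bitsandbytes versions):

```python
# Illustrative sketch: pass loading options to the underlying transformers model
# through model_kwargs when constructing the SentenceTransformer.
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",  # example model, swap in your own
    model_kwargs={
        # Half-precision loading (bf16), analogous to the bf16 training flag.
        "torch_dtype": torch.bfloat16,
        # For bitsandbytes quantization instead, you could swap in something like:
        #   from transformers import BitsAndBytesConfig
        #   "quantization_config": BitsAndBytesConfig(load_in_8bit=True),
        # which requires a CUDA GPU and the bitsandbytes package installed.
    },
)

embeddings = model.encode(["An example sentence to embed."])
print(embeddings.shape)
```

Note that this only changes how the model is loaded for inference; it isn't quantization aware training.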
I do want to preface that there's a difference between model quantization as covered in Speeding up Inference and Embedding Quantization: the former allows for faster inference, while the latter is a post-processing of the output embeddings so that downstream tasks (e.g. retrieval) are faster.
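To make that second notion concrete, here is a small post-processing sketch (assuming a sentence-transformers version that ships quantize_embeddings; the model and shapes are illustrative):

```python
# Illustrative sketch: quantize float32 embeddings to packed binary vectors
# after encoding, so retrieval can use compact vectors and Hamming distance.
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(["The weather is lovely today.", "It is sunny outside."])

# "binary" packs each dimension into a single bit (~32x smaller than float32).
binary_embeddings = quantize_embeddings(embeddings, precision="binary")

print(embeddings.shape, embeddings.dtype)                # e.g. (2, 384) float32
print(binary_embeddings.shape, binary_embeddings.dtype)  # e.g. (2, 48) int8
```

The model itself is unchanged here; only the returned embeddings are quantized.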
So there's a difference between your two questions: quantization aware training of the model itself (for faster inference) versus binary quantization of the output embeddings (for faster retrieval). For the first one, there aren't great options out of the box to my knowledge, and for the latter you can consider the Binary Passage Retrieval Loss (bpr_loss), which is compatible with Sentence Transformers.