wboleksii opened this issue 4 years ago:

Hi,

Tensors in the saved `state_dict` have dtype `float32`. Is there a reason for that? I was able to cut the size of `pytorch_model.bin` roughly in half by converting the tensors to float16 without losing any accuracy. I'm using `distilbert-base-nli-mean-tokens`.
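For reference, the conversion can be done with a few lines of PyTorch along these lines (a minimal sketch; the file names are placeholders):

```python
# Minimal sketch of the conversion described above; file names are placeholders.
import torch

state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# Cast floating-point tensors to float16; leave everything else untouched.
fp16_state_dict = {
    name: t.half() if t.is_floating_point() else t
    for name, t in state_dict.items()
}

torch.save(fp16_state_dict, "pytorch_model_fp16.bin")  # roughly half the size
```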
Hi @wboleksii, there was / is no specific reason. I used float32 because it is the default. Converting tensors to float16 can sometimes decrease the accuracy slightly.

But thanks for pointing this out. I will run some experiments with float16 and maybe release corresponding models. Did you observe any changes in the run time when using float16 instead of float32?
Best
Nils Reimers
@nreimers,

> Converting tensors to float16 can sometimes decrease the accuracy slightly.

Yes, but it's very minimal. In terms of cosine similarity, I'm getting at least 6 matching significant digits. I guess that's pretty good.

> Did you observe any changes in the run time when using float16 instead of float32?

Yes, it's faster and it also has a much lower memory footprint. I'm trying to optimize this model for a low-memory device.
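One way to reproduce this comparison on CPU is to round-trip the weights through float16 and back, which applies the fp16 rounding while keeping the fp32 kernels (just a sketch; the sentences are placeholders):

```python
# Sketch: measure how much fp16 rounding of the weights changes the embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distilbert-base-nli-mean-tokens")
sentences = ["A quick sanity check.", "Comparing fp32 and fp16 weights."]

emb_fp32 = model.encode(sentences)

# Round-trip every weight through float16 and back to float32, so the
# comparison runs on CPU while still reflecting the fp16 rounding error.
for p in model.parameters():
    p.data = p.data.half().float()

emb_fp16 = model.encode(sentences)

for a, b in zip(emb_fp32, emb_fp16):
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    print(f"cosine similarity: {cos:.8f}")  # typically ~6 matching digits
```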
It would be nice to have some kind of script for generic model quantization, so you don't have to keep so many models up to date. I believe TensorFlow Lite does something like this.
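For what it's worth, PyTorch's built-in dynamic quantization already offers a generic post-training path similar in spirit to TensorFlow Lite. A rough sketch (CPU-only; the model name and sentence are just examples):

```python
# Sketch of generic post-training quantization with PyTorch (CPU-only).
import torch
from torch.nn import Linear
from torch.quantization import quantize_dynamic
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distilbert-base-nli-mean-tokens")

# Swap Linear layers for int8 dynamically-quantized versions; no retraining.
q_model = quantize_dynamic(model, {Linear}, dtype=torch.qint8)

embeddings = q_model.encode(["Dynamic int8 quantization on CPU."])
```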
@wboleksii Sounds good, will give it a try.
@wboleksii Could you share a code snippet of how you converted your model to fp16? I'm also interested in a lower memory footprint and faster inference time on GPUs with tensor cores.
@nreimers Have you done any experimentation with this? I'm very interested in speeding up inference using FP16 compute.
It seems like you can just call `.half()` on a `SentenceTransformer` and it will use FP16, giving you a nice speedup and memory savings. The resulting embeddings are very close to those of the full FP32 model.

The embeddings returned still have dtype `float32`, though, so they must be converted back internally. I would prefer if the original float16 outputs were returned instead.
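A minimal sketch of this approach, assuming a CUDA GPU (most CPU ops lack float16 kernels); the model name is just an example:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# fp16 inference generally needs a GPU; most CPU ops have no float16 kernels.
model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cuda")
model.half()  # cast all weights and buffers to float16 in place

embeddings = model.encode(["FP16 inference with tensor cores."])
print(embeddings.dtype)  # float32 -- encode converts the outputs internally

# To get float16 outputs (e.g. to halve storage), cast after the fact:
embeddings_fp16 = embeddings.astype(np.float16)
```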