UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Tensor dtype #79

Open wboleksii opened 4 years ago

wboleksii commented 4 years ago

Hi,

Tensors in the saved state_dict have dtype float32. Is there a reason for that? I was able to cut the size of pytorch_model.bin roughly in half by converting the tensors to float16 without losing any accuracy. I'm using distilbert-base-nli-mean-tokens.
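
A minimal sketch of that kind of conversion (the paths are illustrative and assume a locally downloaded copy of the model):

```python
import torch

# Load the FP32 weights, cast all floating-point tensors to float16, and save again.
# Paths are illustrative; point them at your local model directory.
state_dict = torch.load("distilbert-base-nli-mean-tokens/pytorch_model.bin", map_location="cpu")
state_dict = {k: (v.half() if v.is_floating_point() else v) for k, v in state_dict.items()}
torch.save(state_dict, "distilbert-base-nli-mean-tokens/pytorch_model.bin")
```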

nreimers commented 4 years ago

Hi @wboleksii, there was / is no specific reason. I used float32 because it is the default. Converting tensors to float16 can sometimes decrease the accuracy slightly.

But thanks for pointing this out. I will run some experiments with float16 and maybe release corresponding models. Did you observe any changes in the run time when using float16 instead of float32?

Best, Nils Reimers

wboleksii commented 4 years ago

@nreimers ,

> Converting tensors to float16 can sometimes decrease the accuracy slightly.

Yes, but the difference is very minimal. The cosine similarities agree to at least 6 significant digits, which I'd say is pretty good.
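
A rough way to check that agreement (the model name and sentences are just examples, and a CUDA device is assumed for the FP16 run):

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Compare the cosine similarity produced by FP32 and FP16 variants of the same model.
sentences = ["A man is eating food.", "A man is eating a piece of bread."]

model_fp32 = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cuda")
emb32 = model_fp32.encode(sentences, convert_to_tensor=True)

model_fp16 = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cuda").half()
emb16 = model_fp16.encode(sentences, convert_to_tensor=True)

print(F.cosine_similarity(emb32[0:1], emb32[1:2]).item())                  # FP32 similarity
print(F.cosine_similarity(emb16[0:1].float(), emb16[1:2].float()).item())  # FP16 similarity
```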

> Did you observe any changes in the run time when using float16 instead of float32?

Yes, it's faster and it also has a much lower memory footprint. I'm trying to optimize this model for a low-memory device.

It would be nice to have some kind of script for generic model quantization, so you don't have to keep so many models up to date. I believe TensorFlow Lite does something like this.
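
For what it's worth, PyTorch's dynamic quantization can be applied fairly generically to the linear layers of the underlying transformer. A rough, CPU-only sketch (not TensorFlow Lite; the model name is just an example, and the accuracy impact would need to be checked per model):

```python
import torch
from sentence_transformers import SentenceTransformer

# Quantize the nn.Linear layers of the whole model to int8 for CPU inference.
# This is a sketch; accuracy and speed should be verified on the target task and device.
model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
embeddings = quantized_model.encode(["An example sentence."])
```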

nreimers commented 4 years ago

@wboleksii Sounds good, will give it a try.

cpcdoy commented 4 years ago

@wboleksii Could you show a code snippet of how you've converted your model to fp16?

I'm also interested in a lower memory footprint and faster inference on GPUs with tensor cores.

sinking-point commented 2 years ago

@nreimers Have you done any experimentation with this? I'm very interested in speeding up inference using FP16 compute.

sinking-point commented 2 years ago

It seems like you can just call .half() on a SentenceTransformer and it will use FP16, giving you a nice speedup and memory savings. The resulting embeddings are very close to those of the full FP32 model.

The returned embeddings still have dtype float32 though, so they must be converted back internally. I would prefer it if the original float16 outputs were returned instead.
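
A minimal sketch of the .half() approach described above (the model name is just an example, and a CUDA device is assumed):

```python
from sentence_transformers import SentenceTransformer

# Load in FP32, then cast the weights to FP16 in place.
model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cuda")
model.half()

# Inference runs in FP16, but (per the observation above) the returned
# embeddings come back as float32, so encode() appears to cast them internally.
embeddings = model.encode(["An example sentence."])
print(embeddings.dtype)
```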