NetEase-FuXi / EETQ

Easy and Efficient Quantization for Transformers
Apache License 2.0

Quantization takes a very long time #10

Open timohear opened 5 months ago

timohear commented 5 months ago

With TGI or Lorax, EETQ quantization takes several minutes (e.g. 10 minutes for Mixtral) every time the launcher is run.

For reference, a bitsandbytes NF4 quant takes 1 minute.

Is there any way to store the EETQ-quantized model and load it directly?

timohear commented 5 months ago

And thank you for EETQ, I've been wishing for high-speed 8-bit inference for quite some time :-)

SidaZh commented 5 months ago

@timohear Saving and loading a model quantized with EETQ is straightforward, for example:

import torch
from eetq.utils import eet_quantize

# Quantize the model in place, then persist the whole module with pickle.
eet_quantize(torch_model)
torch.save(torch_model, "xxx_eetq.pt")
...
# Later, load the quantized model directly, skipping re-quantization.
torch_model = torch.load("xxx_eetq.pt")
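
One caveat worth noting: torch.save on the whole module pickles references to the EETQ module classes, so eetq must be importable (and version-compatible) in whatever process later calls torch.load.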

But this has not been adapted in TGI yet; your suggestion is very useful. We can optimize and test loading pre-quantized EETQ models in TGI.

Narsil commented 2 months ago

Late to the party as I'm upgrading eetq at the moment. (TGI maintainer here).

We're not going to enable PyTorch pickle loading at all, but saving with safetensors is definitely an option. I think all we have to do is save the model as usual and add a quant_method: eetq entry under quantization_config in the model's config, and that's it.
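
A minimal sketch of what that could look like, assuming a model already quantized in place by EETQ and a transformers-style config.json sitting next to the weights (the exact keys a loader like TGI ends up reading are an assumption here):

import json
from safetensors.torch import save_file

# Save the (already quantized) weights as plain safetensors instead of pickle.
# safetensors requires contiguous tensors, hence the copy.
state_dict = {k: v.detach().cpu().contiguous() for k, v in model.state_dict().items()}
save_file(state_dict, "model.safetensors")

# Tag config.json so the loader knows to route these weights through the
# EETQ int8 kernels instead of treating them as ordinary fp16 weights.
with open("config.json") as f:
    config = json.load(f)
config["quantization_config"] = {"quant_method": "eetq"}
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

The appeal of this layout is that the weights file stays a vanilla safetensors checkpoint; only the config entry tells the server to interpret the tensors as EETQ-quantized, so no unsafe pickle loading is ever needed.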