dvmazur / mixtral-offloading

Run Mixtral-8x7B models in Colab or consumer desktops
MIT License

exl2 #4

Open eramax opened 6 months ago

eramax commented 6 months ago

Using exl2 at 2.4 bpw you can run Mixtral on Colab. Did you give it a try?

dvmazur commented 6 months ago

Hey! We are currently looking into other quantization approaches, both to improve inference speed and LM quality. How good is exl2's 2.4-bit quantization? 2.4 bits per parameter sounds like it would degrade perplexity quite a bit. Could you provide any links so we can look into it?

eramax commented 6 months ago

@dvmazur I made this example for you: https://gist.github.com/eramax/b6fc0b472372037648df7f0019ab0e78. One note: a Colab T4 with 15 GB of VRAM is not enough to hold the context for Mixtral-8x7B; with 16 GB it would work fine. We need some VRAM for the context besides the model itself, and the 2.4-bpw model loads in about 14.7 GB.
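
For reference, a minimal sketch of loading an EXL2-quantized Mixtral with exllamav2 follows the library's standard inference example. This is not the exact contents of the gist above; the model path, context length, and sampler settings are placeholder assumptions, and API details may differ between exllamav2 versions.

```python
# Illustrative exllamav2 inference sketch (not the gist above; paths and settings are placeholders).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/content/Mixtral-8x7B-exl2-2.4bpw"  # placeholder: directory with EXL2 2.4-bpw weights
config.prepare()
config.max_seq_len = 2048  # keep the context small so it fits next to ~14.7 GB of weights on a T4

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # load layer by layer, splitting across available GPU memory

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

print(generator.generate_simple("Mixture-of-experts models are", settings, 128))
```

On a 15 GB T4 the ~14.7 GB of 2.4-bpw weights leave very little headroom, which is why the context length has to stay small for generation to fit.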