efeslab / Atom

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

[Major] Add support for Mixtral8x7b #16

Closed. cylinbao closed this 7 months ago.

cylinbao commented 7 months ago

Add simulated quantization for Mixtral8x7b. One major difference from Llama is that we move the activation quantization to after the gate operation of the SparseMoeBlock. I also update the transformers library to version 4.39.0 for better support of the Mixtral model. Currently, we get a perplexity of 4.41 on WikiText-2 with W4A4 quantization.
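
For illustration, here is a minimal sketch of what "quantizing activations after the gate" can look like in a Mixtral-style MoE block. This is not Atom's actual implementation: the module and helper names (`SimpleSparseMoeBlock`, `fake_quantize_per_token`) are hypothetical, weight quantization is omitted, and the experts are simplified to plain MLPs rather than Mixtral's gated SwiGLU.

```python
# Sketch only: router runs on full-precision activations; tokens are
# fake-quantized (quantize + dequantize) to 4 bits only after routing.
# All names here are illustrative stand-ins, not Atom code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def fake_quantize_per_token(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric per-token fake quantization: round to n_bits, then dequantize."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale


class SimpleSparseMoeBlock(nn.Module):
    """Toy Mixtral-style MoE block with top-2 routing and simulated A4 quantization."""

    def __init__(self, hidden: int, ffn: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # The router (gate) itself sees unquantized activations.
        self.gate = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden, ffn, bias=False),
                nn.SiLU(),
                nn.Linear(ffn, hidden, bias=False),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])          # (n_tokens, hidden)
        logits = self.gate(tokens)                   # route on full-precision tokens
        weights, idx = torch.topk(F.softmax(logits, dim=-1), self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        # Activation quantization happens here, *after* the gate operation.
        q_tokens = fake_quantize_per_token(tokens, n_bits=4)

        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(q_tokens[mask])
        return out.reshape(x.shape)
```

The point of this ordering is that the top-k expert selection is computed from unquantized activations, so routing decisions are not perturbed by 4-bit rounding; only the tokens fed into the selected experts see the simulated low-bit quantization.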