Please check that this issue hasn't been reported before.
[X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
As mentioned in the docs, when quantizing Jamba it is recommended to exclude the Mamba blocks. The model card states:
You can easily quantize the model to 8-bit using bitsandbytes. In order to not degrade model quality, we recommend to exclude the Mamba blocks from the quantization:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Skip the Mamba blocks so they stay in full precision.
quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_skip_modules=["mamba"])
model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1",
                                             trust_remote_code=True,
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             quantization_config=quantization_config)
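For reference, here is a minimal sketch (not from the report) to check which Mamba sub-modules actually end up quantized, assuming model was loaded as above; matching module names on the substring "mamba" is an assumption about Jamba's layer naming:

import torch
import bitsandbytes as bnb

# Tally linear layers inside modules whose names contain "mamba":
# bnb.nn.Linear8bitLt means bitsandbytes quantized the layer,
# a plain torch.nn.Linear means it was skipped.
quantized, skipped = 0, 0
for name, module in model.named_modules():
    if "mamba" not in name:
        continue
    if isinstance(module, bnb.nn.Linear8bitLt):
        quantized += 1
    elif isinstance(module, torch.nn.Linear):
        skipped += 1
print(f"mamba linears quantized to 8-bit: {quantized}, left in full precision: {skipped}")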
Current behaviour
When loading Jamba with 8-bit quantization through axolotl, the entire model, including the Mamba blocks, is automatically quantized.
Steps to reproduce
...
Config yaml
No response
Possible solution
No response
Which Operating Systems are you using?
[X] Linux
[ ] macOS
[ ] Windows
Python Version
3.10
axolotl branch-commit
main
Acknowledgements
[X] My issue title is concise, descriptive, and in title casing.
[X] I have searched the existing issues to make sure this bug has not been reported yet.
[X] I am using the latest version of axolotl.
[X] I have provided enough information for the maintainers to reproduce and diagnose the issue.