I want to run this on an R7-6800H laptop using the CPU, but I do not have an NVIDIA card to run the quantization Python code. Can someone provide an int8 or int16 version of the quantized model, or give me some instructions on how to produce one so the CPU can do the work?
P.S. Should the model work in llama.cpp?
Expected Behavior
No response
Steps To Reproduce
Have a laptop without an NVIDIA GPU and run the quantization Python code.
Environment
- OS: Windows 11
- Python: 3.10.9
- Transformers: per requirements.txt
- PyTorch: per requirements.txt
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): False
This might be what you are looking for: https://huggingface.co/THUDM/chatglm-6b-int4
Load it on the CPU:
from transformers import AutoModel
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).float()
The .float() call keeps the weights in float32, since CPU inference does not handle half precision well. About 8 GB of RAM is required.
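The 8 GB figure is plausible from simple arithmetic. A rough sketch, assuming about 6.2 billion parameters (an assumption based on the "6B" model name; the exact count may differ):

```python
def approx_weight_gib(n_params: float, bits: int) -> float:
    # Memory for the weights alone; activations and runtime overhead add more.
    return n_params * bits / 8 / 2**30

# ~6.2e9 parameters is an assumption based on the "6B" in the model name.
for bits in (4, 8, 16):
    print(f"int{bits} weights: {approx_weight_gib(6.2e9, bits):.1f} GiB")
# int4 weights: 2.9 GiB
# int8 weights: 5.8 GiB
# int16 weights: 11.5 GiB
```

So int4 weights (~2.9 GiB) plus runtime overhead fit in 8 GB of RAM, int8 would be tight, and a 16-bit model would not fit at all.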
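On producing an int8 model on the CPU yourself: PyTorch's dynamic quantization API runs entirely on the CPU, no CUDA required. A minimal sketch on a toy model, not the repository's actual quantize script, and whether it applies cleanly to this model is untested:

```python
import torch
import torch.nn as nn

# Toy network standing in for a real model; dynamic quantization converts
# nn.Linear weights to int8 and runs inference on the CPU, no CUDA needed.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

out = qmodel(torch.randn(1, 16))
print(tuple(out.shape))  # (1, 4)
```

Dynamic quantization stores int8 weights and quantizes activations on the fly, which is why it needs no GPU and no calibration pass.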