I just tried out ALMA-7B-R on my RTX 4070S and it's great! However, I wonder if it's possible to speed up inference further. Namely, are there any quantized versions? Can I use llama.cpp to run this?
Thanks
Thanks for your interest! There is an unofficial release on Hugging Face: https://huggingface.co/RichardErkhov/haoranxu_-_ALMA-13B-R-gguf
Please enjoy them :)
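For anyone landing here later: since that repo ships GGUF files, they should load with llama.cpp or its Python bindings. Below is a minimal sketch using llama-cpp-python; the GGUF filename is an assumption (check the repo's file list for the quantization level you want), and the prompt follows the translation format from the ALMA model card.

```python
# Minimal sketch: run a GGUF quantization of ALMA-13B-R with llama-cpp-python.
# pip install llama-cpp-python huggingface_hub
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download one GGUF file from the unofficial repo linked above.
# NOTE: the filename is hypothetical -- browse the repo and pick the
# quantization you want (e.g. a Q4_K_M variant for a 12 GB GPU).
model_path = hf_hub_download(
    repo_id="RichardErkhov/haoranxu_-_ALMA-13B-R-gguf",
    filename="ALMA-13B-R.Q4_K_M.gguf",  # hypothetical filename
)

# Offload all layers to the GPU; reduce n_gpu_layers if VRAM runs out.
llm = Llama(model_path=model_path, n_gpu_layers=-1, n_ctx=512)

# ALMA's translation prompt format, per the model card.
prompt = (
    "Translate this from English to German:\n"
    "English: The weather is nice today.\n"
    "German:"
)
out = llm(prompt, max_tokens=128, stop=["\n"])
print(out["choices"][0]["text"].strip())
```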