-
How hard would it be to write an inference engine based on exllama that supported tensor parallel, using the existing building blocks?
Assume the quantized weight tensors would need to be split acr…
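A minimal sketch of the column-parallel idea behind the question: each device holds a vertical slice of a weight matrix, computes a partial projection, and the shards are concatenated (an all-gather in a real multi-GPU setup). This uses NumPy on CPU purely for illustration; exllama's quantized kernels and any real communication layer (e.g. NCCL) are not shown.

```python
import numpy as np

def column_parallel_matmul(x, w, n_shards):
    """Simulate column-parallel tensor parallelism in one process.

    x: activations of shape (batch, d_in)
    w: full weight matrix of shape (d_in, d_out)
    Each "device" owns one column block of w; the partial outputs are
    concatenated, standing in for the all-gather step of a real
    multi-GPU implementation.
    """
    shards = np.split(w, n_shards, axis=1)      # one column block per device
    partials = [x @ shard for shard in shards]  # local matmul on each device
    return np.concatenate(partials, axis=1)     # "all-gather" of the outputs

# Sanity check: the sharded result matches the unsharded matmul.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
w = rng.standard_normal((8, 4))
assert np.allclose(column_parallel_matmul(x, w, 2), x @ w)
```

For quantized weights the same split applies, but each shard would also need its own slice of the quantization metadata (scales, zeros, group indices), which is the part the existing exllama building blocks don't currently expose per-shard.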
-
**Following the readme.md, I tried to run RAP for gsm8k using exllama, with the recommended instruction:**
`CUDA_VISIBLE_DEVICES=0,1 python examples/RAP/gsm8k/inference.py --base_lm exllama --exlla…
-
### OS
Windows
### GPU Library
CUDA 12.x
### Python version
3.12
### Describe the bug
Hi, thanks for the project. It supports EBNF grammars and JSON Schema; however, I am unable to use them.
I believe it is…
-
It's not clear from the documentation how to split VRAM over multiple GPUs with exllama.
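For reference, exllama's example scripts accept a `-gs`/`--gpu_split` argument listing how many GB of VRAM to allocate per GPU, and text-generation-webui exposes the same setting as `--gpu-split`. The model paths and GB figures below are placeholders; no test is included since these commands require GPUs and model weights.

```shell
# exllama itself: allow up to 16 GB of weights on GPU 0 and 24 GB on GPU 1
python test_benchmark_inference.py -d /path/to/model -gs 16,24

# text-generation-webui with the ExLlama loader, same split
python server.py --loader exllama --gpu-split 16,24
```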
-
I'm new to exllama, are there any tutorials on how to use this? I'm trying this with the llama-2 70b model.
-
We need to switch to exllama; everything I'm reading says exllama is better. At least for production we will need to switch. Speed is everything at the inference volume we expect. Note to try VLL…
-
ExLlama (https://github.com/turboderp/exllama)
It's currently the fastest and most memory-efficient executor of models that I'm aware of.
Is there an interest from the maintainers in adding this sup…
-
- https://github.com/turboderp/exllama
- https://github.com/oobabooga/text-generation-webui/blob/c7058afb402bd381d1983837b779c106217120b3/modules/exllama.py
-
----> 4 gptq_model = exllama_set_max_input_length(gptq_model, max_input_length=7504)
/usr/local/lib/python3.10/dist-packages/auto_gptq/utils/exllama_utils.py in exllama_set_max_input_length(model, …
-
/root/anaconda3/envs/chatglm3_v2/lib/python3.10/site-packages/awq/modules/linear/exllama.py:12: UserWarning: AutoAWQ could not load ExLlama kernels extension. Details: libcudart.so.12: cannot open sha…
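This warning means the dynamic linker cannot find the CUDA 12 runtime (`libcudart.so.12`), so AutoAWQ falls back from its ExLlama kernels. A common fix, assuming a CUDA 12.x toolkit is installed on the machine, is to put its `lib64` directory on the loader path; the install location below is an assumption, so adjust it to your system:

```shell
# Check whether the dynamic linker can currently find the CUDA 12 runtime
ldconfig -p | grep libcudart || true

# If a CUDA 12.x toolkit is installed but not on the path, expose it
# (the path below is an assumed default install location)
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH
```

Alternatively, installing a PyTorch build whose bundled CUDA runtime matches the version the AWQ kernels were compiled against avoids the need to set `LD_LIBRARY_PATH` at all.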