-
- [ ] [Guide to choosing quants and engines : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1anb2fz/comment/kprbduc/)
-
Looking at the weight values, we see that they are bfloat16.
Further, conversion to ternary is done at run-time (in FusedBitLinear).
To see if the model still worked with ternary weights, I re-wro…
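
Since the excerpt is about converting bfloat16 weights to ternary on the fly, here is a minimal sketch of such a run-time ternarization step. It uses the absmean scheme from BitNet b1.58, which is an assumption on my part; FusedBitLinear's exact logic may differ.

```python
import torch

def ternarize_absmean(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a float/bfloat16 weight matrix to {-1, 0, +1} at run-time.

    Assumption: per-tensor absmean scaling as in BitNet b1.58; the actual
    FusedBitLinear implementation may differ.
    """
    scale = w.abs().mean().clamp(min=eps)          # absmean scale
    w_ternary = (w / scale).round().clamp(-1, 1)   # values in {-1, 0, +1}
    return w_ternary, scale

w = torch.randn(4096, 4096, dtype=torch.bfloat16).float()
w_t, s = ternarize_absmean(w)
w_deq = w_t * s  # what the layer effectively multiplies activations by
```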
-
Hi!
Thanks for such a useful tool!
I have a question about `model_seqlen`:
As far as I can see, the default value in main.py is 4096. What if I use a smaller value, e.g. 1024, when quantizing a MoE Mixtral m…
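
For readers unfamiliar with the parameter, here is a hypothetical illustration of how GPTQ-style pipelines typically slice calibration text into model_seqlen-token samples; the function name is mine, not the tool's. The practical effect of a smaller value is more, shorter samples, each carrying less long-range context for the activation statistics.

```python
import torch

def make_calibration_batches(token_ids: torch.Tensor, model_seqlen: int):
    """Split one long token stream into fixed-length calibration samples.

    Illustrative only: with model_seqlen=1024 instead of 4096 you get 4x as
    many samples from the same stream, but each one is shorter.
    """
    n = token_ids.numel() // model_seqlen
    return token_ids[: n * model_seqlen].reshape(n, model_seqlen)

stream = torch.randint(0, 32000, (16384,))           # fake tokenized corpus
print(make_calibration_batches(stream, 4096).shape)  # torch.Size([4, 4096])
print(make_calibration_batches(stream, 1024).shape)  # torch.Size([16, 1024])
```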
-
Is multi-GPU inference on RTX 4090s supported (the model is Qwen-72B fine-tuned with QLoRA)? I can fine-tune the Qwen-72B model normally via FSDP+QLoRA; I'd like to ask how to deploy it for inference on RTX 4090s.
I tried the following script for multi-GPU inference:
```
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch --config_file fsdp_config.y…
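```

One route that often works for this setup, offered as an assumption rather than the repo's documented method, is to merge the QLoRA adapter into the base weights and then serve the merged model with vLLM's tensor parallelism across the four 4090s (paths below are placeholders):

```python
from peft import AutoPeftModelForCausalLM

# Merge the QLoRA adapter into the Qwen-72B base weights.
model = AutoPeftModelForCausalLM.from_pretrained("out/qwen72b-qlora")
merged = model.merge_and_unload()
merged.save_pretrained("out/qwen72b-merged")

from vllm import LLM, SamplingParams

# Serve the merged checkpoint with tensor parallelism over 4 GPUs.
llm = LLM(model="out/qwen72b-merged", tensor_parallel_size=4)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```

Note that an unquantized 72B model does not fit in 4 x 24 GB, so in practice the merged checkpoint usually has to be re-quantized (e.g. GPTQ or AWQ) before vLLM can serve it on 4090s.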
-
Dear Authors,
Thanks for your outstanding work. I like it and have learned a lot from it!
I tried to reproduce the weight-only quantization results in Table 5. However, I obtained some results tha…
-
First of all, thank you for your work. I've been able to run many models locally with exllamav2, a highly efficient inference library.
Recently, I tried to use exllamav2-0.0.21 …
-
Hi @czhu95 ,
Thanks for providing the code!
Recently I used your code to ternarize a ResNet-18 on CIFAR-10. First, I used tensorpack to train a ResNet-18 to a validation error of 0.083. However, …
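
I don't know which ternarization rule this repo uses, but for comparison, the classic Ternary Weight Networks rule (an assumption here, not necessarily what the code under discussion does) thresholds at 0.7 times the mean absolute weight:

```python
import torch

def twn_ternarize(w: torch.Tensor) -> torch.Tensor:
    """Ternary Weight Networks (Li & Liu, 2016) style ternarization.

    Assumption: threshold delta = 0.7 * E|W|; the repo may use another rule.
    """
    delta = 0.7 * w.abs().mean()      # ternarization threshold
    mask = w.abs() > delta            # positions kept as +/-1
    alpha = w.abs()[mask].mean()      # scale fitted over the kept weights
    w_t = torch.zeros_like(w)
    w_t[mask] = torch.sign(w[mask])
    return alpha * w_t

w = torch.randn(64, 64)
print(twn_ternarize(w).unique())      # values in {-alpha, 0, +alpha}
```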
-
Because vllm-gptq does not have issues enabled, I am raising the issue here.
https://mobiusml.github.io/hqq_blog/
HQQ is a fast and accurate model quantizer that skips the need for calibration data. It's super …
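
For anyone who wants to try it before vLLM gains support, quantizing a model with HQQ looks roughly like the following. This follows my reading of the HQQ README from that period, so treat the class names and defaults as assumptions rather than a verified API:

```python
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import BaseQuantizeConfig

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder model
model = HQQModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# No calibration set needed: HQQ solves for the quantization parameters
# directly from the weights (half-quadratic optimization).
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config)
```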
-
I am using the latest vLLM Docker image, trying to run the Mixtral 8x7B model quantized in AWQ format. I got the error message below:
```
INFO 12-24 09:22:55 llm_engine.py:73] Initializing an LLM engine …
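```

For reference, pointing vLLM's Python API at an AWQ checkpoint looks like the sketch below; the model path is a placeholder, and whether this avoids the error above depends on the full traceback (Mixtral+AWQ support in vLLM was very new at the time):

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized Mixtral checkpoint ("path/to/mixtral-awq" is a placeholder).
llm = LLM(model="path/to/mixtral-awq", quantization="awq", tensor_parallel_size=2)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```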
-
```
(.venv) (base) mikekg@mikekg-mbp torchchat % # Llama 3 8B Instruct
python3 torchchat.py chat llama3
zsh: command not found: #
Using device=cpu Apple M1 Max
Loading model...
Time to load model: 10…
```