-
I used the following steps to build the SmoothQuant (SQ) engine.
First, build the Docker image from the main branch:
```
git clone -b main https://github.com/triton-inference-server/tensorrtllm_backend.git
# Update the su…
```
-
### Please describe your question
- https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#6-%E9%87%8F%E5%8C%96
When using llm for quantization, the docs call for the develop versions of PaddleSlim and PaddlePaddle, but after installing PaddlePaddle there is no paddle.fluid,
yet running the quantization scri…
-
**Describe the bug**
I implemented SmoothQuant INT8 inference for PyTorch with `CUTLASS` INT8 GEMM kernels, which are wrapped as PyTorch modules in [torch-int](https://github.com/Guangxuan-Xiao/torch…
-
I came across this error when building llama-2-7b-hf after converting it to the HF FasterTransformer format:
```
OSError: /llama/smooth_llama_7B/sq0.5/1-gpu does not appear to have a file named config.js…
```
-
Is AWQ 8-bit quantization supported?
-
### System Info
CentOS Linux release 7.9.2009
Nvidia A40 * 4
llama-2-13b-hf
TensorRT-LLM version: 0.11.0.dev2024061800
### Who can help?
_No response_
### Information
- [ ] The officia…
-
Hello,
after I couldn't use Ryzen AI on my Lenovo, I went back to my Minisforum UM790 Pro, where Ryzen AI is fortunately available on its 7940HS.
Your new examples are a great starting point. I al…
-
## Question
We are very interested in two post-training quantization papers from the HAN Lab!
SmoothQuant uses W8A8 for efficient GPU computation.
AWQ uses W4/3A16 for lower memory requirements and …
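For context, the core trick in SmoothQuant is a per-input-channel smoothing factor s_j = max|X_j|^α / max|W_j|^(1−α) that migrates activation outliers into the weights before both are quantized to INT8. Here is a minimal sketch of that scale computation; the function name and the α=0.5 default are illustrative, though the formula follows the paper.
```
import torch

def smooth_scales(act_absmax, weight, alpha=0.5):
    # act_absmax: per-input-channel activation abs-max from calibration, [in_features]
    # weight: linear weight, [out_features, in_features]
    w_absmax = weight.abs().amax(dim=0)                  # per input channel
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)  # s_j = max|X_j|^a / max|W_j|^(1-a)
    return s.clamp(min=1e-5)

# Folding the scales keeps the math exact: (X / s) @ (W * s).T == X @ W.T,
# but X / s has flatter channel magnitudes and quantizes to int8 with less error.
```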
-
How would it perform if this were done per-channel? Would that remove the need for the reorder step?
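For reference, per-channel weight quantization keeps one scale per output channel, so rows with very different magnitudes no longer share a single scale, which is the property that could make a magnitude-based reorder unnecessary. A minimal sketch, with illustrative names:
```
import torch

def quantize_per_channel(weight):
    # One symmetric int8 scale per output channel (row) of the weight.
    scales = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(weight / scales), -128, 127).to(torch.int8)
    return q, scales.squeeze(1)

# Each row carries its own scale, so outlier rows no longer force a shared
# scale on the rest, which is the motivation for reordering in per-group schemes.
```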
-
Just did a very simple run with llama-7b-4bit. It... took a while. I had it running in a screen session. But it worked!
```
root@FriendlyWrt /s/o/llama.cpp (master)# time ./main --color -m models/ggml-model-q4…
```