-
I have a Gemma 2 9B model that I quantized with AWQ 4-bit; the resulting model is 5.9 GB. I set kv_cache_free_gpu_mem_fraction to 0.01 and run Triton on a single A100, but Triton takes 10748 MiB of GPU memory. I expe…
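A back-of-the-envelope sketch of where the memory could be going. Note that in TensorRT-LLM, `kv_cache_free_gpu_mem_fraction` is applied to the GPU memory that is *free after* the engine is loaded, and the runtime also allocates activation/workspace buffers on top of the weights. All of the overhead numbers below are illustrative assumptions, not measured values:

```python
# Rough estimate of Triton/TensorRT-LLM GPU usage for an AWQ-4bit Gemma 2 9B
# engine on an A100 80GB. Overhead figures are assumptions for illustration.

MIB_PER_GIB = 1024

model_weights_mib = 5.9 * MIB_PER_GIB   # ~5.9 GB quantized checkpoint (from the issue)
cuda_context_mib = 600                  # CUDA context + library handles (assumption)
activation_ws_mib = 2500                # activation/workspace buffers (assumption)

a100_total_mib = 80 * MIB_PER_GIB
free_after_load = a100_total_mib - model_weights_mib - cuda_context_mib - activation_ws_mib

# The fraction is taken of *free* memory after load, not of total memory.
kv_cache_mib = 0.01 * free_after_load

total_mib = model_weights_mib + cuda_context_mib + activation_ws_mib + kv_cache_mib
print(f"estimated usage: {total_mib:.0f} MiB")
```

Under these assumptions the total lands near the observed ~10.7 GiB, i.e. most of the gap over the 5.9 GB of weights is fixed runtime overhead rather than KV cache.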
-
When I try to run the start script, I get this error message:
```
(h2ogpt) C:\Users\domin\Documents\aiGen\h2ogpt>python generate.py
Fontconfig error: Cannot load default config file: No such file: …
-
I used AWQ to build the CodeLlama-13b quantized .npz model file into TensorRT format, but encountered this error. My command was as follows:
python build.py --model_dir /app/models/CodeLlama-13b-hf/ \…
-
### Search before asking
- [X] I searched the [issues](https://github.com/ray-project/kuberay/issues) and found no similar issues.
### KubeRay Component
ray-operator, apiserver
### What happened …
-
**Describe the bug**
C# Version 0.5.0 broke DML models, such as microsoft--Phi-3-mini-4k-instruct-onnx directml-int4-awq-block-128.
The model loads, but the Generator's constructor throws an Access vi…
-
### 📚 The doc issue
The documentation mentions that enabling search-scale and batch-size can improve accuracy. What is the difference between enabling search-scale and the default (off)? From reading the code, my understanding is that search-scale uses a grid search similar to AWQ in the paper, while the default (off) path is SmoothQuant — or does it just skip the grid search, with the default scale = 0.…
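For reference, the AWQ-style grid search the question alludes to can be sketched as follows. This is a simplified illustration, not the library's actual implementation: it searches an exponent `alpha` over a grid, scales weight columns by per-channel activation magnitudes raised to `alpha`, and keeps the value that minimizes output error after fake quantization (`alpha = 0` reduces to no scaling, which is the analogue of the default path):

```python
import numpy as np

def fake_quant(w, n_bits=4):
    """Symmetric per-tensor round-to-nearest quantize/dequantize (simplified)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

def search_scale(w, x, grid=20, n_bits=4):
    """AWQ-style grid search over alpha in [0, 1].

    w: weight matrix (in_features, out_features)
    x: calibration activations (batch, in_features)
    Returns the alpha whose activation-aware scaling minimizes output MSE.
    """
    act_max = np.abs(x).max(axis=0) + 1e-8   # per-input-channel activation magnitude
    y_ref = x @ w                            # full-precision reference output
    best_alpha, best_err = 0.0, np.inf
    for i in range(grid + 1):
        alpha = i / grid
        s = act_max ** alpha                 # alpha=0 -> s=1 (no scaling)
        w_q = fake_quant(w * s[:, None], n_bits)
        y = (x / s) @ w_q                    # fold inverse scale into activations
        err = np.mean((y - y_ref) ** 2)
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err

# Example with random calibration data (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))
w = rng.normal(size=(16, 8))
alpha, err = search_scale(w, x)
print(f"best alpha={alpha:.2f}, output MSE={err:.4g}")
```

Under this reading, disabling search-scale simply skips the loop and uses a fixed scaling, so the accuracy difference comes entirely from whether `alpha` is searched per layer.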
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch…
-
### Is there an existing issue / discussion for this?
- [X] I have searched the existing issues / discussions
### Is this question answered in the FAQ? | Is there an existing ans…
-
**Describe the bug**
When I run the example from examples/python/awq-quantized-model.md, but switching out phi-3 for llama-3.2-3b, I get an error message stating that `AttributeError: 'NoneType' objec…
-
### System Info
CPU x86_64
GPU NVIDIA L20
TensorRT branch: v0.8.0
CUDA: NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.3
### Who can help?
@Tracin
### Information
- [X…