-
### System Info
CPU: x86_64
GPU: A10
OS: Ubuntu 22.04
### Who can help?
@Tracin @byshiue please help.
### Information
- [X] The official example scripts
- [ ] My own modified script…
-
### System Info
- CPU architecture: x86_64
- GPU properties
  - GPU name: NVIDIA A100
  - GPU memory size: 40 GB
- Libraries
  - TensorRT-LLM branch or tag: v0.10.0
  - Container used: yes, `ma…
-
### Is there an existing issue / discussion for this?
- [X] I have searched the existing issues / discussions
### Is this question answered in the FAQ? …
CCRss updated 1 month ago
-
Hi,
I tried QuantLinear from qlinear_cuda under auto_gptq.nn_modules.qlinear, but its performance is low with skinny matmuls (i.e., the matmul shapes at token generation).
Its performance is even worse than fp32,
for e…
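For context, a minimal back-of-the-envelope sketch (not tied to any particular kernel) of why token-generation matmuls behave this way: with M=1 they are memory-bound — almost every weight byte is read for only two FLOPs — so a quantized kernel that adds dequantization overhead can easily land below fp32. The 4096×4096 layer shape below is an illustrative assumption:

```python
# Arithmetic intensity (FLOPs per byte moved) of an M x K by K x N matmul.
# FLOPs = 2*M*K*N; bytes moved (fp16, naive) ~ 2 * (M*K + K*N + M*N).
def arithmetic_intensity(m, k, n, bytes_per_elem=2):
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Token-generation shape (M=1) vs. a prefill-like shape (M=512)
# for a hypothetical 4096x4096 weight matrix:
ai_decode = arithmetic_intensity(1, 4096, 4096)    # ~1 FLOP/byte: memory-bound
ai_prefill = arithmetic_intensity(512, 4096, 4096)  # hundreds of FLOPs/byte
```

At ~1 FLOP per byte, runtime is set by how fast the weights can be streamed from memory, not by the GPU's compute throughput, so any extra per-element dequantization work shows up directly in latency.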
-
Running in a container.
Using the GPU reports insufficient GPU memory (CUDA out of memory):
```shell
ERROR: Model running Error: CUDA out of memory. Tried to allocate 2.37 GiB. GPU 0 has a total capacty of 23.69 GiB of which 2.03 GiB is fr…
```
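As a rough capacity check (an illustrative estimate — the issue does not state the model size, so a 7B-parameter model is assumed here): weight memory is approximately parameter count times bytes per parameter, which makes it easy to see why a ~24 GiB card runs out once activations and the KV cache are added on top of fp16 weights:

```python
# Rough weight-memory estimate: n_params * bytes_per_param, in GiB.
def model_memory_gib(n_params, bytes_per_param):
    return n_params * bytes_per_param / 2**30

fp16_gib = model_memory_gib(7e9, 2)    # fp16/bf16: ~13 GiB of weights alone
int4_gib = model_memory_gib(7e9, 0.5)  # 4-bit quantized: ~3.3 GiB
```

With ~13 GiB of fp16 weights plus the KV cache and activation workspace, a 23.69 GiB GPU with only ~2 GiB free cannot satisfy a further 2.37 GiB allocation; loading the model quantized (or in fp16 if it was fp32) is the usual first remedy.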
-
A recurring feature request — provide automatic chemistry detection, at least in the case where we know that the input data is 10x. This would look something like passing `-c auto10x` and `simpleaf` …
rob-p updated 1 month ago
-
### System Info
```shell
optimum-habana 1.14.0.dev0
HL-SMI Version: hl-1.18.0-fw-53.1.1.1
Driver Version: 1.18.0-ee698fb
```
### Information
- [X] The off…
-
Hi, how do I improve the inference time of my Llama 2 7B model?
I also used BitsAndBytesConfig, but it does not seem to speed up inference.
code:
`name = "meta-llama/Llama-2-7b-cha…
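One caveat worth knowing here: bitsandbytes 4/8-bit quantization mainly reduces memory usage; per-token latency can stay the same or even get worse. Before optimizing, it helps to measure throughput in tokens per second. The helper below is a hypothetical stand-in (`tokens_per_second` and the `time.sleep` lambda are for illustration, not part of any library — in practice the callable would wrap `model.generate(...)`):

```python
import time

def tokens_per_second(generate_fn, n_tokens):
    """Time one generation call and report throughput in tokens/sec."""
    start = time.perf_counter()
    generate_fn()  # e.g. lambda: model.generate(**inputs, max_new_tokens=n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stand-in workload: pretend generating 50 tokens takes ~10 ms.
tps = tokens_per_second(lambda: time.sleep(0.01), n_tokens=50)
```

Measuring before and after each change (quantization, fp16 vs. fp32, batch size) shows which knob actually moves latency rather than just memory.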
-
It would be really nice to have a Functionary version of Llama 3.1 70B/8B!
-
### System Info
Ubuntu
### Reproduction
```python
model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bflo…
```