-
### Your current environment
```
[root@localhost wangjianqiang]# python -m vllm.entrypoints.openai.api_server --model /root/wangjianqiang/deepseek-moe/deepseek-coder-33b-base/ --tensor-parallel-size 8 …
```
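Once the server is up it exposes an OpenAI-compatible endpoint; a minimal sketch of querying it, assuming the default port 8000 and the `openai` Python client (the model name must match the `--model` path above):

```python
# Minimal sketch: query a running vLLM OpenAI-compatible server.
# Assumes the default port 8000 and the `openai` client package.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM ignores the key unless one is configured
)

completion = client.completions.create(
    model="/root/wangjianqiang/deepseek-moe/deepseek-coder-33b-base/",
    prompt="def quicksort(arr):",
    max_tokens=64,
)
print(completion.choices[0].text)
```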
-
Hi,
I tried QAT on a model and exported the encodings. Then I used qnn-onnx-converter with --quantization_overrides and --input_list, trying to pass the post-QAT min/max/scale values into the converte…
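For reference, a minimal sketch of what a `--quantization_overrides` file might contain, built from Python for convenience. The tensor/parameter names and encoding values are placeholders, and the `activation_encodings` / `param_encodings` schema is an assumption based on the AIMET-style encodings format; verify it against your QNN SDK version's documentation.

```python
# Hypothetical sketch: build a quantization-overrides JSON for qnn-onnx-converter.
# Tensor names and encoding values are placeholders; the schema below follows
# the AIMET-style encodings format -- verify against your QNN SDK docs.
import json

overrides = {
    "activation_encodings": {
        "conv1_output": [  # placeholder tensor name from your ONNX graph
            {"bitwidth": 8, "min": -1.0, "max": 1.0,
             "scale": 0.007843, "offset": -128, "is_symmetric": "False"}
        ]
    },
    "param_encodings": {
        "conv1.weight": [  # placeholder parameter name
            {"bitwidth": 8, "min": -0.5, "max": 0.5,
             "scale": 0.003922, "offset": -128, "is_symmetric": "True"}
        ]
    },
}

with open("overrides.json", "w") as f:
    json.dump(overrides, f, indent=2)
```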
-
Now I have my own PyTorch and ONNX model.
How can I quantize it using Glow's Python API, and how can I then run inference on it in Glow?
Is there any clear documentation?
Thanks.
-
### System Info
- GPU: RTX 4090 * 4
- TensorRT-LLM: v0.8.0
- CUDA Version: 12.3
- NVIDIA-SMI: 545.29.06
### Who can help?
_No response_
### Information
- [X] The official example scripts
…
-
### Describe the issue
ONNX Runtime transformers benchmarking is failing for int8 quantized inference; the same works fine with onnxruntime 1.16.3. I have added the error details below.
I found the b…
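For anyone reproducing this, the int8 model in such a benchmark is typically produced with ONNX Runtime's dynamic quantization; a minimal sketch, with placeholder model paths:

```python
# Minimal sketch: produce an int8 model with ONNX Runtime dynamic quantization.
# Model paths are placeholders; this is the usual way an int8 transformer
# model is prepared before benchmarking.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model_fp32.onnx",   # placeholder path
    model_output="model_int8.onnx",  # placeholder path
    weight_type=QuantType.QInt8,     # quantize weights to signed int8
)
```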
-
When running `examples/quantization/basic_usage_gpt_xl.py` an error occurs during the model packing:
```
2023-05-22 04:08:34 INFO [auto_gptq.quantization.gptq] duration: 0.16880011558532715
2023-…
```
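For context, the packing step where this fails runs at the end of AutoGPTQ's `quantize()`. A minimal sketch of the flow that example script exercises, with a placeholder model name and calibration example rather than the script's exact contents:

```python
# Minimal sketch of the AutoGPTQ quantize-then-pack flow; model name and
# calibration data are placeholders, not the exact example-script contents.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name = "gpt2-xl"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
examples = [tokenizer("auto-gptq is a quantization library.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
model.quantize(examples)              # GPTQ per layer, then packing (where the error above occurs)
model.save_quantized("gpt2-xl-4bit")
```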
-
I'm trying to save an int4 quantized model. When I try to save it, I get this error:
```
Traceback (most recent call last):
  File "C:\Users\AI-Perf\Varsha\ipex-llm\pytho…
```
-
Hi, I'm trying to run Llamaspeak following the instructions at https://www.jetson-ai-lab.com/tutorial_llamaspeak.html
Specs:
Jetson Orin NX (16GB) Developer Kit
JetPack 6.0 [L4T 36.3.0]
The RI…
-
Only q4_0_4_4 GGUF models run on my Poco X6 Pro phone. CPU-Z says it has Cortex-A510 and Cortex-A715 cores, which support both i8mm and SVE. When I try to run a GGUF that needs those features, this happens:
~/…
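As a sanity check, one way to confirm which of those features the kernel actually reports is to scan `/proc/cpuinfo` on the device (e.g. from Termux); a small sketch:

```python
# Quick sketch: check whether the kernel reports i8mm / sve by scanning
# the "Features" lines in /proc/cpuinfo on an ARM device.
with open("/proc/cpuinfo") as f:
    features = set()
    for line in f:
        if line.lower().startswith("features"):
            features.update(line.split(":", 1)[1].split())

for feat in ("i8mm", "sve"):
    print(f"{feat}: {'yes' if feat in features else 'no'}")
```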
-
I tried to quantize a Llama model (Llama 13B) with SmoothQuant, and found that if I only quantize `LlamaDecoderLayer` then the accuracy does not drop even when directly quantizing weights and activations, bu…
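For reference, the core of SmoothQuant is a per-channel scale that migrates activation outliers into the weights before quantization, s_j = max|X_j|^α / max|W_j|^(1−α). A minimal sketch of that smoothing step for one linear layer (α = 0.5; names are placeholders):

```python
# Minimal sketch of SmoothQuant's smoothing step for one linear layer.
# s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); activations are divided by s
# at runtime and the weights are multiplied by s, so the product is unchanged
# while activation outliers shrink. Names are placeholders.
import torch

def smooth(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """act_absmax: per-input-channel |activation| max, shape [in_features];
    weight: nn.Linear weight, shape [out_features, in_features]."""
    w_absmax = weight.abs().amax(dim=0)  # per-input-channel weight max
    scales = (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)
    smoothed_weight = weight * scales    # fold the scales into the weights
    return scales, smoothed_weight       # divide the activations by `scales`
```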