-
Hi,
Although the current generate.py uses the fast WaveNet generation algorithm, it is still too slow.
Is it possible to quantize the network?
However, the TF tutorial says we need to specify …
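As background for the question, here is a minimal sketch of what weight quantization does, using symmetric per-tensor int8 quantization in plain NumPy. This is illustrative only; it is not the TF tutorial's API, and the function names are mine.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: map the largest magnitude to 127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original float weights.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by about scale / 2 per element.
print(np.max(np.abs(w - w_hat)))
```

Storing `q` instead of `w` cuts weight memory 4x versus float32, which is why quantization is a common fix for slow or memory-bound generation.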
-
### Problem Description
Seeing a GPU fault when running the onnxruntime-inference-examples script with reduced-layer BERT models during benchmarking.
It appears quantization/calibration steps work …
-
Hi, I've been following your tutorial to compile Llava 1.5 7B VLM and I was able to compile everything successfully. However, when I run the app, I get the following error:
![image](https://github.…
-
xft version:1.8.2
lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 48 bits virtual
Byte Order: Little End…
-
### Description of the bug:
Hello,
I'm encountering an issue when trying to export a model to tflite with quantization. It appears that the tensor shapes are being altered incorrectly somewher…
-
I am running torchao: 0.5 and torch: '2.5.0a0+b465a5843b.nv24.09' on an NVIDIA A6000 ADA card (sm89) which supports FP8.
I ran the generate.py code from the benchmark:
python generate.py --c…
-
@casper-hansen Hi, I have a question about the AWQ-quantized model on HuggingFace: https://huggingface.co/TheBloke/Llama-2-7B-AWQ/tree/main?show_file_info=model.safetensors
The shapes o…
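For context on why AWQ tensor shapes look "shrunk": in 4-bit GEMM-style AWQ checkpoints (as produced by AutoAWQ), several 4-bit values are packed into each int32 element, so `qweight` is stored with its output dimension divided by the pack factor. The sketch below is my reading of that packing scheme, not something read from this particular file:

```python
def packed_qweight_shape(in_features, out_features, w_bit=4):
    # 32 // w_bit quantized values fit into one int32 column,
    # so the stored tensor is narrower than the logical weight.
    pack_factor = 32 // w_bit
    return (in_features, out_features // pack_factor)

# e.g. a logical 4096 x 4096 linear layer stores qweight as (4096, 512) int32
print(packed_qweight_shape(4096, 4096))
```

So a shape that looks 8x too small along one axis is expected for 4-bit packing, not a corrupted checkpoint.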
-
Hello, thank you for your work; it has helped me a lot.
However, GPU memory usage still puts me under pressure.
I wonder if the model can be quantized further, e.g. to 2-bit? Can you provide a referenc…
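A back-of-envelope estimate of why lower bit-widths relieve memory pressure: weight memory scales linearly with bits per weight. The helper below is illustrative (names and the 7B parameter count are my assumptions, not from the project):

```python
def weight_memory_gb(n_params, bits_per_weight):
    # bytes = params * bits / 8; using 1 GB = 1e9 bytes
    return n_params * bits_per_weight / 8 / 1e9

# For a 7B-parameter model: 16-bit -> 14.0 GB, 8-bit -> 7.0 GB,
# 4-bit -> 3.5 GB, 2-bit -> 1.75 GB (weights only, excluding
# activations and KV cache).
for bits in (16, 8, 4, 2):
    print(bits, weight_memory_gb(7e9, bits))
```

Note that 2-bit halves memory again versus 4-bit, but accuracy typically degrades more sharply at that width.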
-
**Why is it that when using a quantized model for inference, the TTFT improvement is not obvious, while the overall inference efficiency improves a lot? At the same time, the inference efficiency…
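One common explanation, stated here as general reasoning rather than a confirmed diagnosis of this setup: TTFT is dominated by the compute-bound prefill pass, while per-token decode is memory-bandwidth-bound, since every weight must be read once per generated token. Weight-only quantization shrinks the bytes read, so it speeds decode far more than prefill. A rough sketch of the decode-side arithmetic (all numbers illustrative):

```python
def decode_ms_per_token(n_params, bytes_per_weight, bandwidth_bytes_per_s):
    # Memory-bound lower bound: each weight byte is streamed once per token.
    return n_params * bytes_per_weight / bandwidth_bytes_per_s * 1000

# Hypothetical 7B model on a GPU with ~900 GB/s memory bandwidth:
fp16 = decode_ms_per_token(7e9, 2.0, 900e9)   # ~15.6 ms/token
int4 = decode_ms_per_token(7e9, 0.5, 900e9)   # ~3.9 ms/token
print(fp16, int4)
```

By this estimate 4-bit weights give roughly a 4x decode speedup, while prefill (and hence TTFT) barely changes because it is limited by FLOPs, not bandwidth.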
-
Error occurred when executing Joy_caption_load:
No package metadata was found for bitsandbytes
File "E:\ComfyUI-aki-v1.3\execution.py", line 317, in execute
output_data, output_ui, has_subgraph…
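"No package metadata was found for bitsandbytes" usually means bitsandbytes is not installed (or its metadata is broken) in the specific Python environment ComfyUI runs in. A hedged way to confirm this from that environment; the check mirrors the condition the node trips over:

```python
import importlib.metadata

try:
    # This is the same metadata lookup that raised the error above.
    print(importlib.metadata.version("bitsandbytes"))
except importlib.metadata.PackageNotFoundError:
    # Install into the same interpreter ComfyUI uses, e.g.:
    #   python -m pip install bitsandbytes
    print("bitsandbytes metadata not found in this environment")
```

On portable/embedded ComfyUI installs, make sure `pip` targets the bundled interpreter, not a system-wide Python.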