-
Is it possible to add the code copy widget that you already have on https://nvidia.github.io/TensorRT-Model-Optimizer/ to https://nvidia.github.io/TensorRT-LLM/?
For example if you go to https://nvidi…
-
When I run inference with the Llama 7B model after int8 quantization, my throughput is only around 42 tokens/s, far lower than the 155 tokens/s stated in the documentation. Below is my executi…
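Throughput gaps like 42 vs. 155 tokens/s often come down to how tokens/s is measured (warm-up, batch size, what counts as a generated token). A minimal wall-clock sketch, where `generate` is a hypothetical stand-in for the engine's generate call, not the TensorRT-LLM API:

```python
import time

def tokens_per_second(generate, prompt, max_new_tokens=128):
    """Wall-clock tokens/s for a single generation call.

    `generate` is a placeholder callable (assumption) that takes a prompt
    and a token budget and returns the list of generated token ids.
    """
    start = time.perf_counter()
    out = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(out) / elapsed
```

Measuring a single cold call this way will understate steady-state throughput; averaging over several warmed-up runs gives numbers closer to what documentation benchmarks report.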
-
I'm looking to do a lot of image quantization and have been searching for fast alternatives to K-means etc. Then I saw that there is a CUDA implementation of NeuQuant, although it is from 2011.
[Pap…
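For context, the K-means baseline that NeuQuant is usually compared against can be sketched in a few lines (a toy pure-Python color quantizer for illustration, not the CUDA NeuQuant implementation):

```python
import random

def kmeans_palette(pixels, k, iters=10, seed=0):
    """Toy K-means color quantizer.

    pixels: list of (r, g, b) tuples; returns k palette colors.
    Illustrates the baseline that faster quantizers compete with.
    """
    rng = random.Random(seed)
    centers = rng.sample(pixels, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pixels:
            # assign each pixel to its nearest center (squared Euclidean distance)
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        for j, c in enumerate(clusters):
            if c:  # recompute the center as the per-channel mean of its cluster
                centers[j] = tuple(sum(ch) / len(c) for ch in zip(*c))
    return centers
```

The per-iteration cost is O(pixels × k), which is exactly what makes K-means slow on large images and why GPU-side alternatives are attractive.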
-
Error occurred when executing Joy_caption_load:
No package metadata was found for bitsandbytes
File "E:\ComfyUI-aki-v1.3\execution.py", line 317, in execute
output_data, output_ui, has_subgraph…
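The error above means the loader could not find pip metadata for bitsandbytes, which usually indicates a missing install or a different Python environment than the one ComfyUI is running in. A minimal diagnostic sketch using only the standard library:

```python
# Check whether package metadata is visible to the running interpreter.
# "No package metadata was found for <pkg>" comes from exactly this kind
# of lookup failing.
from importlib import metadata

def check_package(name):
    """Return the installed version string, or None if metadata is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None
```

Running `check_package("bitsandbytes")` inside the same environment that launches ComfyUI quickly shows whether the package is installed where the node expects it.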
-
I would like to call your attention to the fact that this patent for optimizing [image-specific] quantization tables has expired:
https://www.google.com/patents/US5724453
The paper can be found here:
http://w…
-
## Goal
- `cortex model pull` should have clear APIs that support different model repo sources
- e.g. Huggingface, Cortex Hub
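One way the repo-source dispatch described above could be sketched (hypothetical naming scheme and function names, not the actual Cortex API):

```python
# Hypothetical sketch: route `cortex model pull` by repo source based on a
# prefix in the model id. Ids without a prefix default to the Cortex Hub
# (an assumption for illustration, not documented behavior).
def parse_source(model_id):
    """Map "huggingface:org/name" to ("huggingface", "org/name")."""
    if ":" in model_id:
        source, path = model_id.split(":", 1)
        return source, path
    return "cortexhub", model_id
```

A prefix scheme like this keeps the CLI surface stable while new sources are added behind the parser.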
## Tasklist
- [x] #1393
- [x] #1394
- [x] #1395
- [ ] #1398
## CLI
…
-
### Is there an existing issue for this problem?
- [X] I have searched the existing issues
### Operating system
macOS
### GPU vendor
Apple Silicon (MPS)
### GPU model
_No response_
### GPU VRA…
-
### Describe the issue
I did QAT quantization on a CNN model; when I export it to an ONNX model, inference is slower than with the TorchScript QAT model.
The result is:
torchscript: 4.798517942428589 …
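Latency comparisons like this are sensitive to warm-up and averaging. A minimal timing-harness sketch, where `run_fn` is a placeholder for a session or model forward call (not the ONNX Runtime or TorchScript API):

```python
import time

def bench(run_fn, warmup=5, iters=50):
    """Average wall-clock latency of run_fn() after warm-up runs.

    `run_fn` is an assumed zero-argument callable wrapping one inference.
    Warm-up runs exclude one-time costs (lazy init, kernel selection)
    from the measured average.
    """
    for _ in range(warmup):
        run_fn()
    start = time.perf_counter()
    for _ in range(iters):
        run_fn()
    return (time.perf_counter() - start) / iters
```

Comparing the two backends with identical warm-up and iteration counts rules out one-time initialization cost as the source of the gap.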
-
I use `python==3.10.3`, `unstructured==0.15.12`
```
from unstructured.partition.pdf import partition_pdf
```
```
PS C:\Users\ProjectName\test.py"
Traceback (most recent call last):
File "…
-
## 🐛 Bug
I tried this on both the 23 ultra and the 24
## To Reproduce
1. Using any model, such as Qwen2_1_5B_q4f16_1, try to send a prompt.
I've tested many models and it seems to be model…