-
### System Info
CPU: x86_64
GPU: NVIDIA A10
TensorRT branch: main
commit id: cad22332550eef9be579e767beb7d605dd96d6f3
CUDA:
NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: …
-
## Description
Using TRT-LLM to generate a LLaMA classification model engine. I have two similar scripts to generate the engine: the first is a raw script, the second is based on the example/llama/build.sh script.
Howev…
-
Hi,
Thanks for your outstanding work. I have tested the quantized model using the W4A16 kernel on the WikiText2 dataset. Specifically, the WikiText2 validation dataset is split into non-overlapping…
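For reference, a minimal sketch of how such a non-overlapping perplexity evaluation is commonly set up; the model id, the `datasets`/`transformers` loading path, and the 2048-token window are my assumptions, not details from the original report:

```python
# Hypothetical sketch: perplexity over non-overlapping windows of the
# WikiText-2 validation split. Model id and window size are placeholders.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; not from the report
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
).eval()

text = "\n\n".join(
    load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")["text"]
)
ids = tok(text, return_tensors="pt").input_ids

seq_len, nlls = 2048, []
for i in range(0, ids.size(1) - seq_len + 1, seq_len):  # non-overlapping
    chunk = ids[:, i : i + seq_len].to(model.device)
    with torch.no_grad():
        # labels == input_ids gives mean cross-entropy over the window
        nlls.append(model(chunk, labels=chunk).loss)
print("ppl:", torch.exp(torch.stack(nlls).mean()).item())
```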
-
### 🚀 The feature, motivation and pitch
I'm working on applications that must run locally on resource-limited HW. Therefore, quantization becomes essential. Such applications need multimodal vi…
-
The ViT used is BAAI's EVA-ViT; EVA adds a 2D rotary position embedding (RoPE) on top of the ViT. Can this 2D positional encoding better handle higher-resolution images?
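For intuition, here is a minimal sketch of what a 2D RoPE over a patch grid looks like; this is my reading of the usual construction (half of each head's dimensions rotate with the patch row index, the other half with the column index), not code from the EVA source:

```python
# Hedged sketch of 2D RoPE: per-axis 1D rotations over a patch grid.
import torch

def rope_1d(x, pos, base=10000.0):
    # x: (..., d) with d even; pos: (...,) integer positions along one axis
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = pos[..., None].float() * freqs          # (..., d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    # x: (n_patches, d); rows/cols: (n_patches,) grid coordinates.
    # First half of dims encodes the row axis, second half the column axis,
    # so relative position extrapolates independently per axis.
    d = x.shape[-1] // 2
    return torch.cat([rope_1d(x[:, :d], rows), rope_1d(x[:, d:], cols)], dim=-1)

# Example: a 4x4 patch grid with 64-dim queries
h = w = 4
r, c = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
q = torch.randn(h * w, 64)
q_rot = rope_2d(q, r.flatten(), c.flatten())
```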
-
### The model to consider.
Thanks to the vllm team for their efforts.
I am currently preparing to optimize the inference performance of WeMM; the link is provided below.
https://huggingface.co/f…
-
This pattern of mixing numpy and MLX inside the model's forward pass will really slow things down: it forces a synchronization at each layer and breaks asynchronous evaluation:
https://github.com/Blaiz…
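To illustrate the concern (a hedged sketch, not the code at the link above): converting a lazy MLX array to numpy mid-forward forces evaluation at every layer, while staying in MLX keeps the whole graph lazy until one final evaluation:

```python
import mlx.core as mx
import numpy as np

def forward_mixed(x, weights):
    # Anti-pattern: np.array(h) forces MLX to evaluate the pending graph
    # on every iteration, serializing the forward pass layer by layer.
    h = x
    for w in weights:
        h = np.tanh(np.array(h) @ np.array(w))  # sync point each layer
        h = mx.array(h)
    return h

def forward_mlx(x, weights):
    # Pure MLX: ops stay lazy and can be evaluated together at the end.
    h = x
    for w in weights:
        h = mx.tanh(h @ w)
    return h

weights = [mx.random.normal((64, 64)) for _ in range(8)]
x = mx.random.normal((1, 64))
out_slow = forward_mixed(x, weights)  # eight forced synchronizations
out_fast = forward_mlx(x, weights)
mx.eval(out_fast)                     # single evaluation of the fused graph
```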
-
Attempting to generate with Mistral Small causes this error:
```
---------------------------------------------------------------------------
RuntimeError Traceback (most r…
```
-
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Current Behavior
From what I am seeing, for the standard MultiHeadAttention there is a procedure:
input -> …
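For reference, a generic sketch of the standard MultiHeadAttention pipeline the report appears to describe; the procedure above is truncated, so the exact steps here are my assumptions:

```python
# Standard MHA: input -> Q/K/V projections -> split heads
# -> scaled dot-product attention -> merge heads -> output projection.
import torch
import torch.nn.functional as F

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    b, t, d = x.shape
    hd = d // n_heads
    def split(y):  # (b, t, d) -> (b, n_heads, t, hd)
        return y.view(b, t, n_heads, hd).transpose(1, 2)
    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    att = F.softmax(q @ k.transpose(-2, -1) / hd ** 0.5, dim=-1)
    out = (att @ v).transpose(1, 2).reshape(b, t, d)  # merge heads
    return out @ w_o
```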
-
This can be reproduced by cloning the latest Megatron-LM and enabling transformer_engine for `--transformer-impl` instead of using the local implementation.
The experiments are run in a `nvcr.io/nvidia/pyt…