nmandic78 opened 1 month ago
2 tokens/s is significantly lower than the speed benchmark we have measured, which is around 30 tokens/s.
Your environment may not be configured correctly. You can check whether `autoawq-kernel` is installed correctly. Notably, the dependencies for `autoawq`/`autoawq-kernel` are quite strict, and it appears that the latest version of `autoawq` only supports torch 2.3.1.
Maybe you can try setting up your environment with:

```shell
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/huggingface/transformers.git@27903de
pip install autoawq==0.2.6
```
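If it helps, a quick way to confirm the pins took effect is to compare the installed version strings against the targets. This is a minimal stdlib-only sketch; `version_tuple` and `satisfies_pin` are illustrative helpers, not part of any of these packages:

```python
def version_tuple(v: str) -> tuple:
    """Split a version string like '2.3.1+cu121' into (2, 3, 1), dropping the local tag."""
    return tuple(int(p) for p in v.split("+")[0].split(".") if p.isdigit())

def satisfies_pin(installed: str, pinned: str) -> bool:
    """True if the installed version matches the pinned version exactly."""
    return version_tuple(installed) == version_tuple(pinned)

# e.g. after the installs above, you could check:
#   import torch; assert satisfies_pin(torch.__version__, "2.3.1")
print(satisfies_pin("2.3.1+cu121", "2.3.1"))  # -> True
print(satisfies_pin("2.4.0", "2.3.1"))        # -> False
```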
Here, we use the `27903de` commit of `transformers` due to a recent bug. In short, #31502 causes the error `'torch.nn' has no attribute 'RMSNorm'` for torch<2.4. The corresponding fix has not yet been merged, so we install a version of `transformers` from before #31502.
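For context, the error surfaces because `torch.nn.RMSNorm` was only added in torch 2.4, so referencing it on torch 2.3.x raises an `AttributeError`. A hedged illustration of the failure mode using a stand-in namespace (this is not transformers' actual code):

```python
import types

def get_rmsnorm(nn_module):
    """Return nn_module.RMSNorm if it exists, else None (as on torch < 2.4)."""
    return getattr(nn_module, "RMSNorm", None)

old_nn = types.SimpleNamespace()                # stands in for torch.nn on 2.3.x
new_nn = types.SimpleNamespace(RMSNorm=object)  # stands in for torch.nn on 2.4+

print(get_rmsnorm(old_nn))  # -> None
print(get_rmsnorm(new_nn))  # -> <class 'object'>
```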
It seems there isn't a precompiled flash-attn wheel for cu121. If you prefer not to compile it yourself, you can simply do:

```python
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
)
```
Thank you for the response and advice.

I tried as you suggested. I was on torch-2.4.0 and autoawq-0.2.3, and am now on torch-2.3.1 and autoawq-0.2.6. I tried without flash_attn and the result is the same (46.81 sec for the example above, ~2 t/s). I see some warnings (`warnings.warn(f"AutoAWQ could not load ExLlama kernels extension. Details: {ex}")`) and I don't know if they have any influence. I tried again with flash_attn and got almost the same speed (43.40 sec, same input). Maybe I missed something, but I am happy to wait a week until all the interdependencies are resolved.
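Regarding the ExLlama warning: it usually means AutoAWQ is falling back to a slower kernel path. One way to see which extension modules import cleanly is a small probe like this (the module names listed are assumptions about AutoAWQ's extension packages, not verified):

```python
import importlib

def can_import(name: str) -> bool:
    """Return True if the named module imports without raising."""
    try:
        importlib.import_module(name)
        return True
    except Exception:
        return False

# Hypothetical extension module names -- adjust to your installed packages.
for mod in ("awq_ext", "exllama_kernels", "exllamav2_kernels"):
    print(mod, "ok" if can_import(mod) else "missing/broken")
```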
Anyway, great new VL model; besides my performance issues, the results are great!
I get ~44 seconds inference time using the demo code on Qwen2-VL-7B-Instruct-AWQ with a 774x772 image input, using flash_attn2, bfloat16, with an output of 104 tokens:

['The image shows a round wall clock with a black frame and a white face. The clock has black hour, minute, and second hands. The numbers from 1 to 12 are displayed in black, with each number being bold and easy to read. The clock is mounted on a black bracket that is attached to the wall. The brand name "VISIPLEX" is printed in blue at the bottom of the clock face. The background is plain and neutral, which makes the clock the focal point of the image.']

Inference time: 43.6408166885376

Is that expected performance? Am I missing something?

```python
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)
```
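For reference, the throughput figures quoted in this thread are just generated tokens divided by wall-clock generation time; a minimal sketch:

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput = generated tokens / wall-clock generation time."""
    return n_tokens / elapsed_s

# Using the numbers above: 104 tokens in ~43.64 s
print(round(tokens_per_second(104, 43.64), 2))  # -> 2.38
```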