nmandic78 opened 1 month ago
2 tokens/s is significantly lower than the speed benchmark we have measured, which is around 30 tokens/s.
Your environment may not be configured correctly. You can check whether `autoawq-kernel` is installed correctly. Notably, the dependencies for `autoawq`/`autoawq-kernel` are quite strict, and it appears that the latest version of `autoawq` only supports torch 2.3.1.
Maybe you can try setting up your environment with:

```shell
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/huggingface/transformers.git@27903de
pip install autoawq==0.2.6
```
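If it helps, a quick way to confirm the pins took effect is to compare the installed version strings against the targets. This is a minimal stdlib-only sketch; `version_tuple` and `satisfies_pin` are illustrative helpers, not part of any of these packages:

```python
def version_tuple(v: str) -> tuple:
    """Split a version string like '2.3.1+cu121' into (2, 3, 1), dropping the local tag."""
    return tuple(int(p) for p in v.split("+")[0].split(".") if p.isdigit())

def satisfies_pin(installed: str, pinned: str) -> bool:
    """True if the installed version matches the pinned version exactly."""
    return version_tuple(installed) == version_tuple(pinned)

# e.g. after the installs above, you could check:
#   import torch; assert satisfies_pin(torch.__version__, "2.3.1")
print(satisfies_pin("2.3.1+cu121", "2.3.1"))  # -> True
print(satisfies_pin("2.4.0", "2.3.1"))        # -> False
```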
Here, we use the `27903de` commit of `transformers` due to a recent bug. In short, #31502 causes the error `'torch.nn' has no attribute 'RMSNorm'` for torch<2.4. The corresponding fix has not yet been merged, so we install a version of `transformers` from before #31502.
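For context, the error surfaces because `torch.nn.RMSNorm` was only added in torch 2.4, so referencing it on torch 2.3.x raises an `AttributeError`. A hedged illustration of the failure mode using a stand-in namespace (this is not transformers' actual code):

```python
import types

def get_rmsnorm(nn_module):
    """Return nn_module.RMSNorm if it exists, else None (as on torch < 2.4)."""
    return getattr(nn_module, "RMSNorm", None)

old_nn = types.SimpleNamespace()                # stands in for torch.nn on 2.3.x
new_nn = types.SimpleNamespace(RMSNorm=object)  # stands in for torch.nn on 2.4+

print(get_rmsnorm(old_nn))  # -> None
print(get_rmsnorm(new_nn))  # -> <class 'object'>
```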
It seems there isn't a precompiled flash-attn wheel for cu121. If you prefer not to compile it yourself, you can simply do:

```python
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
)
```
Thank you for the response and advice.

I tried as you suggested. I was on torch-2.4.0 and autoawq-0.2.3, and am now on torch-2.3.1 and autoawq-0.2.6. I tried without flash_attn and the result is the same (46.81 sec for the example above, ~2 t/s). I see some warnings (`warnings.warn(f"AutoAWQ could not load ExLlama kernels extension. Details: {ex}")`) and I don't know if they have any influence. I tried again with flash_attn and got almost the same speed (43.40 sec, same input). Maybe I missed something, but I am happy to wait a week until all the interdependencies are resolved.
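Regarding the ExLlama warning: it usually means AutoAWQ is falling back to a slower kernel path. One way to see which extension modules import cleanly is a small probe like this (the module names listed are assumptions about AutoAWQ's extension packages, not verified):

```python
import importlib

def can_import(name: str) -> bool:
    """Return True if the named module imports without raising."""
    try:
        importlib.import_module(name)
        return True
    except Exception:
        return False

# Hypothetical extension module names -- adjust to your installed packages.
for mod in ("awq_ext", "exllama_kernels", "exllamav2_kernels"):
    print(mod, "ok" if can_import(mod) else "missing/broken")
```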
Anyway, great new VL model; besides my performance issues, the results are great!
I get ~44 seconds inference time using the demo code on Qwen2-VL-7B-Instruct-AWQ with a 774x772 image input, using flash_attn2, bfloat16, with an output of 104 tokens:

['The image shows a round wall clock with a black frame and a white face. The clock has black hour, minute, and second hands. The numbers from 1 to 12 are displayed in black, with each number being bold and easy to read. The clock is mounted on a black bracket that is attached to the wall. The brand name "VISIPLEX" is printed in blue at the bottom of the clock face. The background is plain and neutral, which makes the clock the focal point of the image.']

Inference time: 43.6408166885376

Is that expected performance? Am I missing something?

```python
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)
```
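For reference, the throughput figures quoted in this thread are just generated tokens divided by wall-clock generation time; a minimal sketch:

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput = generated tokens / wall-clock generation time."""
    return n_tokens / elapsed_s

# Using the numbers above: 104 tokens in ~43.64 s
print(round(tokens_per_second(104, 43.64), 2))  # -> 2.38
```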