Yxxxb / VoCo-LLaMA

VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".
https://yxxxb.github.io/VoCo-LLaMA-page/
Apache License 2.0

Flash attention and attention mask modification. Does the model support flash attention? #9

Closed · denadai2 closed this issue 5 months ago

denadai2 commented 5 months ago

Dear authors, first of all, congrats on your idea and paper!!

I have a question about the code. I see here https://github.com/Yxxxb/VoCo-LLaMA/blob/79859d0ad7df2f322ae7b06c58d246b062f39ffd/llava/model/language_model/llava_llama_1stg.py#L263 that in the flash attention path you do not modify the attention mask. Is that expected?

thanks

Yxxxb commented 5 months ago

Hi,

Thank you for your interest. Since we need a 4D attention mask, but the open-source flash attention implementation only supports a 2D causal attention mask, we chose the standard SDPA path and modified the attention mask on top of that.
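
As a rough illustration of the difference (not the repo's actual code), a 4D boolean mask passed to PyTorch's `scaled_dot_product_attention` can express per-position restrictions that FlashAttention's `is_causal` flag cannot. The helper name `build_voco_attn_mask` and the index ranges below are hypothetical; see `llava_llama_1stg.py` for the real implementation:

```python
import torch
import torch.nn.functional as F

def build_voco_attn_mask(seq_len, vision_range, voco_range, device="cpu"):
    """Hypothetical sketch of a 4D boolean attention mask for SDPA.

    Starts from a standard causal mask, then (illustratively) blocks
    tokens after the VoCo span from attending directly to the raw
    vision tokens, so they can only read the compressed VoCo tokens.
    `vision_range` / `voco_range` are assumed (start, end) index pairs.
    """
    # Standard lower-triangular causal mask: True = may attend.
    allowed = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=device)
    )

    v_start, v_end = vision_range
    c_start, c_end = voco_range
    # Tokens after the VoCo span may not attend to the raw vision tokens.
    allowed[c_end:, v_start:v_end] = False

    # SDPA accepts a boolean mask broadcastable to
    # (batch, num_heads, q_len, kv_len); add the two leading dims.
    return allowed[None, None, :, :]

# Usage: q, k, v are (batch, heads, seq, head_dim) tensors.
q = k = v = torch.randn(1, 8, 16, 64)
mask = build_voco_attn_mask(16, vision_range=(1, 9), voco_range=(9, 10))
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# FlashAttention's fast path only exposes is_causal=True/False (a 2D
# causal pattern), which cannot express the per-position restriction
# above; hence the choice of standard SDPA with a modified mask.
```

This is only a sketch of why a 4D mask is needed: the compression constraint varies per query and key position, which a single causal flag cannot encode.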