Oryx-mllm / Oryx

MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
https://oryx-mllm.github.io

[Question] the 'variable-length attention operator in flash attention' #2

Closed: jungle-gym-ac closed 2 months ago

jungle-gym-ac commented 2 months ago

Hi there! The paper mentions using the "variable-length attention operator provided in flash attention (Dao et al., 2022) to compute the attention for each visual input within the batch independently". However, when I read the code here, I could not find anything related to this variable-length attention operator, and the high-resolution features appear to be encoded with a for loop. Did I miss something? Thank you!

liuzuyan commented 2 months ago

Hi, thanks for your interest in our work! In the code you mentioned, we pre-process the input images into a list, then forward the whole list to OryxViT for batched computation here. The variable-length attention is applied here. Feel free to ask should you have further questions!
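
For reference, here is a minimal standalone sketch of how the varlen interface in the flash-attn package (assuming flash-attn >= 2.x, which exports flash_attn_varlen_func) packs variable-length sequences into one batched call. The token counts, shapes, and variable names below are illustrative, not our exact implementation:

```python
import torch
from flash_attn import flash_attn_varlen_func

num_heads, head_dim = 16, 64
# Three images at different resolutions produce different numbers of visual tokens.
seq_lens = [576, 1024, 256]
total_tokens = sum(seq_lens)

# Pack every token from the batch into one unpadded tensor of shape
# (total_tokens, num_heads, head_dim) instead of padding to the longest sequence.
q = torch.randn(total_tokens, num_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# cu_seqlens holds the cumulative sequence boundaries: [0, 576, 1600, 1856].
cu_seqlens = torch.cumsum(
    torch.tensor([0] + seq_lens, device="cuda", dtype=torch.int32), dim=0
).to(torch.int32)
max_seqlen = max(seq_lens)

# The kernel restricts attention to each [cu_seqlens[i], cu_seqlens[i+1]) span,
# so every image attends only to its own tokens, with no padding computed.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
)
print(out.shape)  # torch.Size([1856, 16, 64])
```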

jungle-gym-ac commented 2 months ago

Ah, thanks! I read the code again and figured it out.