Hi there! It is mentioned in the paper that you use the "variable-length attention operator provided in flash attention (Dao et al., 2022) to compute the attention for each visual input within the batch independently". However, I read the code here and could not find any code related to this variable-length attention operator; the high-resolution features are encoded with a for loop instead. Did I miss something?
Thank you!
Hi, thanks for your interest in our work! We pre-process the input images into a list in the code you referenced, and then forward the whole list to OryxViT for batch computation here. The variable-length attention is applied here. Feel free to ask should you have further questions!
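For readers less familiar with this pattern, below is a minimal sketch (not the actual Oryx/OryxViT implementation) of how variable-length attention can be applied to a packed batch of differently sized visual inputs using `flash_attn_varlen_func` from the flash-attn package. The function `varlen_self_attention`, the tensor shapes, and the shared q/k/v projection are illustrative assumptions, not code from the repository.

```python
# Minimal sketch (not the Oryx codebase): pack variable-length visual token
# sequences into one tensor and attend over them with flash-attn's varlen kernel.
# Assumes the `flash_attn` package is installed; flash-attn requires CUDA tensors
# in fp16/bf16. Names and shapes here are illustrative.
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_varlen_func

def varlen_self_attention(token_seqs, num_heads, head_dim):
    """token_seqs: list of [seq_len_i, num_heads * head_dim] CUDA tensors,
    one per image/video in the batch (sequence lengths may differ)."""
    # Concatenate all sequences into one packed tensor: [total_tokens, dim].
    packed = torch.cat(token_seqs, dim=0)
    total, dim = packed.shape

    # Cumulative sequence lengths tell the kernel where each sample starts,
    # so attention never crosses sample boundaries.
    seqlens = torch.tensor([t.shape[0] for t in token_seqs],
                           dtype=torch.int32, device=packed.device)
    cu_seqlens = F.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))
    max_seqlen = int(seqlens.max())

    # For brevity, reuse the same tensor as q, k and v; a real ViT block
    # would apply separate learned projections first.
    qkv = packed.view(total, num_heads, head_dim).half()

    out = flash_attn_varlen_func(
        qkv, qkv, qkv,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
    )  # [total_tokens, num_heads, head_dim]

    # Split the packed output back into per-sample feature sequences.
    lengths = [t.shape[0] for t in token_seqs]
    return list(out.reshape(total, dim).split(lengths, dim=0))
```

With this packing, visual inputs of different resolutions go through the encoder in a single kernel call instead of a Python for loop, while the `cu_seqlens` offsets keep each sample's attention independent, which is the behavior described in the paper.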