facebookresearch / ToMe

A method to increase the speed and lower the memory footprint of existing vision transformers.

Hyper-param "r" in ViT-g/14 #28

Closed rese1f closed 1 year ago

rese1f commented 1 year ago

Hi! In the paper, ViT-L (24 layers) uses r=8. Do you have any suggestion for r in ViT-g (40 layers)? Thank you!

dbolya commented 1 year ago

That depends on the image size (i.e., the number of tokens you start with). I'd just set r = floor(#tokens / #layers), which would be 4 (or potentially 5) if your ViT-g model has 196 tokens (which I doubt it does, but just as an example).
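A quick sketch of that rule of thumb (the function name here is illustrative, not part of the ToMe API):

```python
import math

def suggest_r(num_tokens: int, num_layers: int) -> int:
    """Rule of thumb from this thread: spread the token reduction
    roughly evenly across the network's depth."""
    return math.floor(num_tokens / num_layers)

# ViT-L at 196 tokens, 24 layers -> r = 8 (matches the paper's setting)
print(suggest_r(196, 24))  # 8
# A hypothetical ViT-g with 196 tokens, 40 layers -> r = 4
print(suggest_r(196, 40))  # 4
```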

rese1f commented 1 year ago

So in the end, should the number of tokens be super small? In my case, I use a 448x448 input (1025 tokens) with 40 layers, so I use r = 25 and get

token: 1025->1000->975->950->925->900->875->850->825->800->775->750->725->700->675->650->625->600->575->550->525->500->475->450->425->400->375->350->325->300->275->250->225->200->175->150->125->100->75->50

am I right?
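For reference, a tiny snippet reproducing the schedule above, assuming a constant r merged at each block, with each listed value being the token count entering that block:

```python
num_tokens, num_layers, r = 1025, 40, 25

# Token count entering each of the 40 blocks when every block merges r tokens.
schedule = [num_tokens - r * i for i in range(num_layers)]
print("->".join(str(n) for n in schedule))
# 1025->1000->975->...->75->50
```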

dbolya commented 1 year ago

Yeah, that looks good to me. Looks like you can even increase r by one there, but that should be good to get started.
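Putting it together, setting r on a patched timm model follows the usage shown in the ToMe README; the checkpoint name below is a placeholder for whichever ViT-g/14 variant is actually in use:

```python
import timm
import tome

# Placeholder checkpoint name: substitute your actual ViT-g/14 model here.
model = timm.create_model("vit_giant_patch14_224", pretrained=False)

# Apply the ToMe patch, then set r as computed above (~1025 tokens / 40 layers).
tome.patch.timm(model)
model.r = 25
```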

rese1f commented 1 year ago

Thank you so much!