Closed by rese1f 1 year ago
That depends on the image size (i.e., the number of tokens you start with). I'd just set r = floor(#tokens / #layers), which would be 4 (or potentially 5) if your ViT-g model has 196 tokens (which I doubt it does, but just as an example).
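For reference, the rule of thumb above can be written as a tiny helper. This is only a sketch of the arithmetic; `suggest_r` is a hypothetical name for illustration, not a function from the repository.

```python
def suggest_r(num_tokens: int, num_layers: int) -> int:
    """Rule of thumb from this thread: r = floor(#tokens / #layers).

    `suggest_r` is a hypothetical helper name used only for illustration.
    """
    if num_tokens <= 0 or num_layers <= 0:
        raise ValueError("num_tokens and num_layers must be positive")
    # Integer floor division implements floor(#tokens / #layers).
    return num_tokens // num_layers


# 196 tokens over 40 layers (the ViT-g example above)
print(suggest_r(196, 40))
# 196 tokens over 24 layers (a ViT-L-sized stack)
print(suggest_r(196, 24))
```

With 196 tokens this gives 4 for 40 layers, matching the estimate above, and 8 for 24 layers.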
So in the end, should the number of tokens be very small? In my case, I use 448x448 images (1025 tokens) with 40 layers, so I use r = 25, and get
Am I right?
Yeah, that looks good to me. Looks like you can even increase r by one there, but that should be good to get started.
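To sanity-check a choice of r, one can roughly simulate how many tokens remain after each layer. The sketch below is an assumption-laden model, not the library's code: it assumes every layer removes r tokens, capped at half of the tokens currently present, and `tokens_per_layer` is a hypothetical helper name.

```python
def tokens_per_layer(num_tokens: int, num_layers: int, r: int) -> list[int]:
    """Token count before the first layer and after each layer.

    Modelling assumption (not taken from the thread): each layer removes
    r tokens, but never more than half of the tokens it currently has.
    """
    counts = [num_tokens]
    for _ in range(num_layers):
        current = counts[-1]
        removed = min(r, current // 2)  # assumed per-layer cap
        counts.append(current - removed)
    return counts


# The setup discussed above: 1025 tokens, 40 layers.
for r in (25, 26):
    counts = tokens_per_layer(1025, 40, r)
    print(f"r={r}: tokens after last layer = {counts[-1]}")
```

Under these assumptions, r = 25 ends with 25 tokens after the last layer, and r = 26 still leaves a positive number of tokens, which is consistent with the comment that r could be increased by one.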
Thank you so much!
Hi! In the paper, ViT-L (24 layers) is used with r=8. Do you have any suggestion for r with ViT-g (40 layers)? Thank you!