facebookresearch / ToMe

A method to increase the speed and lower the memory footprint of existing vision transformers.

Used in the vanilla Transformer #30

Closed: yuzhenmao closed this issue 10 months ago

yuzhenmao commented 1 year ago

Hi! Thanks for the amazing work! In the paper, ToMe is only used in ViT. I am wondering if ToMe can be applied to the vanilla Transformer. In that case, I guess it would be similar to setting the patch size to 1. Have you tried something similar? Please correct me if I said something wrong. Thanks!

dbolya commented 1 year ago

I haven't tried it, but I don't see any reason why it wouldn't work. You'd just have to increase the number of tokens merged in each layer (r) by a lot.
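
For illustration, here is a rough sketch of what that could look like using the merge utilities in this repo (`tome.merge.bipartite_soft_matching` and `merge_wavg`), applied to a generic token sequence inside a plain Transformer block. The key-based similarity metric, the sequence shape, and the large `r` value are placeholders, not recommendations:

```python
# Sketch only: merging tokens of a long "patch size 1" sequence with a large r.
import torch
from tome.merge import bipartite_soft_matching, merge_wavg

B, N, D = 8, 1024, 512          # batch, tokens, feature dim (arbitrary values)
x = torch.randn(B, N, D)        # token features entering a Transformer block
k = torch.randn(B, N, D)        # attention keys, used here as the similarity metric
r = 256                         # merge far more tokens per layer than in ViT

merge, _ = bipartite_soft_matching(k, r, class_token=False, distill_token=False)
x, size = merge_wavg(merge, x)  # x: [B, N - r, D]; size tracks how many tokens each output represents
print(x.shape)
```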

yuzhenmao commented 10 months ago

Hi! Thank you for your previous reply. That makes sense to me. A follow-up question: can ToMe be applied to LLMs with a causal mask? Thanks!

dbolya commented 10 months ago

I haven't tried, but it may be possible to average together the causal masks just like we do with the tokens. However, I'm not sure whether that would produce good results (given that it would kinda break causality).

yuzhenmao commented 10 months ago

Thank you so much for your reply! I am not sure what you mean by "average together the causal masks". Do you mean that when averaging two similar tokens, because they have different causal masks, simply averaging them could violate causality? Thanks!

dbolya commented 10 months ago

A causal mask is technically just a vector that's 1 for everything after and including the current token, and 0 for everything before. Thus, technically you could average these together and have 0.5 for everything between the two tokens you've averaged. I'm not sure this would work, though.
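
For concreteness, a small sketch of that column-averaging idea (the mask layout and indices below are just an illustration, not something implemented in this repo). Each column of a lower-triangular causal mask says which queries may attend to that token: 1 at and after the token, 0 before. Averaging the columns of two merged tokens leaves 0.5 for the queries that fall between them:

```python
# Sketch only: average the causal-mask columns of two merged key/value tokens.
import torch

N = 6
mask = torch.tril(torch.ones(N, N))   # rows = queries, cols = keys

i, j = 1, 3                            # pretend tokens 1 and 3 get merged
merged_col = 0.5 * (mask[:, i] + mask[:, j])
print(merged_col)                      # values: 0.0, 0.5, 0.5, 1.0, 1.0, 1.0
```

Whether attention with those fractional 0.5 entries still behaves sensibly is exactly the open question here.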

yuzhenmao commented 10 months ago

Thanks!