Open subneed opened 1 year ago
I'm not too familiar with FMNs, but it seems like it's a hierarchical network with a different attention mechanism? In principle you can use ToMe on anything that uses tokens, but like you said you'd need to be careful about the downsampling layers. You might be able to use ToMe instead of those downsampling layers, but that would probably require some exploration to figure out what's best.
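To make the downsampling concern a bit more concrete, here is a rough sketch of the token-count bookkeeping for a hierarchical model where ToMe removes `r` tokens per block. Everything in it is an assumption for illustration: the function name `token_schedule`, the stage depths, the per-stage `r` values, and the 4x token reduction between stages are hypothetical and not taken from any FMN or ToMe code. It also pretends the downsampling layer can still run on a merged token set, which is exactly the part that would need exploration, since merged tokens no longer sit on a regular grid.

```python
def token_schedule(num_tokens, depths, r_per_stage, downsample=4):
    """Token count after every block, grouped by stage (illustrative only).

    num_tokens  -- tokens entering the first stage
    depths      -- number of blocks in each stage (hypothetical values)
    r_per_stage -- tokens to merge per block in each stage (hypothetical values)
    downsample  -- assumed token reduction factor between stages
    """
    schedule = []
    n = num_tokens
    for stage, (depth, r) in enumerate(zip(depths, r_per_stage)):
        if stage > 0:
            # Assumption: the downsampling layer divides the token count,
            # ignoring that merged tokens are no longer on a grid.
            n //= downsample
        counts = []
        for _ in range(depth):
            # Bipartite matching pairs tokens across two sets, so a single
            # merge step can remove at most half of the tokens.
            merged = min(r, n // 2)
            n -= merged
            counts.append(n)
        schedule.append(counts)
    return schedule


# Hypothetical 4-stage layout starting from a 56x56 token grid.
schedule = token_schedule(56 * 56, depths=(2, 2, 6, 2), r_per_stage=(64, 16, 4, 0))
for stage, counts in enumerate(schedule):
    print(f"stage {stage}: {counts}")
```

The point of the sketch is only that a single global `r` is unlikely to fit: each stage starts from roughly a quarter of the previous token count, so `r` probably has to shrink per stage (or be expressed as a ratio) to avoid merging away most of the later stages.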
This problem is still up for debate in the research world, so we can only answer things that have already been covered in our paper.
Any help on modifying ToMe for focal modulation networks (FMNs)? I guess in an FMN we could apply ToMe on Q/M. It also has downsampling layers in each stage, so would the r value and the model definition need to change for each stage?