TranThanh96 closed this issue 1 year ago
The issue with this is that (from my understanding) MobileNet / MobileViT models usually have a pyramid-like architecture with pooling layers, so their output feature maps do not have the same dimensionality as the patch features, which makes distillation on the patch features less than straightforward. However, if you have an efficient architecture in mind that would output e.g. 16x16 feature maps + a class token, we could consider it.
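To make the mismatch concrete, here is a rough sketch of the grid sizes involved. The numbers are assumptions for illustration only: a 224x224 input, a ViT-style teacher with patch size 14 (which gives the 16x16 map mentioned above), and a pyramid student with the commonly used stage strides 4, 8, 16 and 32. The function names are hypothetical and not part of any library.

```python
# Illustrative only: input size, patch size and strides are assumed values.

def vit_grid(image_size, patch_size):
    """Spatial grid of patch tokens for a ViT-style model (class token excluded)."""
    side = image_size // patch_size
    return (side, side)


def pyramid_grids(image_size, strides=(4, 8, 16, 32)):
    """Spatial grid at each stage of a pyramid-style network with the given strides."""
    return [(image_size // s, image_size // s) for s in strides]


teacher = vit_grid(224, 14)      # single-resolution patch grid
student = pyramid_grids(224)     # one grid per pyramid stage

# Stages whose grid lines up one-to-one with the teacher's patch grid.
matching = [grid for grid in student if grid == teacher]

print("teacher patch grid:", teacher)
print("student stage grids:", student)
print("stages matching the teacher grid:", matching)
```

Under these assumptions none of the student stages lines up with the teacher grid, so a per-patch distillation loss would need some form of resampling or a student that keeps a single 16x16 resolution.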
Closing to keep track of similar asks in #166 instead.
How about these? https://github.com/snap-research/EfficientFormer https://github.com/apple/ml-fastvit
As in the title, could you try this? I think it would be good to have a lightweight version for mobile.