OscarXZQ / weight-selection


Cross-architecture weight selection #7

Open j93hahn opened 2 months ago

j93hahn commented 2 months ago

Thanks for open-sourcing the code! I have a question - your paper seems to revolve around mono-architectural weight initialization. What if I want to use a very large pretrained ViT to initialize a much smaller CNN?

Directly reusing the weights doesn't seem as applicable, especially since CNNs and ViTs do not carry the same inductive biases. Do you know of any papers exploring this direction?

OscarXZQ commented 2 months ago

Hi @j93hahn,

Thanks for your interest in our paper. Yes, we primarily focus on within-model-family weight transfer.

From a practical perspective, there is no good reason to initialize a much smaller CNN with a large pretrained ViT.

  1. When the model sizes differ by a lot, weight selection tends to perform worse.
  2. There is no one-to-one component mapping between a CNN and a ViT. Convolutions are quite different from attention, so there is no obvious counterpart to the per-tensor selection we use within a model family (see the sketch below).
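For context, here is a minimal sketch of what within-model-family weight selection could look like in PyTorch, assuming the large and small models share layer names and differ only in width/depth. The function name and the leading-slice strategy are illustrative assumptions, not the repo's exact implementation:

```python
import torch


def uniform_weight_selection(large_state_dict, small_model):
    """Illustrative sketch: initialize `small_model` by slicing each
    matching tensor from a larger pretrained model of the same family.
    (Hypothetical helper; the repo's actual selection scheme may differ.)"""
    small_state = small_model.state_dict()
    new_state = {}
    for name, small_tensor in small_state.items():
        if name in large_state_dict:
            large_tensor = large_state_dict[name]
            # Take the leading slice along every dimension so the
            # selected weights match the smaller model's shape.
            slices = tuple(slice(0, s) for s in small_tensor.shape)
            new_state[name] = large_tensor[slices].clone()
        else:
            # Fall back to the small model's own initialization.
            new_state[name] = small_tensor
    small_model.load_state_dict(new_state)
    return small_model
```

Note that this only works because the two models' tensors line up name-for-name; a cross-architecture version would first need a mapping from ViT components (attention, MLP blocks) to CNN components (convolutions), which is exactly the missing piece in point 2 above.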

In the OpenReview discussion of our paper, we conducted an initial experiment on cross-architecture weight initialization (initializing ViT-T from an isotropic ConvNeXt-S), but the performance gain was far lower than with within-model-family initialization.

Please feel free to contact me by email if you want to discuss the details of your approach or ideas.

Best, Oscar