OscarXZQ / weight-selection


Cross-architecture weight selection #7

Open j93hahn opened 2 months ago

j93hahn commented 2 months ago

Thanks for open-sourcing the code! I have a question - your paper seems to revolve around mono-architectural weight initialization. What if I want to use a very large pretrained ViT to initialize a much smaller CNN?

Directly reusing the weights doesn't seem as applicable, especially since CNNs and ViTs do not carry the same inductive biases. Do you know of any papers exploring this direction?

OscarXZQ commented 2 months ago

Hi @j93hahn,

Thanks for your interest in our paper. Yes, we primarily focus on within-model-family weight transfer.

From a practical perspective, there is no good reason to initialize a much smaller CNN with a large pretrained ViT.

  1. When the model sizes differ by a lot, weight selection tends to perform worse.
  2. There is no one-to-one component mapping between a CNN and a ViT. Convolutions are quite different from attention, so there is no obvious counterpart to the per-tensor selection we use within a model family (see the sketch below).
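For context, here is a minimal sketch of what within-model-family weight selection could look like in PyTorch, assuming the large and small models share layer names and differ only in width/depth. The function name and the leading-slice strategy are illustrative assumptions, not the repo's exact implementation:

```python
import torch


def uniform_weight_selection(large_state_dict, small_model):
    """Illustrative sketch: initialize `small_model` by slicing each
    matching tensor from a larger pretrained model of the same family.
    (Hypothetical helper; the repo's actual selection scheme may differ.)"""
    small_state = small_model.state_dict()
    new_state = {}
    for name, small_tensor in small_state.items():
        if name in large_state_dict:
            large_tensor = large_state_dict[name]
            # Take the leading slice along every dimension so the
            # selected weights match the smaller model's shape.
            slices = tuple(slice(0, s) for s in small_tensor.shape)
            new_state[name] = large_tensor[slices].clone()
        else:
            # Fall back to the small model's own initialization.
            new_state[name] = small_tensor
    small_model.load_state_dict(new_state)
    return small_model
```

Note that this only works because the two models' tensors line up name-for-name; a cross-architecture version would first need a mapping from ViT components (attention, MLP blocks) to CNN components (convolutions), which is exactly the missing piece in point 2 above.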

In the OpenReview discussion of our paper, we conducted an initial experiment on cross-architecture weight initialization (initializing ViT-T from an isotropic ConvNeXt-S), but the performance gain was far lower than with within-model-family initialization.

Please feel free to contact me by email if you want to discuss the details of your approach or ideas.

Best, Oscar