facebookresearch / ppuda

Code for Parameter Prediction for Unseen Deep Architectures (NeurIPS 2021)
MIT License

Fine tuning of network #6

Open CheukHinHoJerry opened 2 years ago

CheukHinHoJerry commented 2 years ago

Thank you for your work and for generously releasing the code.

In the Google Colab sample, the accuracy of the predicted model was about 60%. I was wondering whether we could continue training the predicted model to reach higher accuracy. Ideally, this would be faster than training a new model from scratch.

Have you tried this before?
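
For concreteness, here is roughly what I have in mind (a sketch only: the GHN-2 loading follows the README example as far as I understand it, and the training loop, data pipeline and hyperparameters are generic PyTorch placeholders rather than the Colab settings):

```python
import torch
import torchvision
import torchvision.transforms as T
from ppuda.ghn.nn import GHN2  # GHN-2 loader as shown in the repo README

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Predict parameters for an off-the-shelf architecture (README-style usage).
ghn = GHN2('cifar10')                                   # pretrained GHN-2 for CIFAR-10
model = torchvision.models.resnet18(num_classes=10)     # architecture choice is arbitrary here
model = ghn(model)                                      # fill in predicted parameters
model = model.to(device)

# Continue training with a standard PyTorch loop (placeholder settings).
transform = T.Compose([T.ToTensor(),
                       T.Normalize((0.4914, 0.4822, 0.4465),
                                   (0.2470, 0.2435, 0.2616))])
train_set = torchvision.datasets.CIFAR10('./data', train=True, download=True,
                                         transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                            weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(10):                                 # fine-tune for a few epochs
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```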

bknyaz commented 2 years ago

Hi, and thank you for your interest. I've tried this and other strategies to further improve fine-tuning results, but there are some challenges. I'm working on alleviating these challenges, and the results will soon be presented at the ICML workshop https://pretraining.github.io/. Stay tuned!

bknyaz commented 2 years ago

Hi, please have a look at pull request #7, which improves fine-tuning results. Also see the corresponding report on arXiv, which shows fine-tuning results and training curves obtained with this code when training for up to 300 epochs.

CheukHinHoJerry commented 2 years ago

Thanks a lot! Will definitely take a look. Much appreciated!

minhquoc0712 commented 1 year ago

Hi, in the paper "Pretraining a Neural Network before Knowing Its Architecture", can you explain why the orthogonal re-initialization is not applied to some layers, as stated here: "Furthermore, the orthogonal re-initialization step introduced next in Section 3.2 is not beneficial or applicable to some layers (e.g. first layers or batch normalization layers)."

Also, for the PCA visualization in Figure 4 of that paper, how do you get the vector representation for each architecture? I thought each operation in an architecture has its own encoding vector.

bknyaz commented 1 year ago

Hi, sorry for the late reply.

  1. This is an empirical observation, and we don't know exactly why orthogonalization helps only in certain layers. One potential reason is illustrated in the "Orthogonal Convolutional Neural Networks" paper: their Fig. 1a shows that in the first layers the weights are not as similar and redundant as in deeper layers, and our Fig. 2 shows a similar trend. So orthogonalizing the predicted weights of the first layers does not make them significantly less correlated, because they were not very correlated in the first place, while the orthogonalization itself may destroy some useful features in these weights (because of the QR decomposition). In the first layers, the negative effects of orthogonalization therefore seem to dominate. In contrast, in deeper layers the filters are highly correlated, so even though orthogonalization may destroy some features, its overall effect is more positive. (A rough sketch of the QR-based re-initialization is given after this list.)
  2. About the PCA in Fig. 4, we average all node embeddings of an architecture to obtain its architecture embedding (see the footnote at the bottom of page 8 in our GHN-2 paper https://arxiv.org/pdf/2110.13100.pdf). A toy example of this averaging is also sketched below.
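
To make point 1 a bit more concrete, here is a rough sketch of QR-based orthogonal re-initialization of a predicted conv weight. This is only an illustration of the idea; the actual implementation in the pull request / paper may differ in details such as scaling and which layers are skipped.

```python
import torch

def orthogonalize(w: torch.Tensor) -> torch.Tensor:
    """Orthogonalize the output filters of a (out, in, kh, kw) conv weight
    via QR, keeping the overall norm of the predicted weight."""
    out_ch = w.shape[0]
    flat = w.reshape(out_ch, -1)                      # one row per output filter
    if out_ch > flat.shape[1]:
        return w                                      # rows cannot all be orthogonal, skip
    # QR of the transpose: Q has orthonormal columns, so Q^T has orthonormal rows.
    q, _ = torch.linalg.qr(flat.t(), mode='reduced')  # (in*kh*kw, out)
    w_orth = q.t().reshape_as(w)
    return w_orth * (w.norm() / w_orth.norm())        # restore the predicted weight's scale

# Example: a deep 3x3 conv layer (random tensor standing in for predicted weights).
w_pred = torch.randn(256, 128, 3, 3)
w_new = orthogonalize(w_pred)
rows = w_new.reshape(256, -1)
# off-diagonal entries of the Gram matrix are ~0, i.e. filters are mutually orthogonal
print((rows @ rows.t() - torch.diag((rows ** 2).sum(1))).abs().max())
```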
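
And for point 2, in code the averaging is just mean-pooling over a graph's node embeddings before running PCA. A toy sketch with random numbers standing in for the actual GHN node embeddings (shapes and names are illustrative, not the exact code behind Fig. 4):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
d = 32                                                  # node embedding dimension
# pretend we have node embeddings for 100 architectures with varying graph sizes
node_embeddings = [rng.normal(size=(rng.integers(10, 50), d)) for _ in range(100)]

# mean-pool node (operation) embeddings -> one vector per architecture
arch_embeddings = np.stack([e.mean(axis=0) for e in node_embeddings])  # (100, d)

# project the architecture embeddings to 2D for visualization
points_2d = PCA(n_components=2).fit_transform(arch_embeddings)         # (100, 2)
```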