Optimizing the searched network

romulus0914 commented 4 years ago

Thanks for your excellent work!

I would like to ask two questions about optimizing the searched network and applying the KD algorithm.

First. In my understanding, the searched network is randomly initialized and then optimized to learn from the un-pruned network. Why isn't it initialized with the weights already trained in the un-pruned network and then fine-tuned, as in the three-stage pruning paradigm (training a large network, pruning, re-training)? Please correct me if I have misunderstood.

Second. I believe the KD algorithm has been slightly modified to suit this case. In particular, Eq. (9) looks like a cross-entropy term of the form "-sigma(P(x)log(Q(x)))", where sigma denotes a sum over classes. In my understanding, P is the true (target) distribution and Q is the estimated distribution. I therefore wonder whether there is a mistake in Eq. (9): I suggest it should be "-sigma(P(z_hat)log(Q(z)))" instead of "-sigma(P(z)log(Q(z_hat)))".
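
To make the second point concrete, here is a minimal sketch in plain Python (not the repository's code) of the soft-target cross entropy in both orderings. It assumes z_hat denotes the logits of the un-pruned network and z those of the searched network; the function names, the temperature parameter, and the example logits are all hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - log_sum_exp for v in scaled]


def soft_cross_entropy(target_logits, estimate_logits, temperature=1.0):
    """Return -sum_i P_i * log(Q_i), with P from target_logits and Q from estimate_logits."""
    log_p = log_softmax(target_logits, temperature)
    log_q = log_softmax(estimate_logits, temperature)
    return -sum(math.exp(lp) * lq for lp, lq in zip(log_p, log_q))


# Hypothetical logits, chosen only for illustration.
z_hat = [2.0, 0.5, -1.0]  # assumed: un-pruned network (target distribution P)
z = [1.5, 0.7, -0.5]      # assumed: searched network (estimated distribution Q)

loss_suggested = soft_cross_entropy(z_hat, z)   # -sigma(P(z_hat) log Q(z))
loss_as_printed = soft_cross_entropy(z, z_hat)  # -sigma(P(z) log Q(z_hat))
print(loss_suggested, loss_as_printed)
```

Cross entropy is not symmetric in its two arguments, so the two orderings generally give different loss values, which is why the order in Eq. (9) matters.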

Thanks for your attention.

romulus0914 commented 4 years ago

Pardon my carelessness in the first question. I suppose "TAS w/ Init" in Table 1 is the setting I described above.

D-X-Y commented 4 years ago

First. Your understanding is correct.
Second. That is our mistake; we will fix it in our revision. Thanks!

D-X-Y / AutoDL-Projects

Optimizing the searched network #27