Hi @huizhang0110, thanks for your interest.
In block-wise NAS (including DNA and other previous works), the proxy task (the evaluation metric, Eqn. 1 in the BossNAS paper) is the (weighted) sum of the loss of each stage. Assuming the losses of the stages are independent, minimizing the sum reduces to minimizing the loss of each stage separately. But as you pointed out, in our current setting, the loss of a stage is correlated with the best path of the previous stage. We adopt this scheme because it yields slightly better experimental results. You can also try always using the first path (e.g., [0, 0, 0, 0]) as the path in the previous stages; the results are similar. It is worth noting that we can combine the architectures of the stages freely and perform the search with their saved ratings (i.e., our method is not a greedy search).
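As a rough illustration of searching with saved ratings rather than greedily, here is a minimal Python sketch (not the released BossNAS code; the `ratings` layout is an assumption): each stage keeps a saved rating (loss) per candidate path, and the final architecture is assembled by combining the best-rated path of every stage.

```python
# Minimal sketch, not the official implementation. `ratings[s]` is assumed to
# map a candidate path (tuple of operator indices) in stage s to its saved loss.
def combine_ratings(ratings, weights=None):
    """Pick the lowest-(weighted-)loss candidate in every stage independently."""
    weights = weights or [1.0] * len(ratings)
    best_paths, total_loss = [], 0.0
    for stage_ratings, w in zip(ratings, weights):
        path, loss = min(stage_ratings.items(), key=lambda kv: kv[1])
        best_paths.append(path)
        total_loss += w * loss
    return best_paths, total_loss

# Example: two stages, two candidate paths each.
ratings = [
    {(0, 0): 0.42, (0, 1): 0.39},  # stage 0
    {(1, 0): 0.55, (1, 1): 0.58},  # stage 1
]
paths, loss = combine_ratings(ratings)  # -> [(0, 1), (1, 0)], 0.94
```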
First, ResAttn is much heavier and slower than ResConv when the spatial size (i.e., sequence length) n is larger than 14x14, due to its O(n^2) complexity, which makes the comparison between the two candidates unfair. Second, ResAttn with a large spatial size consumes a large amount of GPU memory, making the search process inefficient. You can try replacing our ResAttn with more efficient Transformer blocks of O(n log n) or O(n) complexity, such as Performer blocks or Swin blocks, to perform the search with the Attn operator at all depths.
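To make the O(n^2) cost concrete, here is a back-of-the-envelope sketch (the numbers are illustrative assumptions: batch 1, fp32, a single attention score matrix, ignoring heads and projections):

```python
# Self-attention builds an (n x n) score matrix for n = H * W tokens, so memory
# grows quadratically with the spatial size. Illustrative numbers only.
def attn_score_matrix_mb(h, w, bytes_per_el=4):
    n = h * w
    return n * n * bytes_per_el / 1024 ** 2

for size in (14, 28, 56):
    print(f"{size}x{size}: {attn_score_matrix_mb(size, size):.1f} MB")
# 14x14: 0.1 MB, 28x28: 2.3 MB, 56x56: 37.5 MB -> 16x growth per 2x upsampling
```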
We project the outputs (from different views of the same image) with different resolutions to the latent space and ensemble the latent vectors (which are similar if the network is well trained). Note that the outputs are projected by MLPs after pooling, so there should be no conflict among the outputs.
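A minimal PyTorch sketch of this idea (the shapes, names, and averaging ensemble are assumptions for illustration, not the paper's exact code): features at different resolutions are pooled to remove the spatial dimension, projected by an MLP, and the resulting latent vectors are ensembled.

```python
import torch
import torch.nn as nn

class ProjectAndEnsemble(nn.Module):
    def __init__(self, channels, latent_dim=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # removes the spatial resolution
        self.mlp = nn.Sequential(            # projection head
            nn.Linear(channels, latent_dim),
            nn.ReLU(inplace=True),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, feature_maps):
        # feature_maps: list of (B, C, H_i, W_i) tensors from different views
        latents = [self.mlp(self.pool(f).flatten(1)) for f in feature_maps]
        return torch.stack(latents).mean(dim=0)  # ensembled latent vector

# Usage: two views of the same image at different resolutions.
head = ProjectAndEnsemble(channels=128)
z = head([torch.randn(4, 128, 14, 14), torch.randn(4, 128, 7, 7)])  # (4, 256)
```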
In Supv. class, every block is trained by directly predicting the label. Ensemble Bootstrapping may provide better intermediate targets for the blocks (as the targets are generated by the blocks themselves), and thus leads to higher ranking correlations.
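For concreteness, here is an illustrative contrast between the two training signals (hypothetical tensors and loss choices, not the official implementation):

```python
import torch.nn.functional as F

def supervised_loss(block_logits, labels):
    # Supv. class: an intermediate block directly predicts the label.
    return F.cross_entropy(block_logits, labels)

def bootstrap_loss(online_latent, target_latent):
    # Ensemble Bootstrapping (sketch): regress the block's latent onto a
    # detached target generated by the blocks themselves, e.g. via negative
    # cosine similarity as in BYOL-style bootstrapping.
    online = F.normalize(online_latent, dim=-1)
    target = F.normalize(target_latent.detach(), dim=-1)
    return -(online * target).sum(dim=-1).mean()
```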
Thanks, your clear explanation addressed my concerns.
Hi, thanks for your excellent work~
It is inspiring and practical for improving the sub-net ranking correlations. But I have a few questions.
Unsupv. EB is better than Supv. class. Do you have a theoretical explanation for it?