changlin31 / BossNAS

(ICCV 2021) BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search

Some questions about BossNAS #1

Closed · huizhang0110 closed 3 years ago

huizhang0110 commented 3 years ago

Hi, thanks for your excellent work~

It is inspiring and a practical way to improve sub-net ranking correlations, but I have a few questions.

  1. Although searching progressively is beneficial for improving the ranking correlation within each stage, won't it lead to an accumulation of error? The best path of a previous stage may not be suitable for the following stages. How do you explain this?
  2. Why is the ResAttn operator only searched at depth = 1/2?
  3. In the hybrid search space, ensembling outputs of different resolutions seems odd, since adaptive pooling discards the structural information, so I don't understand why it is suitable.
  4. As shown in Table 4, Unsupv. EB is better than Supv. class. Do you have a theoretical explanation for this?
changlin31 commented 3 years ago

Hi @huizhang0110, thanks for your interest.

  1. In Block-wise NAS (including DNA and other previous works), the proxy task (the evaluation metric, Eqn. 1 in the BossNAS paper) is the (weighted) sum of the losses of all stages. Assuming the losses of the stages are independent, minimizing the sum amounts to minimizing the loss of each stage separately. But as you pointed out, in our current setting, the loss of a stage is correlated with the best path of the previous stages. We adopt this scheme because it yields slightly better experimental results; you can also try always using the first path (e.g. [0, 0, 0, 0]) for the previous stages, and the results are similar. It is worth noting that we can freely combine the candidates of each stage and perform the search with their saved ratings, i.e., our method is not a greedy search (a minimal sketch of this combination is given after this list).

  2. First, ResAttn is much heavier and slower than ResConv at spatial sizes (i.e. sequence lengths) n larger than 14x14, due to its O(n^2) complexity, which makes the competition between the two candidates unfair. Second, ResAttn at large spatial sizes consumes a large amount of GPU memory, making the search inefficient (see the back-of-the-envelope numbers after this list). You can try replacing our ResAttn with more efficient Transformer blocks of O(n log n) or O(n) complexity, such as Performer or Swin blocks, to search with the attention operator at all depths.

  3. We project the outputs (from different views of the same image) with different resolutions into the latent space and ensemble the latent vectors (which are similar if the network is well trained). Note that the outputs are projected by MLPs after pooling, so there should be no conflict among them (see the sketch after this list).

  4. In Supv. class., every block is trained to directly predict the label. Ensemble bootstrapping may provide better intermediate targets for the blocks (as the target is generated by the block itself), and thus leads to higher ranking correlations.
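
A minimal sketch of the point in item 1, assuming the rating of every candidate block in every stage has been saved during the block-wise search (the dictionaries and values below are purely illustrative, not from the released code): full architectures are scored by summing the saved per-stage ratings, so all combinations can be ranked without any greedy pruning.

```python
# Hypothetical sketch (not the released BossNAS code): ranking full
# architectures from per-stage ratings saved during block-wise search.
# stage_ratings[s][block] is assumed to hold the evaluation loss of the
# candidate block `block` in stage `s`.
from itertools import product

stage_ratings = [
    {(0, 0): 0.31, (0, 1): 0.28, (1, 1): 0.35},   # stage 0 candidates
    {(0, 0): 0.42, (1, 0): 0.39, (1, 1): 0.44},   # stage 1 candidates
    {(0, 0): 0.25, (0, 1): 0.27, (1, 1): 0.22},   # stage 2 candidates
]

def total_rating(arch):
    """Score a full architecture as the sum of its per-stage ratings (cf. Eqn. 1)."""
    return sum(stage_ratings[s][block] for s, block in enumerate(arch))

# Enumerate every combination of per-stage candidates; no greedy pruning is
# needed because the per-stage ratings are already saved.
all_archs = product(*[r.keys() for r in stage_ratings])
best = min(all_archs, key=total_rating)
print(best, total_rating(best))
```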
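
To make the O(n^2) argument in item 2 concrete, here is a rough count of attention-matrix entries at different feature-map resolutions (plain arithmetic, independent of any particular implementation):

```python
# Rough cost of global self-attention at different spatial sizes: the
# attention matrix has n x n entries with n = H * W tokens, so memory and
# compute grow quadratically with the number of tokens.
for hw in (56, 28, 14, 7):
    n = hw * hw                  # sequence length of an hw x hw feature map
    print(f"{hw}x{hw}: n = {n:5d}, attention entries per head = {n * n:,}")
```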
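
A minimal PyTorch sketch of the pooling-plus-MLP projection described in item 3 (module names and dimensions are illustrative assumptions, not the released BossNAS code): each output, whatever its resolution, is pooled to 1x1, projected by an MLP into a shared latent space, and the latent vectors are then ensembled, so no spatial structure needs to be aligned.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Illustrative projection head: adaptive pooling followed by an MLP."""
    def __init__(self, in_dim, latent_dim=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # any HxW -> 1x1
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, latent_dim),
        )

    def forward(self, x):                          # x: (B, C, H, W)
        z = self.pool(x).flatten(1)                # (B, C)
        return self.mlp(z)                         # (B, latent_dim)

proj = Projector(in_dim=64)
feat_a = torch.randn(2, 64, 28, 28)                # one output, higher resolution
feat_b = torch.randn(2, 64, 14, 14)                # another output, lower resolution
target = (proj(feat_a) + proj(feat_b)) / 2         # ensemble in the latent space
print(target.shape)                                # torch.Size([2, 256])
```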

huizhang0110 commented 3 years ago

Thanks, your clear explanation addressed my concerns.