D-X-Y / AutoDL-Projects

Automated deep learning algorithms implemented in PyTorch.
MIT License

Training Algorithm while Searching #18

Closed. romulus0914 closed this issue 5 years ago.

romulus0914 commented 5 years ago

First of all, thanks for your great work!

However, I have some questions about the training algorithm used while searching. Since the softmax function is applied during back-propagation, will the gradients generated along one path (the argmax function) also update the weights on the other paths? Or will the gradients only update the weights on the sampled path?

Another question is about a description in Sec. 3.2 (Acceleration) of the paper: "Within one training batch, each sample produces a different h_{i,j}, and, therefore, each element in A_{i,j} has a high possibility of being updated with gradients." In my understanding, if each sample in the mini-batch samples a different architecture, it behaves as if the batch size were 1.

Thanks again!

D-X-Y commented 5 years ago

Thanks for your interest. First, yes, the gradients will also update the weights on other paths. Second, sorry for the confusion: "each sample" here refers to each cell; the batch of data on each GPU shares the same h_{i,j}, but different cells sample different operations.
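
To make the first point concrete, here is a minimal sketch (my own illustration, not code taken from this repository) of the straight-through Gumbel-Softmax trick that GDAS-style search relies on: the forward pass commits to one operation via argmax, but the backward pass still sends a gradient to every entry of the architecture parameter alpha. The shapes, the temperature `tau`, and the toy `features` tensor are arbitrary choices for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

alpha = torch.zeros(8, requires_grad=True)  # architecture logits for one edge, 8 candidate ops
tau = 10.0                                  # Gumbel-Softmax temperature (arbitrary here)

gumbels = -torch.empty_like(alpha).exponential_().log()   # Gumbel(0, 1) noise
probs = F.softmax((alpha + gumbels) / tau, dim=-1)         # soft probabilities
index = probs.argmax(dim=-1)
one_hot = torch.zeros_like(probs).scatter_(-1, index.view(-1), 1.0)
hardwts = one_hot - probs.detach() + probs                 # hard in forward, soft in backward

# Pretend each candidate op produced a scalar feature; the forward value only uses
# the sampled op, but the gradient w.r.t. alpha is non-zero for every entry.
features = torch.randn(8)
out = (hardwts * features).sum()
out.backward()
print(alpha.grad)  # all 8 entries receive a gradient, not just the sampled one
```

Running this, `alpha.grad` is non-zero in all eight positions even though only the argmax entry contributed to the forward value.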

romulus0914 commented 5 years ago

Thanks for your reply!

qinb commented 5 years ago

Thanks for your interest. First, yes, the gradients will also update the weights on other paths. Second, sorry for the confusion: "each sample" here refers to each cell; the batch of data on each GPU shares the same h_{i,j}, but different cells sample different operations.

Let me ask in Chinese 😄, looking forward to your answer!

Question 1: Do all cells in the model share a single alpha (i.e., the arch parameter)?

My understanding: in DARTS, all cells share one alpha. A cell contains 4 nodes and 14 edges in total, and each edge has 8 candidate functions (e.g., zero, conv), so the shape of alpha is 14x8. Looking at your last answer:

Second, sorry for the confusion: "each sample" here refers to each cell; the batch of data on each GPU shares the same h_{i,j}, but different cells sample different operations.

Do you mean that each cell uses its own alpha? That is, if the alpha inside each cell has size 14x8, then the arch parameter would be cell_num x 14 x 8. Is that the right way to understand it? But if all cells share one alpha, wouldn't every cell then choose the same function on each edge? I am quite confused and look forward to your answer!

Question 2: When back-propagating through the arch parameters, are only the selected arch parameters updated, while the unselected ones are left untouched?

My understanding: according to the paper, only the arch parameters on the sampled path (one path) are updated, and the other paths are not updated:

Acceleration. In Eq. (5), h_{i,j} is a one-hot vector. As a result, in the forward procedure, we only need to calculate the function F_{argmax(h_{i,j})}. During the backward procedure, we only back-propagate the gradient generated at argmax(h̃_{i,j}). In this way, we can save most computation time and also reduce the GPU memory cost by about |F| times.

But looking at your reply here:

First, yes, the gradients will also update the weights on other paths.

I am a bit lost here; I would really appreciate your help with this. Thanks a lot! @D-X-Y

D-X-Y commented 4 years ago

Sorry for the late reply.

1. Yes, all cells share the same alpha. However, since I use Gumbel-Softmax, the probabilities generated by Gumbel-Softmax can differ across different cells.
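
To illustrate point 1, here is a small sketch (my own example, not code from this repository) showing how a single shared alpha can still lead different cells to pick different operations: each cell draws its own Gumbel noise, so the argmax can differ from cell to cell. The 14x8 shape simply follows the DARTS numbers quoted above; the temperature is arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_cells, num_edges, num_ops = 5, 14, 8                      # 14x8 follows the DARTS shape above
alpha = torch.randn(num_edges, num_ops, requires_grad=True)   # one architecture parameter, shared by all cells
tau = 10.0

for cell_id in range(num_cells):
    gumbels = -torch.empty_like(alpha).exponential_().log()   # fresh Gumbel noise per cell
    probs = F.softmax((alpha + gumbels) / tau, dim=-1)
    sampled_ops = probs.argmax(dim=-1)                         # per-edge op index for this cell
    print(f"cell {cell_id}: ops on first 4 edges = {sampled_ops[:4].tolist()}")
```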

2. Please see the details at https://github.com/D-X-Y/NAS-Projects/blob/master/lib/models/cell_searchs/search_cells.py#L65
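
For point 2, my rough reading of the forward pass that the link points to (a paraphrase of the idea, not a verbatim copy of the repository code) is that on every edge only the sampled operation is actually executed, while the scalar weights of the other operations are still added so their alpha entries stay on the autograd graph:

```python
import torch

def edge_forward(x, ops, hardwts, argmax):
    """One edge of the search cell: run only the sampled op, keep a gradient
    path to alpha for the other ops via their (zero-valued) hard weights."""
    return sum(
        hardwts[k] * op(x) if k == argmax else hardwts[k]
        for k, op in enumerate(ops)
    )

# toy usage: two candidate "ops", the second one is sampled
ops = [torch.nn.Linear(4, 4), torch.nn.Identity()]
hardwts = torch.tensor([0.0, 1.0], requires_grad=True)  # would come from the straight-through trick
x = torch.randn(2, 4)
y = edge_forward(x, ops, hardwts, argmax=1)
```

The `hardwts[k]` terms for the unsampled operations are zero in the forward pass, so the output is unchanged, but they keep a gradient path to alpha; the expensive `op(x)` is computed only for the argmax index, which matches the |F|-times saving described in the Acceleration paragraph quoted above.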