Describe the Question
It's not a bug report. I just want to understand some details about the GDAS algorithm.
I am curious about the sampling process in each iteration. As I understand it, GDAS samples operations twice per iteration: once to optimize the network weights and once to optimize the architecture weights (see the pseudo-code below). I noticed this in this repository and would like to verify that my understanding is correct.
while not converged:
    // train network weights
    sample a batch from D_t
    sample one operation o1
    forward via operation o1
    update the sub-network's weights by gradient descent
    // train architecture weights
    sample a batch from D_v
    sample another operation o2
    forward via operation o2
    update the architecture weights by gradient descent
If the description above is correct, I would also like to understand the reasoning behind this approach. After training the weights of the sampled operation, it seems more reasonable to me to update that same operation's architecture weights, rather than sampling another operation and updating the architecture weights for it. In other words, why not use the same operation o1 for both the network-weight step and the architecture-weight step within a single iteration? I would appreciate any insights or clarification on this point.
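To make the two variants concrete, here is a minimal sketch in plain Python of the sampling part only. It assumes operations are drawn with the Gumbel-max trick from a vector of architecture logits. The names `gumbel_softmax_sample`, `run_iterations`, and `resample_for_arch` are hypothetical and not taken from this repository, and the actual forward passes and gradient updates are omitted.

```python
import math
import random


def gumbel_softmax_sample(logits, tau, rng):
    """Draw one operation index with the Gumbel-max trick.

    Returns (index, probs), where probs is the softmax of the
    Gumbel-perturbed logits at temperature tau.
    """
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1).
    noise = [-math.log(-math.log(rng.uniform(1e-10, 1.0 - 1e-10)))
             for _ in logits]
    perturbed = [(a + g) / tau for a, g in zip(logits, noise)]
    # Numerically stable softmax over the perturbed logits.
    m = max(perturbed)
    exps = [math.exp(v - m) for v in perturbed]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = max(range(len(probs)), key=probs.__getitem__)
    return index, probs


def run_iterations(logits, num_iters, resample_for_arch, seed=0, tau=1.0):
    """Record which operation each step would use in every iteration.

    resample_for_arch=True  -> independent draws o1 and o2 (the pseudo-code above)
    resample_for_arch=False -> reuse o1 for the architecture step (the alternative)
    """
    rng = random.Random(seed)
    history = []
    for _ in range(num_iters):
        # Weight step on a batch from D_t (update itself omitted).
        o1, _ = gumbel_softmax_sample(logits, tau, rng)
        if resample_for_arch:
            # Architecture step on a batch from D_v uses a fresh draw.
            o2, _ = gumbel_softmax_sample(logits, tau, rng)
        else:
            o2 = o1
        history.append((o1, o2))
    return history


if __name__ == "__main__":
    alpha = [0.0, 0.0, 0.0]  # hypothetical logits for three candidate operations
    twice = run_iterations(alpha, 1000, resample_for_arch=True)
    same = sum(o1 == o2 for o1, o2 in twice)
    print(f"o1 == o2 in {same} of {len(twice)} iterations")
```

With independent draws, o1 and o2 coincide only by chance, so the operation whose weights were just updated is often not the one evaluated in the architecture step. Setting `resample_for_arch=False` gives the alternative I am asking about.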
Which Algorithm
GDAS
Thanks for your time and consideration.