Linfengscat opened this issue 5 years ago
I believe the code already works this way. The model's optimizer only contains the weight parameters, and the architecture optimizer only contains alpha and beta. Please correct me if that isn't right.
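A minimal sketch of that separation, using a toy stand-in module (`TinySearchNet` and the parameter split below are illustrative assumptions, not this repo's actual code): the network weights w go into one optimizer and the architecture parameters alpha/beta into another, so each `step()` only ever touches its own group.

```python
import torch
import torch.nn as nn

class TinySearchNet(nn.Module):
    """Toy stand-in for the searchable network: fc holds the network
    weights w, while alpha/beta are the architecture parameters."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 4)                     # network weights w
        self.alpha = nn.Parameter(torch.zeros(4, 3))  # cell-level architecture params
        self.beta = nn.Parameter(torch.zeros(4))      # network-level architecture params

    def forward(self, x):
        # Simplified mixing: scale outputs by softmaxed architecture weights.
        return self.fc(x) * torch.softmax(self.beta, dim=0)

model = TinySearchNet()

# Split the parameters: everything except alpha/beta goes to the weight optimizer.
arch_params = [model.alpha, model.beta]
weight_params = [p for n, p in model.named_parameters() if n not in ("alpha", "beta")]

optimizer_w = torch.optim.SGD(weight_params, lr=0.025, momentum=0.9)         # updates w only
optimizer_arch = torch.optim.Adam(arch_params, lr=3e-4, betas=(0.5, 0.999))  # updates alpha, beta only
```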
@HankKung Sorry, I was careless. Thanks!
I think it would be better to train the network weights and the architecture weights separately: to be exact, freeze the gradients of α and β when updating w, and freeze the gradients of w when updating α and β.
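Continuing the sketch above (same imports and objects), one way to realize this alternation is to toggle `requires_grad` so that alpha/beta are frozen while w is updated on a training batch, and w is frozen while alpha/beta are updated on a validation batch. The training/validation split and the specific loop below are assumptions in the spirit of the usual DARTS-style setup, not this repo's exact procedure.

```python
def set_requires_grad(params, flag):
    for p in params:
        p.requires_grad_(flag)

# Dummy data standing in for the training and validation batches.
x_train, y_train = torch.randn(16, 8), torch.randn(16, 4)
x_val, y_val = torch.randn(16, 8), torch.randn(16, 4)
loss_fn = nn.MSELoss()

for step in range(10):
    # Step 1: update w on the training batch, with alpha/beta frozen.
    set_requires_grad(arch_params, False)
    set_requires_grad(weight_params, True)
    optimizer_w.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer_w.step()

    # Step 2: update alpha/beta on the validation batch, with w frozen.
    set_requires_grad(arch_params, True)
    set_requires_grad(weight_params, False)
    optimizer_arch.zero_grad()
    loss_fn(model(x_val), y_val).backward()
    optimizer_arch.step()
```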
By the definition of: