Open: yangsenius opened this issue 5 years ago
@yangsenius You reminded me, thank you! Have you tried:
self.alpha_normal = torch.randn(k, num_ops)
self.alpha_reduce = torch.randn(k, num_ops)
What is the performance when you update the code with the statement above? Please let me know if you re-run the experiment.
self.alpha_normal = torch.randn(k, num_ops)
self.alpha_reduce = torch.randn(k, num_ops)
will make self.alpha_normal and self.alpha_reduce always be torch.FloatTensor, which sometimes causes errors with model.cuda(); this is a little troublesome.
Maybe
self.alpha_normal = torch.randn(k, num_ops, dtype=self.your_conv.dtype)
self.alpha_reduce = torch.randn(k, num_ops, dtype=self.your_conv.dtype)
is OK?
Or just:
self.alpha_normal = nn.Parameter(torch.randn(k, num_ops))

def filter(model):
    for name, param in model.named_parameters():
        if 'alpha' in name:
            continue
        yield param

optimizer = torch.optim.Adam(filter(model))
What do you think? Is there a better code implementation for this issue?
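If the alphas are filtered out of the weight optimizer like this, they still need their own optimizer so the architecture step can update them. A minimal sketch of that pairing, assuming the model exposes alpha_normal / alpha_reduce and using learning rates close to the DARTS defaults (all values here are illustrative):

import torch

def weight_params(model):
    # Same idea as filter(model) above: yield everything except the alphas.
    for name, param in model.named_parameters():
        if 'alpha' in name:
            continue
        yield param

def build_optimizers(model):
    # One optimizer for the operation weights, a separate one for the alphas.
    w_optimizer = torch.optim.SGD(weight_params(model), lr=0.025,
                                  momentum=0.9, weight_decay=3e-4)
    a_optimizer = torch.optim.Adam([model.alpha_normal, model.alpha_reduce],
                                   lr=3e-4, betas=(0.5, 0.999), weight_decay=1e-3)
    return w_optimizer, a_optimizer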
Since we usually set the device to 'cuda:0',
self.alpha_reduce = torch.randn(k, num_ops, device=torch.device("cuda"))
would be an OK option (note it is the device argument, not dtype). Let me know if you see any problems.
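A small sketch of that option; requires_grad_() keeps the alphas trainable even though they are plain tensors, and the sizes k=14, num_ops=8 are just the values for a standard 4-node, 8-operation DARTS cell:

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
k, num_ops = 14, 8  # 14 edges for 4 intermediate nodes, 8 candidate operations

alpha_normal = (1e-3 * torch.randn(k, num_ops, device=device)).requires_grad_()
alpha_reduce = (1e-3 * torch.randn(k, num_ops, device=device)).requires_grad_()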
@yangsenius
Hi, I also noticed this problem yesterday. I think splitting the parameters into two groups may be a good choice. When training a ConvNet (e.g. MobileNet), we usually give the conv weights a weight decay of 5e-4 but no decay for BN, so we define
optimizer = SGD([{param group 1: conv with decay}, {param group 2: BN without decay}])
I think we can separate the alphas and weights in the same way, as sketched below.
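A minimal sketch of that grouping, using a toy conv+BN stack just to show the two param groups (the decay values are illustrative); the same pattern generalizes to a weights group and an alphas group:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

decay, no_decay = [], []
for name, param in model.named_parameters():
    # 1-D parameters here are the BN weight/bias; leave them without weight decay.
    if param.ndim == 1:
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.SGD(
    [{'params': decay, 'weight_decay': 5e-4},
     {'params': no_decay, 'weight_decay': 0.0}],
    lr=0.1, momentum=0.9)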
@dragen1860 @yangsenius
Yeah, you got it. @zh583007354
self.alpha_normal = nn.Parameter(torch.randn(k, num_ops))

def filter(model):
    for name, param in model.named_parameters():
        if 'alpha' in name:
            continue
        yield param

optimizer = torch.optim.Adam([
    {'params': filter(model)},         # weight group
    {'params': [model.alpha_normal]},  # alpha group
])
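Whichever grouping is used, the search still alternates the two updates on different data. A rough first-order sketch with separate optimizers; w_optimizer, a_optimizer, model, criterion, train_loader and valid_loader are all assumed to exist as in the snippets above:

for (x_train, y_train), (x_valid, y_valid) in zip(train_loader, valid_loader):
    # Architecture step on a validation batch (first-order approximation).
    a_optimizer.zero_grad()
    criterion(model(x_valid), y_valid).backward()
    a_optimizer.step()

    # Weight step on a training batch.
    w_optimizer.zero_grad()
    criterion(model(x_train), y_train).backward()
    w_optimizer.step()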
It takes one hour per epoch to search the architecture. However, the paper says "a small network of 8 cells is trained using DARTS for 50 epochs. The search takes one day on a single GPU". If I train for 50 epochs, it will take more than two days.
@yangsenius Hi, I have another question.
I want to know whether the clip_grad_norm_() in train_search.py is necessary:
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), args.grad_clip)
optimizer.step()
If it is necessary to clip the gradient, should it be applied only to the weight params or to all params?
Thank you.
I think it is hard to say whether this clip_grad_norm_() is necessary. We can't know the gradient range of the parameters in advance, so the clipping is there to avoid gradient explosions (just in case), although that may never happen. So this code snippet may be unnecessary. @zh583007354
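If you do keep the clipping, one way to restrict it to the weight parameters only is to reuse the filter-style generator from above. This is only a sketch: loss, model and optimizer are assumed from the quoted snippet, and max_norm=5.0 mirrors the usual grad_clip setting.

import torch.nn as nn

def weight_params(model):
    # Yield only the operation weights, skipping the architecture parameters.
    for name, param in model.named_parameters():
        if 'alpha' in name:
            continue
        yield param

loss.backward()
nn.utils.clip_grad_norm_(weight_params(model), max_norm=5.0)  # clip weights only
optimizer.step()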
https://github.com/dragen1860/DARTS-PyTorch/blob/cfcdd02cea876ce85940aa01ee58664405390fa7/model_search.py#L217
nn.Parameter() will register alpha and beta to model.parameters(), so your optimizer will update alpha and beta when it optimizes the weights of the operations. So I think nn.Parameter() should not be used here; otherwise it will not be consistent with the paper or the original code.
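A sketch of the alternative the original DARTS code takes: keep the alphas as plain tensors (not nn.Parameter) so model.parameters() never returns them, and expose them through a separate accessor for the architecture optimizer. Class and method names here are illustrative:

import torch
import torch.nn as nn

class SearchNetwork(nn.Module):
    def __init__(self, k, num_ops):
        super().__init__()
        # ... operation weights (convs, BNs, ...) registered as usual ...
        # Plain tensors are not picked up by model.parameters(), so the weight
        # optimizer never sees them. Note that model.cuda() will not move them
        # either, which is exactly the dtype/device issue discussed above, so
        # create them on the right device from the start.
        self.alpha_normal = (1e-3 * torch.randn(k, num_ops)).requires_grad_()
        self.alpha_reduce = (1e-3 * torch.randn(k, num_ops)).requires_grad_()

    def arch_parameters(self):
        return [self.alpha_normal, self.alpha_reduce]

# weight optimizer: model.parameters();  arch optimizer: model.arch_parameters()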