choasma / HSIC-bottleneck

The HSIC Bottleneck: Deep Learning without Back-Propagation
https://arxiv.org/abs/1908.01580
MIT License

Question on block coordinate descent #10

Closed lizhenstat closed 3 years ago

lizhenstat commented 3 years ago

Hi, thanks for your work on proposing a new way to train networks; however, I am confused about the optimization part.

If I understand correctly, train_hsic.py is the core of the implementation, as stated in issue 1, so the central optimization code is the following:

output, hiddens = model(data)
# select only the parameters (weight, bias) of layer i
params, param_names = misc.get_layer_parameters(model=model, idx_range=idx_range[i])
optimizer = optim.SGD(params, lr=config_dict['learning_rate'], momentum=.9, weight_decay=0.001)
optimizer.zero_grad()
# per-layer HSIC terms against the input (h_data) and the target (h_target)
hx_l, hy_l = hsic_objective(hiddens[i], h_target=h_target.float(), h_data=h_data, sigma=config_dict['sigma'])
loss = hx_l - config_dict['lambda_y'] * hy_l
loss.backward()
optimizer.step()

As I understand it, this part of the code corresponds to Algorithm 1, and loss here corresponds to equation (6). I think "block coordinate descent" means updating groups of variables step by step, but I did not find an explicit block-wise update of the parameters here (can you point me to the code where this optimizer is defined?). I have two questions related to "unformatted training":

(1) Why is this part said not to use backprop? (The code above still calls loss.backward() and optimizer.step().)

(2) What does the "block" mean in block coordinate descent? Are the neurons in the same layer considered to be in the same group?
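For reference, my reading of the loss above (just my guess from the code, so please correct me if the mapping is wrong) is the per-layer objective

$$\min_{\theta_i}\;\mathrm{HSIC}(Z_i, X)\;-\;\lambda_y\,\mathrm{HSIC}(Z_i, Y),$$

where Z_i is the activation of hidden layer i, X and Y are the input and label batches, and lambda_y is the balance weight from the config.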

Thanks a lot!

choasma commented 3 years ago

hi @lizhenstat

Thank you for your attention to our project and the paper. Regarding your question, I have just prepared a notebook (link) with an explanation. You can also pull the project and run the same script in Python (link).

The goal here is to keep the optimizer from updating all the parameters (e.g., weights, biases) across the layers of the model with the associated local HSIC-bottleneck loss. The idea is that the optimizer is only fed the parameters of one particular layer through the function misc.get_layer_parameters, where the index i in idx_range selects the weight and bias of layer i.

This ensures that all the other parameters, especially those of upstream layers, are not changed by optimizing the loss at layer i.
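Here is a minimal sketch of that per-layer update (a toy model and simplified names, not the repository code verbatim), including a quick check that only the selected layer moves:

import torch
import torch.nn as nn
import torch.optim as optim

# toy MLP standing in for the real model
model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(),
                      nn.Linear(16, 16), nn.ReLU(),
                      nn.Linear(16, 4))
all_params = list(model.parameters())        # [W0, b0, W1, b1, W2, b2]

i = 1                                        # update only the second Linear layer
layer_params = all_params[2 * i:2 * i + 2]   # its (weight, bias) pair
optimizer = optim.SGD(layer_params, lr=0.1, momentum=0.9, weight_decay=0.001)

before = [p.detach().clone() for p in all_params]

x = torch.randn(8, 20)
loss = model(x).pow(2).mean()                # placeholder loss in place of the HSIC one
optimizer.zero_grad()
loss.backward()                              # gradients flow to every layer...
optimizer.step()                             # ...but only layer i's parameters move

changed = [not torch.equal(b, p.detach()) for b, p in zip(before, all_params)]
print(changed)                               # expect [False, False, True, True, False, False]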

choasma commented 3 years ago

In the provided ISSUE-0010-BCD scripts, you'll see that with this approach only the layers of interest are updated; the others remain the same. The "block" in block coordinate descent is exactly what you pointed out in (2): we view the neural network as a group of "blocks", where each block is the set of parameters at a particular layer. In our standard case each layer contributes a weight and a bias, which is why you'll see the step of 2 in the code.
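As a rough illustration of the "step of 2" and the block-wise sweep (again a simplified sketch, not the repository code):

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 10))
all_params = list(model.parameters())     # [W0, b0, W1, b1]

# each "block" is the (weight, bias) pair of one layer, hence the step of 2
blocks = [all_params[j:j + 2] for j in range(0, len(all_params), 2)]

for i, block_params in enumerate(blocks):
    # the optimizer for block i only sees that layer's weight and bias
    optimizer = optim.SGD(block_params, lr=0.01, momentum=0.9)
    # ... compute the layer-i HSIC-bottleneck loss here, then:
    # optimizer.zero_grad(); loss.backward(); optimizer.step()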

For ResNet, we learn the HSIC-bottleneck objective at each residual block, which has more parameters than a standard MLP layer.
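As a rough illustration only (using a torchvision-style ResNet rather than our own model definition), a "block" then collects every parameter under one residual block, not just a single (weight, bias) pair:

import torchvision.models as models

resnet = models.resnet18(weights=None)

# one BCD "block" = all parameters of one residual block, e.g. layer1[0]:
# the conv weights plus the BatchNorm weights and biases
block_params = list(resnet.layer1[0].parameters())
print(len(block_params))   # several tensors per block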