TimRoith / BregmanLearning

Optimizing neural networks via an inverse scale space flow.
MIT License

Hyperparameters needed to reproduce results from 'A Bregman Learning Framework for Sparse Neural Networks' #8

Open sirTomasson opened 6 months ago

sirTomasson commented 6 months ago

Hi there,

I am currently trying to determine the robustness of sparse networks against adversarial examples. For this, we are trying to reproduce the results from your paper, specifically the Fashion-MNIST CNN model.

I am having trouble reproducing the reported levels of sparsity and accuracy; I am getting 80% accuracy and sparsity levels of around 50%. The paper mentions that the hyperparameters were tuned, but I cannot find the specific values. Would you be able to share the hyperparameters you used to obtain your results?

Any help would be appreciated.

TimRoith commented 6 months ago

Hi,

first of all, sorry for not reporting the hyperparameters. The notebooks currently deviate slightly from the state they were in when the experiments were produced, but this is quick to adjust:

The important difference is in the weight initialization: To reproduce the experiment, we need to use

import torch
import torch.nn as nn
# maf refers to the repository's auxiliary module, imported elsewhere in the notebook

def init_weights(conf, model):
    # Uniform bias and normal weight initialization, scaled per parameter group via conf.r
    maf.sparse_bias_uniform_(model, 0, conf.r[0])
    maf.sparse_bias_uniform_(model, 0, conf.r[0], ltype=nn.Conv2d)
    maf.sparse_weight_normal_(model, conf.r[1])
    maf.sparse_weight_normal_(model, conf.r[2], ltype=nn.Conv2d)

    # Sparsify both the convolutional and the linear layers
    maf.sparsify_(model, conf.sparse_init, ltype=nn.Conv2d, conv_group=conf.conv_group)
    maf.sparsify_(model, conf.sparse_init, ltype=nn.Linear)
    model = model.to(conf.device)
    return model

In the current state of the notebook, we only sparsify the convolutional kernels. In the snippet above we also sparsify the linear layers with maf.sparsify_(model, conf.sparse_init, ltype=nn.Linear), which leaves only 1% of the network parameters non-zero. We can check this with maf.net_sparsity(model), which should return 0.01.
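Since the internals of maf are not shown in this thread, here is a plain-PyTorch sketch of what such a sparsity check computes: the fraction of non-zero weight entries across the Linear and Conv2d layers. Note that this stand-in is my own approximation; the repo's maf.net_sparsity may count parameters slightly differently.

```python
import torch
import torch.nn as nn

def net_sparsity(model: nn.Module) -> float:
    """Fraction of non-zero weight entries in Linear and Conv2d layers
    (a stand-in for maf.net_sparsity; the repo's helper may differ)."""
    nonzero, total = 0, 0
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            nonzero += int(m.weight.count_nonzero())
            total += m.weight.numel()
    return nonzero / total

# Toy model; randomly keep ~1% of the weights, mimicking sparse_init = 0.01
model = nn.Sequential(nn.Conv2d(1, 8, 3), nn.Flatten(), nn.Linear(8 * 26 * 26, 10))
with torch.no_grad():
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            m.weight *= (torch.rand_like(m.weight) < 0.01).float()

print(f"{net_sparsity(model):.3f}")  # roughly 0.01
```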

Performance-wise, the relevant hyperparameter is conf.r, which is $r$ in the paper and scales the weight initialization. Weights are usually scaled to prevent exploding or vanishing gradients. However, since we zero out 99% of the weights, we need to make up for this by multiplying with $r$, which is chosen per parameter group. The derivation in our paper suggests $r=\sqrt{s_0^{-1}}$, where $s_0$ is the initial fraction of non-zero weights, in our case 0.01, and therefore $r=10$. However, since we randomly mask out weights, there might be other effects in the forward and backward pass, which makes $r$ more of a hyperparameter that might need tuning.
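As a quick sanity check of that formula, plugging in the value of $s_0$ used here:

```python
import math

s0 = 0.01               # initial fraction of non-zero weights
r = math.sqrt(1 / s0)   # r = sqrt(s0^{-1}) from the paper's derivation
print(r)                # 10.0
```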

That being said, I just tested the following configuration (without tuning hyperparameters):

- optimizer: LinBreg
- random seed: 0
- lr: 0.07
- $\lambda_0=\lambda_1=0.5$
- $r$: 10

which yields a best model (using train.best_model) with the following stats:

- Test accuracy: 0.88
- Convolution kernel sparsity: 0.035
- Linear sparsity: 0.033

This result is in line with Table 2 of the paper, and did not use any hyperparameter tuning.
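For reference, that configuration could be collected into a conf object along these lines. The attribute names here are my guesses (only r, sparse_init, conv_group, and device appear in the init_weights snippet earlier in this thread), so treat this as a sketch rather than the notebook's actual config:

```python
from types import SimpleNamespace
import torch

# Hypothetical config mirroring the hyperparameters listed above;
# attribute names other than r, sparse_init, conv_group, device are assumptions.
conf = SimpleNamespace(
    seed=0,                # random seed
    lr=0.07,               # learning rate for LinBreg
    lamda=[0.5, 0.5],      # regularization strengths lambda_0, lambda_1
    r=[10, 10, 10],        # initialization scaling per parameter group
    sparse_init=0.01,      # initial fraction of non-zero weights
    conv_group=True,       # group sparsity on conv kernels (assumption)
    device="cuda" if torch.cuda.is_available() else "cpu",
)
torch.manual_seed(conf.seed)
```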

TimRoith commented 6 months ago

I will also push the version of the notebook used for the example above.