IBM / aihwkit

IBM Analog Hardware Acceleration Kit
https://aihwkit.readthedocs.io
Apache License 2.0

Weight initialization #106

Closed chaeunl closed 3 years ago

chaeunl commented 3 years ago

Description

There are two issues with the weight initialization: 1) aihwkit appears to follow He's initialization, but the max/min bound is not correct; 2) some memory devices have their own bounds, and those bounds can be smaller than the bounds of Xavier's initialization.

How to reproduce

1) From [1], the uniform initialization bound should be ±sqrt(6 / (fan_in + fan_out)). For instance, if aihwkit.nn.AnalogLinear has 256 fan-ins and 128 fan-outs, then the range of the weights should be about -0.12 to 0.12. But, as the screenshot below shows, the range does not match (it is instead half of the max/min value). [screenshot: observed initial weight range] So, I think it would be better to modify the allowed range of the weights, or to support various types of initialization methods (see the sketch after the reference below).

2) The figure below shows the response curve of an aihwkit.simulator.configs.devices.LinearStepDevice whose slope_up and slope_down are 0.0083. [figure: LinearStepDevice response curve] As you might expect, if the number of neurons increases, then the allowed range of a device would come to match the range of the initial weights. Although it is rare, somebody may also report this issue in the future. I don't have concrete ideas for this issue, but it would be a good alternative if people could modify the initial weights easily.

[1] Kaiming He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," ICCV, 2015.
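A minimal check along these lines (layer sizes as in the example above; the expected bound is the Xavier-uniform rule, and the snippet assumes the layer exposes its tile as analog_tile, as used later in this thread):

import math
from aihwkit.nn import AnalogLinear

fan_in, fan_out = 256, 128
layer = AnalogLinear(in_features=fan_in, out_features=fan_out, bias=False)

# expected Xavier-uniform bound: sqrt(6 / (fan_in + fan_out)) ~ 0.125
expected_bound = math.sqrt(6.0 / (fan_in + fan_out))

weights, _ = layer.analog_tile.get_weights(realistic=False)
print("expected bound: +/-", expected_bound)
print("observed range:", weights.min().item(), "to", weights.max().item())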


diego-plan9 commented 3 years ago

Thanks for the detailed report @chaeunl! This seems like a good candidate for an accurate answer from @maljoras.

maljoras commented 3 years ago

Thank you for the comment. Note that the weight initialization is independent of the weight range setting. It is up to the user to choose an RPU weight range that fits the initialization (which is taken from the standard PyTorch inits). Decoupling the two is necessary since the weights might need to grow during training and become larger than the initial values set by the weight init.
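For instance, a minimal sketch of choosing the device weight range and overriding the default init (the w_min/w_max values are only illustrative):

import torch
from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import LinearStepDevice

# choose a device weight range that leaves room for the weights to grow
rpu_config = SingleRPUConfig(device=LinearStepDevice(w_min=-0.6, w_max=0.6))
layer = AnalogLinear(4, 3, bias=False, rpu_config=rpu_config)

# optionally overwrite the default (PyTorch) init with custom values
layer.set_weights(0.1 * torch.randn(3, 4))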

chaeunl commented 3 years ago

Thank you, @maljoras and @diego-plan9. I confirmed that the weights and biases are initialized by Xavier's rule.

chaeunl commented 3 years ago

In addition, I am confused about the weights on the layers. If I use the TransferCompound with unit_cell_devices as [A1, A2, ..., An], then which parameters does analog_tile.get_weights() read? I guessed that analog_tile.get_weights() returns the tensor of weights of the nn.Linear() module, not the nn.AnalogLinear() module. And which parameters does analog_tile.get_hidden_parameters()["hidden_weights_k"] read? I confirmed that the initial weights read by analog_tile.get_hidden_parameters()["hidden_weights_k"] are all zeros.

So, if I want to use these types of devices, which have multiple devices in an RPU, is analog_tile.set_hidden_parameters() the only way to set the weights and biases? Or are you planning to introduce other methods?

In short: which function is best for reading and writing the weights directly?

maljoras commented 3 years ago

The get_weights method returns the "effective" weights, that is, the weights that are used for the forward and backward passes. For devices with multiple unit cell devices, that means the reduced weight, depending on the gamma weighting of each unit cell device (see the gamma and gamma_vec parameters). By default, the initial weight init sets the same weights for all hidden weights; however, it might depend on other settings, too. How the weight matrix is distributed onto the hidden devices is implemented in the onSetWeights method of the RPU device (see e.g. rpu_vector_device.cpp).

For the transfer compound (see rpu_transfer_device.cpp), only the non-hidden device (the second one) will be set, whereas the fast device (the first one) is set to all zeros. Note that the second hidden weight matrix in the case of a fully hidden transfer compound (i.e. gamma=0) is not used internally and is thus irrelevant (it is always the weight matrix you get with get_weights). This may be the cause of your confusion. It is a special case to optimize memory usage.
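In plain torch terms, the reduction is just a weighted sum of the per-device hidden weights (a sketch with stand-in values):

import torch

gamma_vec = [0.5, 0.5]                           # per-device weighting (illustrative)
hidden = [torch.randn(3, 4), torch.randn(3, 4)]  # stand-ins for hidden_weights_k
w_effective = sum(g * w for g, w in zip(gamma_vec, hidden))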

If you want to set the hidden devices yourself, it should be possible to first instantiate the tile, use get_hidden_parameters, set the hidden_weights_*, and use set_hidden_parameters again. However, in general, the initial weight setting should not be very critical for the SGD learning (as long as the weight value interval is approximately correct), as the training will modify the weights as appropriate. Also note that in the case of the "fully hidden" transfer compound, the last device is always directly set with set_weights (and the others are set to all zeros). Not sure whether you would need more flexibility here.
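A sketch of that round-trip (the key name follows the naming used in this thread; the new values are only illustrative):

import torch

# given an analog tile, e.g. tile = layer.analog_tile
params = tile.get_hidden_parameters()
params["hidden_weights_0"] = 0.05 * torch.randn_like(params["hidden_weights_0"])
tile.set_hidden_parameters(params)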

Note that setting realistic=True would use the update behavior to program the weights, which is very different from the weight setting described above.
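For example (a sketch; w is a weight tensor of the right shape):

# program the weights through the device update behavior instead of writing them exactly
tile.set_weights(w, realistic=True)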

maljoras commented 3 years ago

The short answer is: if you want to read all the hidden weights, you would indeed just use get_hidden_parameters. Note that this copies a lot of memory, so you do not want to do it too often during training, but for example only once at the end of an epoch.
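A sketch of that pattern (train_one_epoch and model are hypothetical placeholders):

for epoch in range(10):
    train_one_epoch(model)                 # hypothetical training step
    params = tile.get_hidden_parameters()  # copies a lot of memory: do this sparingly
    print(epoch, params["hidden_weights_0"].abs().mean())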

chaeunl commented 3 years ago

@maljoras, thank you for the detailed explanations. Following your comment, I tried to read the weights using get_hidden_parameters()["hidden_weights_x"], but the weights read this way are always zero. I compared the weights read in three different ways. Please refer to the code below:

import numpy as np
import torch
from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import SingleRPUConfig, UnitCellRPUConfig
from aihwkit.simulator.configs.devices import LinearStepDevice, TransferCompound, SoftBoundsDevice

in_size, out_size = 4, 3

TransferCompoundwLin_layer = AnalogLinear(in_features=in_size,
                                          out_features=out_size,
                                          bias=False,
                                          rpu_config=UnitCellRPUConfig(
                                              device=TransferCompound(
                                                  unit_cell_devices=[
                                                      LinearStepDevice(),
                                                      LinearStepDevice()],
                                                  units_in_mbatch=True,    
                                                  transfer_every=1,       
                                                  n_cols_per_transfer=1,  
                                                  gamma=0.0,              
                                                  scale_transfer_lr=True,  
                                                  transfer_lr=2.0,)))

TransferCompoundwSoft_layer = AnalogLinear(in_features=in_size,
                                           out_features=out_size,
                                           bias=False,
                                           rpu_config=UnitCellRPUConfig(
                                               device=TransferCompound(
                                                   unit_cell_devices=[
                                                       SoftBoundsDevice(),
                                                       SoftBoundsDevice()],
                                                   units_in_mbatch=True,    
                                                   transfer_every=1,       
                                                   n_cols_per_transfer=1,  
                                                   gamma=0.0,              
                                                   scale_transfer_lr=True,  
                                                   transfer_lr=2.0,)))

LinearStepDevice_layer = AnalogLinear(in_features=in_size,
                                      out_features=out_size,
                                      bias=False,
                                      rpu_config=SingleRPUConfig(
                                          device=LinearStepDevice()))

def init_weight_test(tile):
    in_size, out_size = tile.in_size, tile.out_size

    hidden_params = tile.get_hidden_parameters()
    revealed_weights = tile.get_weights(realistic=False)[0]
    forwarded_weights = tile.forward(torch.eye(in_size, in_size)).T

    print("Hidden Parameters:"+"\n   ", hidden_params.keys(), "\n"+"\n")

    idx = 0
    print("{}) ".format(idx)+"Read Weights by get_weight(realistic=False):"+"\n   ", revealed_weights, "\n")
    idx += 1
    print("{}) ".format(idx)+"Read Weights by forward():"+"\n   ", forwarded_weights, "\n")

    for k in hidden_params.keys():
        if "hidden_weights" in k:
            idx += 1
            hidden_weights = hidden_params[k]
            print("{}) ".format(idx)+"Read Weights by get_hidden_params[\"{}\"]:".format(k)+"\n   ", hidden_weights, "\n")

tile = LinearStepDevice_layer.analog_tile
init_weight_test(tile)

tile = TransferCompoundwLin_layer.analog_tile
init_weight_test(tile)

tile = TransferCompoundwSoft_layer.analog_tile
init_weight_test(tile)

The results show that the stored weights are not properly read by get_hidden_parameters()["hidden_weights_x"]:

Hidden Parameters:
    odict_keys(['max_bound', 'min_bound', 'dwmin_up', 'dwmin_down', 'decay_scales', 'diffusion_rates', 'reset_bias', 'slope_up', 'slope_down']) 

0) Read Weights by get_weight(realistic=False):
    tensor([[ 0.4660,  0.4214,  0.1626, -0.3317],
        [-0.0166, -0.3804, -0.1181,  0.1826],
        [-0.2635,  0.0426, -0.3509,  0.3933]]) 

1) Read Weights by forward():
    tensor([[ 0.4706,  0.4706,  0.1412, -0.4235],
        [-0.0941, -0.3765, -0.0471,  0.2353],
        [-0.2353, -0.0000, -0.4235,  0.3765]]) 

Hidden Parameters:
    odict_keys(['max_bound_0', 'min_bound_0', 'dwmin_up_0', 'dwmin_down_0', 'decay_scales_0', 'diffusion_rates_0', 'reset_bias_0', 'slope_up_0', 'slope_down_0', 'hidden_weights_0', 'max_bound_1', 'min_bound_1', 'dwmin_up_1', 'dwmin_down_1', 'decay_scales_1', 'diffusion_rates_1', 'reset_bias_1', 'slope_up_1', 'slope_down_1', 'hidden_weights_1']) 

0) Read Weights by get_weight(realistic=False):
    tensor([[-0.0508, -0.4979,  0.3187,  0.3275],
        [-0.3344, -0.3081,  0.0864, -0.2015],
        [ 0.2276, -0.3104,  0.1624, -0.1372]]) 

1) Read Weights by forward():
    tensor([[-0.0941, -0.4235,  0.4706,  0.3765],
        [-0.4706, -0.3294,  0.1412, -0.1882],
        [ 0.2353, -0.2824,  0.1412, -0.2353]]) 

2) Read Weights by get_hidden_params["hidden_weights_0"]:
    tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]) 

3) Read Weights by get_hidden_params["hidden_weights_1"]:
    tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]) 

Hidden Parameters:
    odict_keys(['max_bound_0', 'min_bound_0', 'dwmin_up_0', 'dwmin_down_0', 'decay_scales_0', 'diffusion_rates_0', 'reset_bias_0', 'slope_up_0', 'slope_down_0', 'hidden_weights_0', 'max_bound_1', 'min_bound_1', 'dwmin_up_1', 'dwmin_down_1', 'decay_scales_1', 'diffusion_rates_1', 'reset_bias_1', 'slope_up_1', 'slope_down_1', 'hidden_weights_1']) 

0) Read Weights by get_weight(realistic=False):
    tensor([[ 0.3489,  0.1591, -0.3749, -0.4784],
        [ 0.0776,  0.2908,  0.1403,  0.1401],
        [ 0.2895,  0.1349,  0.2243, -0.1773]]) 

1) Read Weights by forward():
    tensor([[ 0.3765,  0.2824, -0.3765, -0.4706],
        [ 0.1412,  0.3294,  0.1882,  0.1412],
        [ 0.2353,  0.0941,  0.2824, -0.1882]]) 

2) Read Weights by get_hidden_params["hidden_weights_0"]:
    tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]) 

3) Read Weights by get_hidden_params["hidden_weights_1"]:
    tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]) 

So, I want to know what's going wrong.

One more question: as you can see, the weights read by forward() don't match those read by get_weight(realistic=False). Even after I set tile.forward.is_perfect=True, the results still don't match exactly. How can I turn off the non-ideal factors?

maljoras commented 3 years ago

Hi @chaeunl, thanks for the detailed experiments!

Regarding weight being different
What you are seeing in the forward pass is the noisy MAC. You haven't set the forward to perfect in the above example, and you cannot just set tile.forward.is_perfect once the tile is constructed. All parameters need to be set before the construction of the tile (i.e. before creating the layer). You need to give the is_perfect parameter in the RPUConfig; then the weights will be identical.

For instance:

rpu_config = UnitCellRPUConfig(
    device=TransferCompound(
        unit_cell_devices=[
            LinearStepDevice(),
            LinearStepDevice()],
        units_in_mbatch=True,
        transfer_every=1,
        n_cols_per_transfer=1,
        gamma=0.0,
        scale_transfer_lr=True,
        transfer_lr=2.0),
    forward=IOParameters(is_perfect=True))

Before one creates the tile (but not after creation), one can conveniently modify the fields of the rpu_config; for instance,

 rpu_config.backward.is_perfect = True

would set the backward pass to perfect as well when a tile is created using this rpu_config. After the construction of the tile/layer, these parameters are fixed and cannot be changed, as they define the constructed tile (simply re-constructing a tile/layer with a modified rpu_config is the way to go).
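Putting it together, a minimal sketch (layer sizes are illustrative):

from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import LinearStepDevice
from aihwkit.simulator.configs.utils import IOParameters

rpu_config = SingleRPUConfig(device=LinearStepDevice(),
                             forward=IOParameters(is_perfect=True))
rpu_config.backward.is_perfect = True  # fine here: the tile is not built yet

layer = AnalogLinear(4, 3, bias=False, rpu_config=rpu_config)  # now fixed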

As seen in the modified example below, the weights are indeed exactly the same, as expected, when the forward pass is set to perfect.

Regarding hidden weights being all 0

You are using the special case of a transfer compound with gamma=0. In this case, the second weight matrix (C) is always just the one you get with get_weights. The hidden_weights_1 are always set to zero and ignored. See my explanation in https://github.com/IBM/aihwkit/issues/106#issuecomment-750293021.

The first hidden weights, hidden_weights_0, reflect the A matrix and are set to zero at the beginning. You need to run a couple of update cycles; then you will see that they differ from zero. See the example below.

If you set gamma to something other than zero, then both hidden weights are used (in other words, hidden_weights_1 corresponds to C), and the get_weights weight W is given by the convex combination of the two: W = gamma*A + (1-gamma)*C.
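A sketch of checking that relation for gamma != 0 (device settings are illustrative):

import torch
from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import UnitCellRPUConfig
from aihwkit.simulator.configs.devices import LinearStepDevice, TransferCompound
from aihwkit.simulator.configs.utils import IOParameters

gamma = 0.5
layer = AnalogLinear(4, 3, bias=False,
                     rpu_config=UnitCellRPUConfig(
                         device=TransferCompound(
                             unit_cell_devices=[LinearStepDevice(),
                                                LinearStepDevice()],
                             gamma=gamma),
                         forward=IOParameters(is_perfect=True)))
tile = layer.analog_tile
params = tile.get_hidden_parameters()
w_combined = (gamma * params["hidden_weights_0"]
              + (1 - gamma) * params["hidden_weights_1"])
weights, _ = tile.get_weights(realistic=False)
print(torch.allclose(w_combined, weights, atol=1e-4))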

Modified example

import numpy as np
import torch
from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import SingleRPUConfig, UnitCellRPUConfig
from aihwkit.simulator.configs.devices import LinearStepDevice, TransferCompound, SoftBoundsDevice
from aihwkit.simulator.configs.utils import IOParameters

in_size, out_size = 4, 3

TransferCompoundwLin_layer = AnalogLinear(in_features=in_size,
                                          out_features=out_size,
                                          bias=False,
                                          rpu_config=UnitCellRPUConfig(
                                              device=TransferCompound(
                                                  unit_cell_devices=[
                                                      LinearStepDevice(),
                                                      LinearStepDevice()],
                                                  units_in_mbatch=True,
                                                  transfer_every=1,
                                                  n_cols_per_transfer=1,
                                                  gamma=0.0,
                                                  scale_transfer_lr=True,
                                                  transfer_lr=2.0,),
                                              forward=IOParameters(is_perfect=True))
                                          )

TransferCompoundwSoft_layer = AnalogLinear(in_features=in_size,
                                           out_features=out_size,
                                           bias=False,
                                           rpu_config=UnitCellRPUConfig(
                                               device=TransferCompound(
                                                   unit_cell_devices=[
                                                       SoftBoundsDevice(),
                                                       SoftBoundsDevice()],
                                                   units_in_mbatch=True,
                                                   transfer_every=1,
                                                   n_cols_per_transfer=1,
                                                   gamma=0.0,
                                                   scale_transfer_lr=True,
                                                   transfer_lr=2.0,),
                                               forward=IOParameters(is_perfect=True))
                                           )

LinearStepDevice_layer = AnalogLinear(in_features=in_size,
                                      out_features=out_size,
                                      bias=False,
                                      rpu_config=SingleRPUConfig(
                                          device=LinearStepDevice(),
                                          forward=IOParameters(is_perfect=True))
                                      )

def init_weight_test(tile):
    in_size, out_size = tile.in_size, tile.out_size

    # do some random update to modify the first hidden weights
    tile.update(torch.randn((1,in_size)), torch.randn((1,out_size)))

    hidden_params = tile.get_hidden_parameters()
    revealed_weights = tile.get_weights(realistic=False)[0]
    forwarded_weights = tile.forward(torch.eye(in_size, in_size)).T

    print("Hidden Parameters:"+"\n   ", hidden_params.keys(), "\n"+"\n")

    idx = 0
    print("{}) ".format(idx)+"Read Weights by get_weight(realistic=False):"+"\n   ", revealed_weights, "\n")
    idx += 1
    print("{}) ".format(idx)+"Read Weights by forward():"+"\n   ", forwarded_weights, "\n")

    for k in hidden_params.keys():
        if "hidden_weights" in k:
            idx += 1
            hidden_weights = hidden_params[k]
            print("{}) ".format(idx)+"Read Weights by get_hidden_params[\"{}\"]:".format(k)+"\n   ", hidden_weights, "\n")

tile = LinearStepDevice_layer.analog_tile
init_weight_test(tile)

tile = TransferCompoundwLin_layer.analog_tile
init_weight_test(tile)

tile = TransferCompoundwSoft_layer.analog_tile
init_weight_test(tile)

The output is:

Hidden Parameters:
    odict_keys(['max_bound', 'min_bound', 'dwmin_up', 'dwmin_down', 'decay_scales', 'diffusion_rates', 'reset_bias', 'slope_up', 'slope_down']) 

0) Read Weights by get_weight(realistic=False):
    tensor([[ 0.0418, -0.4964, -0.1204, -0.3613],
        [ 0.4299,  0.1323,  0.1962, -0.4555],
        [-0.2605, -0.0267, -0.4995,  0.0778]]) 

1) Read Weights by forward():
    tensor([[ 0.0418, -0.4964, -0.1204, -0.3613],
        [ 0.4299,  0.1323,  0.1962, -0.4555],
        [-0.2605, -0.0267, -0.4995,  0.0778]]) 

Hidden Parameters:
    odict_keys(['max_bound_0', 'min_bound_0', 'dwmin_up_0', 'dwmin_down_0', 'decay_scales_0', 'diffusion_rates_0', 'reset_bias_0', 'slope_up_0', 'slope_down_0', 'hidden_weights_0', 'max_bound_1', 'min_bound_1', 'dwmin_up_1', 'dwmin_down_1', 'decay_scales_1', 'diffusion_rates_1', 'reset_bias_1', 'slope_up_1', 'slope_down_1', 'hidden_weights_1']) 

0) Read Weights by get_weight(realistic=False):
    tensor([[-0.3611, -0.4577,  0.0604,  0.0533],
        [ 0.2911, -0.3750,  0.0140, -0.0476],
        [-0.3215, -0.1062, -0.2283, -0.0879]]) 

1) Read Weights by forward():
    tensor([[-0.3611, -0.4577,  0.0604,  0.0533],
        [ 0.2911, -0.3750,  0.0140, -0.0476],
        [-0.3215, -0.1062, -0.2283, -0.0879]]) 

2) Read Weights by get_hidden_params["hidden_weights_0"]:
    tensor([[-0.0205, -0.0079,  0.0009, -0.0197],
        [ 0.0196,  0.0070, -0.0022,  0.0190],
        [ 0.0015,  0.0000,  0.0000,  0.0026]]) 

3) Read Weights by get_hidden_params["hidden_weights_1"]:
    tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]) 

Hidden Parameters:
    odict_keys(['max_bound_0', 'min_bound_0', 'dwmin_up_0', 'dwmin_down_0', 'decay_scales_0', 'diffusion_rates_0', 'reset_bias_0', 'slope_up_0', 'slope_down_0', 'hidden_weights_0', 'max_bound_1', 'min_bound_1', 'dwmin_up_1', 'dwmin_down_1', 'decay_scales_1', 'diffusion_rates_1', 'reset_bias_1', 'slope_up_1', 'slope_down_1', 'hidden_weights_1']) 

0) Read Weights by get_weight(realistic=False):
    tensor([[-0.2067, -0.3304, -0.3501, -0.0262],
        [-0.0615, -0.1146,  0.4494, -0.1935],
        [ 0.3707,  0.3082, -0.1360, -0.0947]]) 

1) Read Weights by forward():
    tensor([[-0.2067, -0.3304, -0.3501, -0.0262],
        [-0.0615, -0.1146,  0.4494, -0.1935],
        [ 0.3707,  0.3082, -0.1360, -0.0947]]) 

2) Read Weights by get_hidden_params["hidden_weights_0"]:
    tensor([[ 0.0197,  0.0162,  0.0190,  0.0069],
        [-0.0167, -0.0091, -0.0175, -0.0075],
        [ 0.0127,  0.0058,  0.0076,  0.0060]]) 

3) Read Weights by get_hidden_params["hidden_weights_1"]:
    tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

chaeunl commented 3 years ago

@maljoras, thank you! I misunderstood get_hidden_parameters()["hidden_weights_1"]. I confirmed that when I set gamma to 1.0, get_hidden_parameters()["hidden_weights_1"] equals get_weights(realistic=False) and forward() (in the perfect case).