IBM / aihwkit

IBM Analog Hardware Acceleration Kit
https://aihwkit.readthedocs.io
Apache License 2.0

Why do Binary Neural Networks have poor inference accuracy? #362

Closed dingandrew closed 2 years ago

dingandrew commented 2 years ago

Description

Typically, when we convert a pretrained model into its analog equivalent, the accuracy drops by less than 5%. Of course, this accuracy drop depends on the RPU configuration and factors such as ADC resolution. However, binary neural networks show surprisingly large accuracy drops. Binary neural networks have weights that are either -1 or 1, so an RRAM device would only need a resolution of 1 bit: the positive crossbar array holds 1s and 0s, while the negative crossbar array handles the -1 values.
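For intuition only (this is an illustration, not aihwkit code), the mapping described above boils down to a differential conductance pair per weight, sketched here with a hypothetical helper to_conductance_pair:

# Illustration only (not aihwkit internals): a binary weight maps onto a
# differential conductance pair, so each device is either fully on or fully off.
def to_conductance_pair(w: int, g_max: float = 25.0) -> tuple:
    """Return (G_plus, G_minus) in microsiemens for a binary weight w in {-1, +1}."""
    return (g_max, 0.0) if w > 0 else (0.0, g_max)

print(to_conductance_pair(+1))  # (25.0, 0.0) -> positive array on
print(to_conductance_pair(-1))  # (0.0, 25.0) -> negative array on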

BNNs should therefore be very insensitive to device non-idealities and require fewer hardware resources. However, it appears that BNNs actually need more ADC and DAC resolution and still suffer a greater drop in accuracy.

How to reproduce

from aihwkit.simulator.configs import InferenceRPUConfig
from aihwkit.simulator.configs.utils import MappingParameter, WeightNoiseType
from aihwkit.inference import PCMLikeNoiseModel, GlobalDriftCompensation
from aihwkit.nn.conversion import convert_to_analog

# Device used in the RPU tile
mapping = MappingParameter(max_input_size=128,  # analog tile size
                           max_output_size=128,
                           digital_bias=True,
                           weight_scaling_omega=0.6)

rpu_config = InferenceRPUConfig(mapping=mapping)
rpu_config.forward.inp_res = 1 / (2**7 - 2)  # 7-bit DAC discretization.
rpu_config.forward.out_res = 1 / (2**9 - 2)  # 9-bit ADC discretization.
rpu_config.forward.w_noise_type = WeightNoiseType.ADDITIVE_CONSTANT
rpu_config.forward.w_noise = 0.02   # Some short-term w-noise.
rpu_config.forward.out_noise = 0.02 # Some output noise.

# Specify the noise model to be used for inference only.
rpu_config.noise_model = PCMLikeNoiseModel(g_max=25.0)

# Specify the drift compensation.
rpu_config.drift_compensation = GlobalDriftCompensation()

model_analog = convert_to_analog(model_binary.to(device), rpu_config, realistic_read_write=False)

# results = analog_info.analog_summary(model_analog, (128, 3, 32, 32))  # only works with cpu
print(rpu_config)
print(model_analog)

The pretrained binary model experiences a 20% drop in accuracy, whereas a regular pretrained ResNet-18 model using the same configuration above only loses 3% accuracy.

Expected behavior

I would expect binary neural networks to retain better accuracy, because the weights are either -1 or 1 and are therefore much easier for the RRAM devices to represent.

maljoras commented 2 years ago

Hi @dingandrew ,

many thanks for sharing your observation! I don't think that this is actually a "bug"; it might instead be the correct observation. It is generally misleading to directly compare heavily quantized networks with analog networks. The crucial difference is that analog computation is non-deterministic. If weights are binary, they can give rise to "counting codes", that is, e.g., if an output is exactly n, assume class X. These can be very unreliable in the presence of noise, since their signal-to-noise ratio might be very bad. This observation and topic could be very interesting for a research paper if expanded upon.
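To make the signal-to-noise argument concrete, here is a tiny illustrative sketch (hypothetical numbers, not aihwkit code) of how an integer "counting code" decision can flip under a small amount of analog output noise:

import numpy as np

# Illustrative only: with binary weights the pre-activations are integer counts,
# so two classes can differ by a single count; mild analog noise then flips
# the decision surprisingly often.
rng = np.random.default_rng(0)
clean_counts = np.array([10.0, 9.0])                  # class 0 wins by one count
noisy_counts = clean_counts + rng.normal(0.0, 0.7, size=(10_000, 2))
flip_rate = np.mean(noisy_counts.argmax(axis=1) != 0)
print(f"decision flipped in {flip_rate:.1%} of noisy reads")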

Also, since many weights are now at gmax, the amount of current might be very high. Note that an output clipping threshold is assumed (out_bound), although its effect is smaller when bound management is on.

Finally, if you want to map -1, 1 to gmax initially, you should set weight_scaling_omega to 1 and ideally set rpu_config.mapping.learn_out_scaling_alpha=True (see here) for improved HWA training. If one assumes a column-wise digital scale, one should additionally set rpu_config.mapping.weight_scaling_omega_columnwise (see here). This might improve the accuracy. However, I suspect that it will stay below the non-quantized DNN (especially if you also enable these features for that one).
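For concreteness, this is roughly what the suggestion looks like applied to the rpu_config from the issue (attribute names taken from the comment above; they may differ between aihwkit versions):

# Applied to the InferenceRPUConfig defined earlier in this issue.
rpu_config.mapping.weight_scaling_omega = 1.0               # map |w| = 1 to the full conductance range
rpu_config.mapping.weight_scaling_omega_columnwise = True   # column-wise digital output scales
rpu_config.mapping.learn_out_scaling_alpha = True           # learn the output scales during HWA training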

dingandrew commented 2 years ago

Hi @maljoras,

Thank you for your response; it has been very helpful. I have set weight_scaling_omega to 1 and turned on rpu_config.mapping.weight_scaling_omega_columnwise. The accuracy of the BNN is very much improved.

However, could you explain what exactly these two parameters are doing? I have seen that many examples set weight_scaling_omega to 0.6, and I am not sure why 0.6 or 1 would be better or worse for full-precision or quantized networks.

Could you also share some insight on how AIHwKit maps full-precision or quantized weights to a crossbar array? I would assume that with full-precision weights, the RRAM devices won't have enough resolution. I have tried going through the source code and found the method convert_to_conductances, but I don't think this method is actually being used.

Best,

Andrew

maljoras commented 2 years ago

Hi @dingandrew, the weight scaling simply divides the weight matrix by its absolute maximum and rescales it so that the maximum analog weight equals omega. So the analog weight maximum will be 0.6 if weight_scaling_omega is 0.6, and digital output scales are used to recover the correct overall weight matrix.
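As a purely numerical illustration of that scaling idea (a toy example, not the aihwkit implementation):

import numpy as np

# Toy example: scale the weights so the analog maximum equals omega,
# and use a digital output scale to recover the original magnitude.
W = np.array([[2.5, -1.0], [0.5, -2.0]])
omega = 0.6
w_abs_max = np.abs(W).max()              # 2.5
W_analog = W / w_abs_max * omega         # analog weights, max magnitude now 0.6
out_scale = w_abs_max / omega            # applied digitally after the analog MVM
np.testing.assert_allclose(W_analog * out_scale, W)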

In your case, where you already have -1, 1 binary weights, you actually do not need it (you could just set weight_scaling_omega to 0.0 to turn it off for binary weights). But make sure that you set w_min to -1.0 and w_max to 1.0. Note that there is also a weight-max variability, which you might want to set to w_min/max_dtod=0.0.
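A minimal sketch of where such device parameters live, assuming a pulsed training device such as ConstantStepDevice (the inference setup in this issue may expose the bounds differently):

from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import ConstantStepDevice

# Bounds match the binary weights exactly; no device-to-device variability of the bounds.
device = ConstantStepDevice(w_min=-1.0, w_max=1.0,
                            w_min_dtod=0.0, w_max_dtod=0.0)
rpu_config_train = SingleRPUConfig(device=device)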

In fact, just constructing the analog model by conversion does not yet simulate the programming of the weights onto the conductances. This only happens once you call model.program_analog_weights() and/or model.drift_analog_weights(t_inference), where, e.g., t_inference=3600.0 would add drift for one hour after programming (assuming PCM devices).

When reporting inference accuracy, you should always program and/or drift the weights. In this case the conductance converters are used. This is important because there is no write noise when evaluating without it, and your binary DNN will then perform better than it would in reality.
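Following that advice, the evaluation flow would look roughly like this (method names as given in the comment above; depending on the aihwkit version they may need to be called per analog layer, as in the helper later in this thread):

# Program the weights (adds write noise) and apply one hour of drift
# before measuring inference accuracy.
model_analog.eval()
model_analog.program_analog_weights()
model_analog.drift_analog_weights(3600.0)  # t_inference in seconds after programming
# ... then run the usual test-set evaluation loop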

dingandrew commented 2 years ago

Hi @maljoras,

Wow! Thank you for all your help, the documentation and source code make much more sense now :). I had been using the method convert_to_analog() to convert a pre-trained model and had only been playing with the rpu_config.forward parameters, thus only using one type of non-linearity.

Now I am using this method to actually program the second non-linearity.

from torch import nn

from aihwkit.nn.modules.base import AnalogModuleBase


# Recursively program and drift the analog weights of deeply nested models.
def program_model_conductance(module: nn.Module, t_inference: float = 3600.0) -> None:
    if not list(module.children()):
        if isinstance(module, AnalogModuleBase):
            print("analog: ", module)
            module.eval()
            module.program_analog_weights()
            module.drift_analog_weights(t_inference)
        else:
            print("digital: ", module)
    else:
        for _, layer in module.named_children():
            program_model_conductance(layer, t_inference)  # pass t_inference down


program_model_conductance(model_analog)

Now everything makes more sense, and there are many more parameters to play with.

I have a quick question about g_max. Is there an upper bound on this parameter? If this value were very large, the resistance of the PCM device would approach 0, essentially causing a short circuit. But surely there are physical limitations, as even the low-resistance state of the memristor still has some resistance.

maljoras commented 2 years ago

Hi @dingandrew, g_max is typically on the order of 5-25 microsiemens (at least for PCM). It depends on the material and technology, though.
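A quick sanity check on those numbers (simple arithmetic, not library code): even at the upper end of that range the low-resistance state is far from a short circuit.

# 1 / g_max gives the on-state resistance.
for g_max in (5e-6, 25e-6):  # siemens
    print(f"g_max = {g_max * 1e6:.0f} uS  ->  R_on = {1 / g_max / 1e3:.0f} kOhm")
# g_max = 5 uS   ->  R_on = 200 kOhm
# g_max = 25 uS  ->  R_on = 40 kOhm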

dingandrew commented 2 years ago

Thank You Very Much @maljoras !