IBM / aihwkit

IBM Analog Hardware Acceleration Kit
https://aihwkit.readthedocs.io
MIT License

Model Initialized outside [w_min, w_max]: Pinpointing the bug in issue #604 #635

Open Zhaoxian-Wu opened 7 months ago

Zhaoxian-Wu commented 7 months ago

Description

As discussed in #604, the model weights sometimes fall outside [w_min, w_max].

Bug Pinpoint

The bug is caused by the incorrect initialization of w_min_bound_ and w_max_bound_ (see the code). The following code snippet seems to be designed for the case where shared weights are used and PulsedDevice.perfect_bias is turned on. When PulsedDevice.perfect_bias is on, the bounds for the last column of the weights (j == x_size_ - 1) are widened by a factor of 100, yielding incorrect active regions and weights.

// perfect bias
if ((par.perfect_bias) && (j == this->x_size_ - 1)) {
  w_scale_up_[i][j] = par.dw_min;
  w_scale_down_[i][j] = par.dw_min;
  w_min_bound_[i][j] = (T)100. * par.w_min; // essentially no bound
  w_max_bound_[i][j] = (T)100. * par.w_max; // essentially no bound
}
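To make the effect of this branch concrete, here is a minimal standalone sketch. It is not the aihwkit/RPUCuda implementation: DeviceParams, Bounds, init_bounds and count_widened_columns are hypothetical names, the default values of w_min and w_max are arbitrary, and device-to-device variation of the bounds is ignored. It only reproduces the idea that, with perfect_bias set, the last column gets bounds 100 times wider than the nominal [w_min, w_max].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the device parameters used in the snippet above.
// The default values are arbitrary and chosen only for illustration.
struct DeviceParams {
  float w_min = -0.6f;
  float w_max = 0.6f;
  bool perfect_bias = false;
};

// Per-element weight bounds of a d_size x x_size tile.
struct Bounds {
  std::vector<std::vector<float>> w_min_bound;
  std::vector<std::vector<float>> w_max_bound;
};

// Fill the bounds with the nominal values, then mirror the quoted branch:
// if perfect_bias is set, the last column (j == x_size - 1) gets bounds
// that are 100 times wider, i.e. it is effectively unbounded.
Bounds init_bounds(const DeviceParams &par, std::size_t d_size, std::size_t x_size) {
  Bounds b;
  b.w_min_bound.assign(d_size, std::vector<float>(x_size, par.w_min));
  b.w_max_bound.assign(d_size, std::vector<float>(x_size, par.w_max));
  for (std::size_t i = 0; i < d_size; ++i) {
    for (std::size_t j = 0; j < x_size; ++j) {
      if (par.perfect_bias && j == x_size - 1) {
        b.w_min_bound[i][j] = 100.0f * par.w_min;
        b.w_max_bound[i][j] = 100.0f * par.w_max;
      }
    }
  }
  return b;
}

// Count the columns of the first row whose upper bound exceeds the nominal w_max.
std::size_t count_widened_columns(const Bounds &b, const DeviceParams &par) {
  std::size_t n = 0;
  if (b.w_max_bound.empty()) {
    return n;
  }
  for (float bound : b.w_max_bound[0]) {
    if (bound > par.w_max) {
      ++n;
    }
  }
  return n;
}
```

Calling init_bounds with perfect_bias enabled on a 2 x 3 tile leaves the first two columns at the nominal bounds and widens only the third; with perfect_bias disabled, count_widened_columns returns 0.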

TODO

I was trying to fix the bug directly, but I found that I couldn't control shared_weight through the AnalogLinear initialization. I guess we should add a flag here to better control the shared weight behavior.

maljoras commented 7 months ago

Thanks for raising this issue. perfect_bias is indeed an "old" parameter setting that should only be relevant for analog_bias. Since we have digital_bias now, it should actually be deleted. It has nothing to do with shared_weights, so that is not relevant here.

Zhaoxian-Wu commented 7 months ago

I see. It seems that using digital_bias instead is a more natural solution. But what does shared_weights do? Does that mean multiple tiles share the same torch array?

maljoras commented 7 months ago

Shared weights means that the memory of the tile is handled by torch (and not from within C++). This means that the backward pass etc. is also handled by torch. Note that the RPUCuda library is capable of handling the memory of the tile arrays and data internally (as it is an independent library that can also be used independently of PyTorch).
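The difference between memory handled inside the library and memory handled from outside can be illustrated with a small sketch. This is only an analogy under stated assumptions: ToyTile and its methods are hypothetical and are not part of RPUCuda or aihwkit, and a plain std::vector stands in for a buffer owned by a framework such as torch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile, used only to illustrate who owns the weight memory.
class ToyTile {
public:
  // Internal mode: the tile allocates and owns its weight buffer.
  explicit ToyTile(std::size_t size)
      : owned_(size, 0.0f), data_(owned_.data()), size_(size), shared_(false) {}

  // Shared mode: the tile borrows a buffer owned by the caller
  // (a stand-in for memory managed by an external framework).
  ToyTile(float *external, std::size_t size)
      : data_(external), size_(size), shared_(true) {}

  // Copying would leave data_ pointing into another object's buffer.
  ToyTile(const ToyTile &) = delete;
  ToyTile &operator=(const ToyTile &) = delete;

  // Apply the same update to every weight, in whichever buffer is in use.
  void add_to_all(float delta) {
    for (std::size_t i = 0; i < size_; ++i) {
      data_[i] += delta;
    }
  }

  float at(std::size_t i) const { return data_[i]; }
  bool is_shared() const { return shared_; }

private:
  std::vector<float> owned_;  // empty in shared mode
  float *data_;               // points into owned_ or into the external buffer
  std::size_t size_;
  bool shared_;
};
```

In shared mode, an update made through the tile is visible directly in the caller's buffer, because both refer to the same memory; in internal mode, the weights are only reachable through the tile.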

Zhaoxian-Wu commented 7 months ago

I see. Thanks for your kind explanation :D

kaoutar55 commented 6 months ago

@maljoras Do we need to remove perfect_bias from the code flow when we are using digital bias? What do you suggest here? It looks like we have a bug that we need to solve.

Borjagodoy commented 3 months ago

I think this could be moved to a new issue, @kaoutar55, since this issue was opened because of a problem that turned out to be a conceptual misunderstanding. We can open a discussion about perfect_bias if you like, @maljoras, and close this issue, because it has actually been resolved, or at least that was my impression. Correct me if I'm wrong, @Zhaoxian-Wu.

kaoutar55 commented 2 months ago

@Zhaoxian-Wu Please look at this and try it on your end with the suggested changes. Let us know if the issue is resolved.