Hi, thanks for the great work! I am very interested in it.
However, I am new to quantization and have some questions about the round_to_fixed function in deepshift.utils, lines 7-18.
In line 15, torch.floor(input/delta) rounds the fp32 input down to the nearest 16-bit integer. In my opinion, the clamp should be applied right after that, so that the integers are clamped to the range (min_val, max_val) before they are scaled back by delta. That is, changing lines 15-17 to the following:
rounded = torch.floor(input / delta)
rounded = torch.clamp(rounded, min_val, max_val)
rounded = rounded * delta
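To make the comparison concrete, here is a minimal scalar sketch of the two orderings as I understand them. It is not the repository code: I am assuming delta = 2^(-fraction_bits), min_val = -2^(integer_bits-1) and max_val = 2^(integer_bits-1) - 1, and that the current implementation clamps after multiplying by delta. The function names and the plain-Python stand-ins for torch.floor and torch.clamp are hypothetical.

```python
import math


def clamp(x, lo, hi):
    # Scalar stand-in for torch.clamp.
    return max(lo, min(x, hi))


def fixed_point_params(integer_bits, fraction_bits):
    # Assumed definitions, not copied from deepshift.utils.
    delta = 2.0 ** -fraction_bits
    bound = 2.0 ** (integer_bits - 1)
    return delta, -bound, bound - 1


def clamp_after_scaling(x, integer_bits=16, fraction_bits=16):
    # Ordering I assume the current code uses: floor, scale back, then clamp.
    delta, min_val, max_val = fixed_point_params(integer_bits, fraction_bits)
    rounded = math.floor(x / delta) * delta
    return clamp(rounded, min_val, max_val)


def clamp_before_scaling(x, integer_bits=16, fraction_bits=16):
    # Ordering proposed above: floor, clamp the integer, then scale back.
    delta, min_val, max_val = fixed_point_params(integer_bits, fraction_bits)
    rounded = math.floor(x / delta)
    rounded = clamp(rounded, min_val, max_val)
    return rounded * delta


if __name__ == "__main__":
    # Small bit widths so the numbers are easy to follow: delta = 0.25, range [-8, 7].
    for x in (1.3, 3.3, 100.0):
        print(x, clamp_after_scaling(x, 4, 2), clamp_before_scaling(x, 4, 2))
    # 1.3   -> 1.25 and 1.25
    # 3.3   -> 3.25 and 1.75
    # 100.0 -> 7.0  and 1.75
```

Under these assumptions the two versions agree only for small inputs. With the same min_val and max_val, clamping before scaling limits the integer to [-8, 7] and therefore the final value to [-2, 1.75], whereas clamping after scaling limits the final value to [-8, 7]. The two would match only if the bounds were divided by delta before clamping.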
Could you comment on the difference between these two implementations? Thanks!