Open XA23i opened 3 weeks ago
Thank you very much for your questions. For the first question, you are right, this is a typo. We will change "lowest" to "highest" in the revision.
For the second question, the hard rounding function $\sigma'(\nu)$ returns 0 or 1, and we subtract 0.5 from it so that the values being merged into the weights become either $0.5s$ or $-0.5s$. This does not change the result of RTN (round-to-nearest) for variables whose learned rounding agrees with the original RTN; it only changes the variables whose learned rounding is flipped.
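As a tiny numerical sketch of the merge (placeholder values, not code from the paper; `s` is the step size, `theta` the weights, and `sigma_hard` the hard rounding decisions $\sigma'(\nu)$):

```python
import numpy as np

# Sketch with placeholder values: s is the quantization step size,
# theta the full-precision weights, and sigma_hard the hard rounding
# decisions sigma'(nu) in {0, 1} learned for those weights.
s = 0.1
theta = np.array([0.23, 0.27])      # plain RTN would give 0.2 and 0.3
sigma_hard = np.array([0.0, 0.0])   # agrees with RTN for 0.23, flips it for 0.27

# Merge the learned rounding into the weights: a shift of +-0.5*s.
theta_merged = theta + s * (sigma_hard - 0.5)

rtn_original = s * np.round(theta / s)         # [0.2, 0.3]
rtn_merged   = s * np.round(theta_merged / s)  # [0.2, 0.2]
learned      = s * (np.floor(theta / s) + sigma_hard)

# RTN on the merged weights reproduces the learned rounding: the weight
# whose learned rounding agrees with RTN is unchanged, and only the
# flipped one changes.
assert np.allclose(rtn_merged, learned)
```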
Thank you for your response! I am still confused about the second question. According to this formula, the learnable factor $\alpha$ is added alongside the existing quantization process and optimized through progressive adaptive rounding. In dequantization, why not simply update $\theta = \theta + s \cdot \alpha$?
Note that in Eq. 4 we use the floor operation, but in the end we want to change it back to the standard round operation as in Eq. 1, so we need to subtract 0.5.
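To make the bookkeeping explicit (a sketch in the notation of this thread, leaving out any clipping terms): standard rounding is a shifted floor, $\operatorname{round}(x) = \lfloor x + 0.5 \rfloor$, so applying it to the merged weight $\theta + s(\sigma'(\nu) - 0.5)$ gives

$$\operatorname{round}\!\left(\frac{\theta + s(\sigma'(\nu) - 0.5)}{s}\right) = \left\lfloor \frac{\theta}{s} + \sigma'(\nu) \right\rfloor = \left\lfloor \frac{\theta}{s} \right\rfloor + \sigma'(\nu),$$

where the last step uses that $\sigma'(\nu) \in \{0, 1\}$ is an integer, i.e. exactly the learned floor-based rounding.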
I got it, thank you!
Hi, this is a novel idea and paper. I have some questions below.
Why select the lowest HS to be $S_{hard}$? I think the highest HS means the values are close to 0 or 1, so it is natural to convert those to $S_{hard}$, since they no longer need optimization.
By the way, after optimization, why do we need to subtract 0.5 here?