Intelligent-Computing-Lab-Yale / TesseraQ


Some questions about paper #2

Open XA23i opened 3 weeks ago

XA23i commented 3 weeks ago

Hi, this is a novel idea and a nice paper. I have some questions below.

[screenshot from the paper] Why select the lowest HS to be $S_{hard}$? I think the highest HS means the values are close to 0 or 1, so it is natural to convert them to $S_{hard}$, since they no longer need optimization.

[screenshot from the paper] By the way, after optimization, why do we need to subtract 0.5 here?

yhhhli commented 3 weeks ago

Thank you very much for your questions. For the first question, you are right, this is a typo. We will change the "lowest" to "highest" in the revision.

For the second question, the hard rounding function $\sigma'(\nu)$ returns 0 or 1, and we subtract 0.5 so that the values merged into the weights become either $0.5s$ or $-0.5s$. These offsets do not change the result of RTN (rounding-to-nearest) when the learned rounding agrees with the original RTN; they only change the variables that learned a flipped rounding.
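To make this concrete, here is a minimal NumPy sketch (not code from this repository; `theta`, `s`, and `h` are illustrative names) showing that merging $s(\sigma'(\nu) - 0.5)$ into the weight lets a later standard rounding pass reproduce the learned floor-plus-bit quantization:

```python
import numpy as np

# Hypothetical full-precision weights, step size, and learned hard rounding bits.
theta = np.array([2.3, 2.7, -1.2])   # full-precision weights
s = 1.0                               # quantization step size
h = np.array([0.0, 1.0, 1.0])        # sigma'(nu) in {0, 1}; h=1 flips the rounding for 2.3-style values

# Floor-based quantization with the learned bit (Eq. 4 style).
q_floor = np.floor(theta / s) + h

# Merge the decision into the weight: the offsets are +0.5*s or -0.5*s.
theta_merged = theta + s * (h - 0.5)

# Standard RTN (Eq. 1 style) on the merged weight yields the same integers.
# Round-half-up is written explicitly via floor to avoid banker's rounding on ties.
q_rtn = np.floor(theta_merged / s + 0.5)

print(q_floor)  # [ 2.  3. -1.]
print(q_rtn)    # [ 2.  3. -1.]
```

For the first entry the learned bit agrees with RTN (2.3 rounds down either way), so the $-0.5s$ offset leaves the result unchanged; for the second entry only a flipped decision would change the outcome.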

XA23i commented 3 weeks ago

Thank you for your response! I am still confused about the second question. [screenshot from the paper] According to this formula, the learnable factor $\alpha$ is added on top of the existing quantization process and optimized through progressive adaptive rounding. In dequantization, why not simply update $\theta \leftarrow \theta + s \cdot \alpha$?

yhhhli commented 3 weeks ago

Note that in Eq. 4 we use the floor operation, but in the end we want to switch back to the standard round operation of Eq. 1. Since $\lfloor x \rfloor + \alpha = \mathrm{round}(x + \alpha - 0.5)$ for $\alpha \in \{0, 1\}$ (away from exact ties), we need to subtract 0.5 when merging.
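A quick sanity check of that identity (again an illustrative sketch, not repository code; ties at exact .5 fractions are the only place rounding conventions could differ, which essentially never occurs for real-valued weights):

```python
import numpy as np

# Verify floor(x) + a == round(x + a - 0.5) for a in {0, 1} on random non-tie values.
rng = np.random.default_rng(0)
x = rng.uniform(-4, 4, size=10_000)
for a in (0.0, 1.0):
    lhs = np.floor(x) + a
    rhs = np.round(x + a - 0.5)
    assert np.array_equal(lhs, rhs)
print("floor + a matches standard rounding after the -0.5 shift")
```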

XA23i commented 3 weeks ago

I got it, thank you!