In the PiGDM paper (Sec. A.1, Algorithm 1) it says that we need to scale the guidance term by $\sqrt{\alpha_t}$, but in the code we scale by $\sqrt{\alpha_t} \cdot \sqrt{\alpha_{t-1}}$. If we scale only by $\sqrt{\alpha_t}$, we get NaNs during inference because the guidance term becomes too large. Where does this additional $\sqrt{\alpha_{t-1}}$ factor come from?
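To make the discrepancy concrete, here is a minimal sketch of the DDIM-style update I have in mind (the function name, signature, and variable names are my own simplification for illustration, not the repo's actual code):

```python
import math
import torch

def pigdm_update(x0_hat: torch.Tensor, eps_hat: torch.Tensor,
                 guidance: torch.Tensor,
                 alpha_t: float, alpha_prev: float,
                 extra_scale: bool = True) -> torch.Tensor:
    """One deterministic DDIM-style step with a PiGDM guidance term.

    With extra_scale=False the guidance is weighted by sqrt(alpha_t),
    as written in Algorithm 1 of the paper; with extra_scale=True it is
    weighted by sqrt(alpha_t) * sqrt(alpha_{t-1}), as in the code.
    """
    # Deterministic DDIM mean built from the predicted x_0 and noise.
    x_prev = (math.sqrt(alpha_prev) * x0_hat
              + math.sqrt(1.0 - alpha_prev) * eps_hat)

    # Guidance weight: the paper's sqrt(alpha_t) ...
    w = math.sqrt(alpha_t)
    if extra_scale:
        # ... times the extra sqrt(alpha_{t-1}) factor I am asking about.
        w *= math.sqrt(alpha_prev)

    return x_prev + w * guidance
```

Since $\sqrt{\alpha_{t-1}} < 1$, dropping the extra factor makes the guidance weight larger at every step, which is consistent with the NaNs I see.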