Why doesn't the unconditional predicted noise use the ReferenceNet features? This may create a gap between training and inference: during training we only drop out the hidden states, not the ReferenceNet features.
However, in practice we noticed that enabling classifier-free guidance on the ReferenceNet attention performs better than leaving it off: the generated video has better color. Can anyone explain this?
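To make the question concrete, here is a toy numeric sketch of the two CFG variants being compared. This is not the repo's actual code: `toy_unet`, `cfg`, and the linear stand-in for the denoiser are made up for illustration. It shows one algebraic consequence of excluding the reference features from the unconditional branch: their contribution then gets multiplied by the guidance scale instead of appearing once, which might relate to the color difference observed.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_unet(latents, text_emb, ref_features=None):
    # Stand-in for the denoising UNet: a simple linear function so the
    # sketch is runnable. A real UNet would inject ref_features through
    # the ReferenceNet attention layers instead of adding them.
    out = latents + 0.1 * text_emb
    if ref_features is not None:
        out = out + 0.1 * ref_features
    return out

def cfg(latents, text_emb, ref_features, scale=7.5, ref_in_uncond=True):
    # Conditional branch always sees the ReferenceNet features.
    cond = toy_unet(latents, text_emb, ref_features)
    # The point in question: does the unconditional branch see them too?
    uncond_ref = ref_features if ref_in_uncond else None
    uncond = toy_unet(latents, np.zeros_like(text_emb), uncond_ref)
    # Standard classifier-free guidance combination.
    return uncond + scale * (cond - uncond)

latents = rng.standard_normal(4)
text_emb = rng.standard_normal(4)
ref = rng.standard_normal(4)

with_ref = cfg(latents, text_emb, ref, ref_in_uncond=True)
without_ref = cfg(latents, text_emb, ref, ref_in_uncond=False)

# With ref in both branches, the reference term cancels inside
# (cond - uncond) and appears once; without it, the term is scaled
# by the guidance weight, shifting the output toward the reference.
```

In this linear toy, `without_ref - with_ref` works out to exactly `(scale - 1) * 0.1 * ref`, i.e. the reference contribution is amplified by the guidance scale when it is absent from the unconditional branch. Whether that amplification is what improves color in the real model is exactly the open question here.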
Thanks.