Closed xizaoqu closed 2 weeks ago
Besides, I noticed that in the implementation, the optimization process is divided into 4 parts that are processed separately. Is this done to save GPU memory? Will it affect the results (e.g., produce discontinuous artifacts at the patch boundaries)?
Hi, thanks for your interest in our project. The uploaded code uses Posterior Sampling, corresponding to Eq. 14. Eq. 13 pertains to Directly Guided Sampling, and we will upload the code for this mode in a few days.
Yes, the patch division is implemented to save GPU memory, as we used a GPU with 40 GB. This should not significantly affect the results: during patch division, we ensured each patch overlaps its neighbors to eliminate discontinuities at the boundaries.
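The overlap-and-blend idea can be sketched roughly as below. This is a minimal illustration with hypothetical helper names, not the repository's actual code; it splits a tensor into overlapping vertical strips and averages the overlapping columns when merging, which is one common way to avoid seams at patch boundaries.

```python
import torch

def split_with_overlap(x, patch, overlap):
    """Split a (C, H, W) tensor into vertical strips of width `patch`
    that overlap by `overlap` columns (hypothetical helper)."""
    C, H, W = x.shape
    stride = patch - overlap
    starts = list(range(0, max(W - patch, 0) + 1, stride))
    if starts[-1] + patch < W:          # make sure the right edge is covered
        starts.append(W - patch)
    return [x[:, :, s:s + patch] for s in starts], starts

def merge_with_blend(patches, starts, W):
    """Average the overlapping columns so patch boundaries stay continuous."""
    C, H, P = patches[0].shape
    out = torch.zeros(C, H, W)
    weight = torch.zeros(1, 1, W)
    for p, s in zip(patches, starts):
        out[:, :, s:s + P] += p
        weight[:, :, s:s + P] += 1.0
    return out / weight
```

Because overlapping regions are averaged rather than abutted, small per-patch differences are smoothed out instead of showing up as hard seams.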
The variable `temp_cond_latents` is represented as $\widetilde\mu_t$ in both Eq. 13 and Eq. 14.

Thanks for your reply. I made a typo in question 1: Eq. 13 should be Eq. 12. But in the code, isn't `temp_cond_latents` directly encoded by the VAE, rather than computed by Eq. 12?
Oh, I misunderstood your question.
In Equation 12, $\hat{\mathbf{X}}$ corresponds to the warped image, and `temp_cond_latents` represents the VAE-encoded warped images. The other term in Equation 12, $\mu$, represents the prediction from the diffusion model. $\widetilde\mu$ is a combination of $\hat{\mathbf{X}}$ and $\mu$: in some regions of the image, the pixels come from $\hat{\mathbf{X}}$ (the warped image), while in the other regions, the pixels are generated directly by the diffusion model. In latent space, this means that in some regions the features come from `temp_cond_latents`, while the others are generated by the diffusion model.
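In latent space, that combination amounts to a simple mask-weighted mix. A minimal sketch (hypothetical function name; shapes and mask semantics are assumptions, not the repository's actual code):

```python
import torch

def combine_latents(temp_cond_latents, mu, mask):
    """Eq. 12 viewed in latent space: take warped-image features
    (temp_cond_latents) where mask == 1 and the diffusion model's
    prediction mu where mask == 0. All tensors share one shape."""
    return mask * temp_cond_latents + (1.0 - mask) * mu
```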
In the optimization part, i.e., Equation 14, the loss is computed between $\widetilde\mu$ and the target $\mu$. Since $\widetilde\mu$ is a combination of $\hat{\mathbf{X}}$ and $\mu$, in the implementation the loss is only computed on the regions where the warped image (`temp_cond_latents` in latent space) is used; the regions generated directly by the diffusion model are excluded from the loss computation.
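Restricting the loss to the masked regions can be sketched as a masked mean-squared error (a hypothetical illustration, not the repository's actual loss code):

```python
import torch

def masked_posterior_loss(mu_tilde, mu, mask):
    """Eq. 14 in spirit: MSE between mu_tilde and mu, counted only
    where the warped-image latents were used (mask == 1). Regions
    generated by the diffusion model (mask == 0) contribute nothing."""
    sq_err = (mu_tilde - mu) ** 2 * mask
    return sq_err.sum() / mask.sum().clamp(min=1.0)
```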
In the `step_single` function, the variable $\lambda$ in Equation 12 is used to compute the mask (`top_masks`) that marks the regions where the VAE encoding (`temp_cond_latents`) is used.
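One plausible way such a mask could be derived, sketched here purely as an assumption (the function name, the per-pixel confidence input, and the interpretation of $\lambda$ as a kept fraction are all hypothetical, not taken from the repository):

```python
import torch

def top_lambda_mask(confidence, lam):
    """Keep the top-`lam` fraction of positions ranked by a per-pixel
    confidence score, returning a 0/1 mask. Here `lam` plays the role
    of a threshold like the lambda in Eq. 12 (hypothetical sketch)."""
    k = max(1, int(lam * confidence.numel()))
    thresh = confidence.flatten().topk(k).values[-1]
    return (confidence >= thresh).float()
```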
Hope this makes it clearer.
Understood, thanks.
Hi, thanks for sharing your great work. I have a small question about the implementation of the sampling part. Does the `temp_cond_latents` in `step_in` correspond to the $\mu$ in Eq. 13?