Dear authors of HumanSD,

First of all, I would like to thank you for sharing your work. I have read both the paper and the code, and I found some parts that I cannot understand, so I would like to ask a few questions here.

Regarding the heatmap loss: Eq. 6 of the paper states that $W_a$ is a weight that gives higher priority to the loss in areas that are highly correlated with the input condition. From Figure 2, my understanding was that the heatmap is used as a simple multiplication or a simple mask. However, after checking the code, it seems that the obtained heatmap is not directly used as a simple mask. Instead, the heatmap is first passed through the VAE encoder, as shown here: https://github.com/IDEA-Research/HumanSD/blob/c5db29dd66a3e40afa8b4bed630f0aa7ea001880/ldm/models/diffusion/ddpm.py#L2011, and the resulting embedding is then used to weight the loss here: https://github.com/IDEA-Research/HumanSD/blob/c5db29dd66a3e40afa8b4bed630f0aa7ea001880/ldm/models/diffusion/ddpm.py#L2026

My questions are the following:

1. Why is it necessary to pass the obtained heatmap through the VAE encoder instead of using it directly as a mask?
2. Why do you need the `1 +` term in

```python
loss_simple = torch.mul(
    self.get_loss(model_output, target, mean=False),
    (1 + self.pose_loss_weight * back_to_embed_pose_add_weight),
).mean([1, 2, 3])
```
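To make my second question concrete, here is a minimal sketch of how I currently understand the weighting. The shapes, the toy `weight` map, and the value of `pose_loss_weight` are all hypothetical stand-ins, not the actual values from your code:

```python
import torch

# Hypothetical shapes: batch of 2, 4 latent channels, an 8x8 latent grid.
# `weight` stands in for back_to_embed_pose_add_weight: large near pose
# keypoints, near zero elsewhere (here a toy binary map).
per_pixel_loss = torch.ones(2, 4, 8, 8)   # stand-in for the unreduced MSE
weight = torch.zeros(2, 4, 8, 8)
weight[..., 2:6, 2:6] = 1.0               # toy "pose region"
pose_loss_weight = 5.0                    # hypothetical hyperparameter

# Without the 1+, pixels outside the pose region contribute zero loss,
# so the background would receive no training signal at all:
masked_only = (per_pixel_loss * (pose_loss_weight * weight)).mean([1, 2, 3])

# With the 1+, every pixel keeps its base diffusion loss, and the pose
# region is additionally boosted on top of that:
weighted = (per_pixel_loss * (1 + pose_loss_weight * weight)).mean([1, 2, 3])
```

If my reading is right, the `1 +` keeps the ordinary diffusion loss active everywhere while only *boosting* the pose regions, rather than masking the background out entirely. Is that the intended motivation?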
I would really appreciate any guidance that helps me understand your work correctly. Thank you very much.