YanzuoLu / CFLD

[CVPR 2024 Highlight] Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis
MIT License

about build_pose_img Function's Output #11

Closed ButoneDream closed 7 months ago

ButoneDream commented 7 months ago

In your code:

    def build_pose_img(self, img_path):
        # Look up the keypoint annotation for this image by filename.
        string = self.annotation_file.loc[os.path.basename(img_path)]
        array = load_pose_cords_from_strings(string['keypoints_y'], string['keypoints_x'])
        # Dense per-keypoint heatmaps (18 channels), HWC -> CHW.
        pose_map = torch.tensor(cords_to_map(array, tuple(self.pose_img_size), (256, 176)).transpose(2, 0, 1), dtype=torch.float32)
        # RGB skeleton rendering (3 channels), scaled to [0, 1].
        pose_img = torch.tensor(draw_pose_from_cords(array, tuple(self.pose_img_size), (256, 176)).transpose(2, 0, 1) / 255., dtype=torch.float32)
        # Stack along the channel dimension: 3 + 18 = 21 channels.
        pose_img = torch.cat([pose_img, pose_map], dim=0)
        return pose_img

I am curious about the design choice in the build_pose_img function, where it concatenates pose_img and pose_map into a tensor with 21 channels. My initial expectation was that the function would return pose_img directly, with only 3 channels. I would like to understand the rationale behind using 21 channels instead.

What is the purpose of concatenating pose_img with pose_map, and how does it benefit the overall model or application?
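For reference, here is a minimal sketch of where the 21 channels come from: 3 channels from the RGB skeleton rendering plus 18 channels from per-keypoint heatmaps (assuming the 18 OpenPose/COCO-style joints used by DeepFashion annotations; the Gaussian heatmap helper below is an illustrative stand-in for `cords_to_map`, not the repository's actual implementation):

```python
import numpy as np
import torch

def gaussian_heatmaps(coords, size=(256, 256), sigma=6.0):
    """One Gaussian heatmap per keypoint; missing joints (coord < 0) stay zero."""
    h, w = size
    maps = np.zeros((len(coords), h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for i, (y, x) in enumerate(coords):
        if y < 0 or x < 0:  # convention: negative coordinate = keypoint not visible
            continue
        maps[i] = np.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2 * sigma ** 2))
    return maps

# 18 keypoints -> 18 heatmap channels
coords = np.random.randint(0, 256, size=(18, 2))
pose_map = torch.from_numpy(gaussian_heatmaps(coords))  # shape (18, 256, 256)
pose_img = torch.rand(3, 256, 256)                      # stand-in RGB skeleton render
pose = torch.cat([pose_img, pose_map], dim=0)           # shape (21, 256, 256)
print(pose.shape)  # torch.Size([21, 256, 256])
```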

Another question: what is the difference between these two images (img_src and img_cond)? Which one is used for training?

return_dict = {
            "img_src": img_src,
            "img_tgt": img_tgt,
            "img_cond": img_cond,
            "pose_img_src": pose_img_src,
            "pose_img_tgt": pose_img_tgt
        }
YanzuoLu commented 7 months ago

This can be seen as a simple trick to strengthen the pose information. I don't think it makes much of a difference; it just eases the difficulty of learning pose patterns. It is also a very commonly used strategy, and we are not the first to do it. The previous work PIDM adopts it as well; see https://github.com/ankanbhunia/PIDM/issues/6. Thanks for your attention to our work : )
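As a concrete consequence of this trick, the first layer of whatever module consumes the pose tensor simply takes 21 input channels instead of 3. A minimal sketch (the layer name and sizes here are hypothetical, not taken from the CFLD code):

```python
import torch
import torch.nn as nn

# Because the dense heatmaps are stacked with the RGB skeleton,
# the pose-encoder stem sees 21 input channels rather than 3.
stem = nn.Conv2d(in_channels=21, out_channels=64, kernel_size=3, padding=1)

pose = torch.rand(1, 21, 256, 256)  # batch of one concatenated pose tensor
out = stem(pose)
print(out.shape)  # torch.Size([1, 64, 256, 256])
```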