YanzuoLu / CFLD

[CVPR 2024 Highlight] Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis
MIT License

about build_pose_img Function's Output #11

Closed ButoneDream closed 7 months ago

ButoneDream commented 7 months ago

In your code:

    def build_pose_img(self, img_path):
        # Look up the keypoint annotation for this image by filename.
        string = self.annotation_file.loc[os.path.basename(img_path)]
        array = load_pose_cords_from_strings(string['keypoints_y'], string['keypoints_x'])
        # Dense per-keypoint heatmaps (18 channels), HWC -> CHW.
        pose_map = torch.tensor(cords_to_map(array, tuple(self.pose_img_size), (256, 176)).transpose(2, 0, 1), dtype=torch.float32)
        # RGB skeleton rendering (3 channels), scaled to [0, 1].
        pose_img = torch.tensor(draw_pose_from_cords(array, tuple(self.pose_img_size), (256, 176)).transpose(2, 0, 1) / 255., dtype=torch.float32)
        # Stack along the channel dimension: 3 + 18 = 21 channels.
        pose_img = torch.cat([pose_img, pose_map], dim=0)
        return pose_img

I am curious about the design choice in the build_pose_img function, where it concatenates pose_img and pose_map into a tensor with 21 channels. My initial expectation was that the function would return pose_img directly, with only 3 channels. I would like to understand the rationale behind using 21 channels instead.

What is the purpose of concatenating pose_img with pose_map, and how does it benefit the overall model or application?
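For reference, here is a minimal sketch of where the 21 channels come from: 3 channels from the RGB skeleton rendering plus 18 channels from per-keypoint heatmaps (assuming the 18 OpenPose/COCO-style joints used by DeepFashion annotations; the Gaussian heatmap helper below is an illustrative stand-in for `cords_to_map`, not the repository's actual implementation):

```python
import numpy as np
import torch

def gaussian_heatmaps(coords, size=(256, 256), sigma=6.0):
    """One Gaussian heatmap per keypoint; missing joints (coord < 0) stay zero."""
    h, w = size
    maps = np.zeros((len(coords), h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for i, (y, x) in enumerate(coords):
        if y < 0 or x < 0:  # convention: negative coordinate = keypoint not visible
            continue
        maps[i] = np.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2 * sigma ** 2))
    return maps

# 18 keypoints -> 18 heatmap channels
coords = np.random.randint(0, 256, size=(18, 2))
pose_map = torch.from_numpy(gaussian_heatmaps(coords))  # shape (18, 256, 256)
pose_img = torch.rand(3, 256, 256)                      # stand-in RGB skeleton render
pose = torch.cat([pose_img, pose_map], dim=0)           # shape (21, 256, 256)
print(pose.shape)  # torch.Size([21, 256, 256])
```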

Another question: what is the difference between these two images (img_src and img_cond)? Which one is used for training?

return_dict = {
            "img_src": img_src,
            "img_tgt": img_tgt,
            "img_cond": img_cond,
            "pose_img_src": pose_img_src,
            "pose_img_tgt": pose_img_tgt
        }
YanzuoLu commented 7 months ago

This can be seen as a simple trick to strengthen the pose information. I don't think it makes much of a difference; it just eases the difficulty of learning pose patterns. It is also a very commonly used strategy, and we are not the first to do it. The previous work PIDM adopts it as well; see https://github.com/ankanbhunia/PIDM/issues/6. Thanks for your attention to our work : )
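As a concrete consequence of this trick, the first layer of whatever module consumes the pose tensor simply takes 21 input channels instead of 3. A minimal sketch (the layer name and sizes here are hypothetical, not taken from the CFLD code):

```python
import torch
import torch.nn as nn

# Because the dense heatmaps are stacked with the RGB skeleton,
# the pose-encoder stem sees 21 input channels rather than 3.
stem = nn.Conv2d(in_channels=21, out_channels=64, kernel_size=3, padding=1)

pose = torch.rand(1, 21, 256, 256)  # batch of one concatenated pose tensor
out = stem(pose)
print(out.shape)  # torch.Size([1, 64, 256, 256])
```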