In your paper, the Perception-Refined Decoder uses the source image encoder, so I expected the appearance encoder's features to be fed to it. But in your code, the decoder takes `down_block_additional_residuals`, which comes from the pose encoder. Why is that?
```python
def forward(self, batched_inputs):
    mask = batched_inputs["mask"] if "mask" in batched_inputs else None
    x, features = self.backbone(batched_inputs["img_cond"], mask=mask)
    up_block_additional_residuals = self.appearance_encoder(features)
    bsz = x.shape[0]
    if self.training:
        bsz = bsz * 2
    down_block_additional_residuals = self.pose_encoder(torch.cat([batched_inputs["pose_img_src"], batched_inputs["pose_img_tgt"]]))
    up_block_additional_residuals = {k: torch.cat([v, v]) for k, v in up_block_additional_residuals.items()}
    # why does self.decoder use the pose_encoder output?
    c = self.decoder(x, features, down_block_additional_residuals)
```
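To make my confusion concrete, here is a minimal toy sketch (not your code; all module and argument names are my own assumptions) of how I understand the two residual streams in a UNet-style decoder: pose features added on the down path via `down_block_additional_residuals`, appearance features injected on the up path via `up_block_additional_residuals`:

```python
import torch
from torch import nn

class ToyDecoder(nn.Module):
    """Illustrative sketch only: a two-stage decoder that accepts
    additive residuals on both its down path and its up path."""

    def __init__(self, dim=8):
        super().__init__()
        self.down = nn.Conv2d(dim, dim, 3, padding=1)
        self.up = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, x, down_res=None, up_res=None):
        h = self.down(x)
        if down_res is not None:
            h = h + down_res  # pose guidance enters on the down path (my assumption)
        h = self.up(h)
        if up_res is not None:
            h = h + up_res    # appearance guidance enters on the up path (my assumption)
        return h

x = torch.randn(1, 8, 16, 16)
dec = ToyDecoder()
out = dec(x, down_res=torch.zeros_like(x), up_res=torch.zeros_like(x))
print(out.shape)  # torch.Size([1, 8, 16, 16])
```

If this picture is right, I would have expected the appearance-encoder residuals (the up-path ones) to be what the Perception-Refined Decoder consumes, hence my question.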