They are the same image. To preserve the spatial structure shared by the source and target images, the upper part of this figure is generated by replacing the self-attention maps in layers 4-14.
In the upper part of Figure 4, "No replacement" means that the cross-attention maps from the source image are not injected; the self-attention maps are still replaced in order to preserve the spatial structure. In the lower part of Figure 4, "No replacement" = "Direct Generation".
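To make the distinction concrete, here is a minimal sketch of how the settings in the upper and lower parts of Figure 4 could differ. It is only an illustration of the description above, not the repository's code: `choose_attention`, `SELF_REPLACE_LAYERS`, and the two flag names are hypothetical, and applying cross-attention replacement to every layer is an assumption, since the layer range for it is not stated here.

```python
# Illustrative sketch only: all names below are hypothetical.
SELF_REPLACE_LAYERS = range(4, 15)  # layers 4-14 inclusive, as described above


def choose_attention(kind, layer, source_map, target_map,
                     replace_self, replace_cross):
    """Return the attention map the target branch would use at this layer."""
    if kind == "self" and replace_self and layer in SELF_REPLACE_LAYERS:
        # Inject the source self-attention map to keep the spatial structure.
        return source_map
    if kind == "cross" and replace_cross:
        # Assumption: cross-attention replacement is applied at every layer.
        return source_map
    # Otherwise the target branch keeps its own attention map.
    return target_map


# Upper part of Figure 4, "No replacement":
# self-attention is replaced, cross-attention is not.
upper_no_replacement = dict(replace_self=True, replace_cross=False)

# Lower part of Figure 4, "No replacement" = "Direct Generation":
# nothing is replaced.
direct_generation = dict(replace_self=False, replace_cross=False)
```

Under these assumptions, the only difference between the two settings is whether the source self-attention maps are injected in layers 4-14.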
I've two questions:
Two subfigures (marked with blue boxes) in Figure 4 seem exactly identical. Why does this happen?
What is the algorithmic difference between "No replacement" and "Direct Generation"?
Thanks