Closed: zhaobozb closed this issue 3 years ago
You are correct: during training we compare the target image Y with the output pSp(X). It may seem odd to use L2 here, but it is not harmful for multi-modal synthesis. Multi-modal synthesis happens only at inference time: we simply swap out the fine-layer latents and obtain a similar-looking image with a different color scheme. Since we can do this with any randomly drawn sample, we can generate endless different outputs for any input sketch/segmentation map. In other words, the L2 loss during training is really there to get the facial structure right (smile, eyes, hair), and the style mixing during inference is what produces the color variations. Let me know if this answers your question :smile:
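In case it helps, here is a minimal sketch of the inference-time style mixing described above. It assumes a trained pSp-style model `net` with an `encoder` (image to W+ latents) and a StyleGAN2 `decoder`, plus an average latent `latent_avg`; all names and the layer cutoff are illustrative, not the exact repo API.

```python
import torch

def multi_modal_inference(net, x, latent_avg, mix_from=8, n_styles=18, truncation=0.7):
    """Generate one variant of the input sketch/segmentation map `x` by
    swapping the fine-layer latents (layers >= mix_from) with a random sample."""
    with torch.no_grad():
        # Encode the input into W+ (one latent per decoder layer).
        w_plus = net.encoder(x)                       # [B, n_styles, 512]

        # Draw a random z, map it to w, and broadcast it over all layers.
        z = torch.randn(x.size(0), 512, device=x.device)
        w_rand = net.decoder.style(z)                 # mapping network output, [B, 512]
        w_rand = latent_avg + truncation * (w_rand - latent_avg)
        w_rand = w_rand.unsqueeze(1).repeat(1, n_styles, 1)

        # Keep the coarse/medium layers (structure) from the encoder,
        # replace the fine layers (color/texture) with the random sample.
        mixed = w_plus.clone()
        mixed[:, mix_from:, :] = w_rand[:, mix_from:, :]

        # Decode the mixed W+ code into an image.
        images, _ = net.decoder([mixed], input_is_latent=True, randomize_noise=False)
    return images
```

Re-running this with a new random `z` gives another output with the same structure but a different color scheme.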
Hi Yuval, thank you very much for your response. So the multi-modal synthesis mainly produces color variations, such as skin color, hair color, and background color?
Exactly.
Thank you very much for the clarification. By the way, I noticed that the LPIPS loss can be negative during training. Is that normal?
Hmm, it seems weird that the LPIPS loss is negative. I don't remember seeing anything like this in my experiments. Do you have any logs showing this happening?
Never mind, I encountered this negative LPIPS loss on my own dataset. I will investigate it. Thank you very much.
For the conditional image synthesis task, do you use the input condition image X and the generated image pSp(X) to compute the L2 and LPIPS losses, as shown in Equations (1) and (2)? According to the code, it seems you use the target image Y and the generated image pSp(X) to compute the losses. If so, won't the L2 loss be harmful to multi-modal synthesis?
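For reference, this is roughly the loss computation being asked about: both terms compare the target image Y with the output pSp(X). A minimal sketch using the pip `lpips` package (the repo ships its own LPIPS wrapper, and the lambda values here are illustrative):

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net='alex')  # expects images scaled to [-1, 1]

def reconstruction_losses(y_hat, y, l2_lambda=1.0, lpips_lambda=0.8):
    """L2 (Eq. 1) and LPIPS (Eq. 2) between the target image `y` and the
    generated image `y_hat` = pSp(x)."""
    loss_l2 = F.mse_loss(y_hat, y)
    loss_lpips = lpips_fn(y_hat, y).mean()
    return l2_lambda * loss_l2 + lpips_lambda * loss_lpips
```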