chrisdonahue / sdgan

Official implementation of "Semantically Decomposing the Latent Spaces of Generative Adversarial Networks"
MIT License

How is the siamese architecture implemented? #3

Closed by jmgrn61 5 years ago

jmgrn61 commented 5 years ago

Thanks for this great work!

I have been studying the original paper and the source code, and I have a conceptual question about the siamese setup.

To my understanding, a typical siamese network uses shared weights to generate a feature vector for each of two input images, and then penalizes the distance between the two feature vectors by applying a "siamese loss". FaceNet follows a similar idea, since it also works with such a distance.
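For reference, the pairwise "siamese loss" described above is often written as a contrastive loss. The following is only a minimal sketch in plain Python under that assumption; the function names and the `margin` default are hypothetical and are not taken from sdgan or FaceNet (FaceNet itself uses a triplet loss rather than a pairwise one).

```python
import math


def euclidean_distance(a, b):
    """L2 distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(feat1, feat2, same_identity, margin=1.0):
    """Pairwise loss on the two embeddings produced by the shared network.

    Matching pairs are pulled together; non-matching pairs are pushed
    at least `margin` apart (hypothetical default, for illustration).
    """
    d = euclidean_distance(feat1, feat2)
    if same_identity:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

A matching pair is penalized in proportion to its squared distance, while a non-matching pair contributes nothing once it is farther apart than the margin.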

However, I did not see any distance-like constraints in sdgan. Instead, according to the paper: "To adapt DCGAN, we stack the feature maps De(x1) and De(x2) along the channel axis, applying one additional strided convolution. This allows the network to further aggregate information from 2 images..."

When I first saw the word "Siamese" in the paper, I was expecting some explicit constraint, e.g. an L2 loss between De(x1) and De(x2). However, the real implementation looks like "just putting everything together" to me.

I think this is great work; I just need some clarification from you on a few concepts and implementation details. Please correct me if any of my understanding is wrong.

chrisdonahue commented 5 years ago

Thanks for your interest. Your understanding is correct; there is no explicit distance metric enforced. The "identity" feature vector is fixed and shared between both images in the generator (so the distance is always 0), and the discriminator outputs a single scalar.
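To make the "distance is always 0" point concrete, here is a minimal sketch of how a pair of generator inputs could share one identity vector. The function name `sample_pair_latents`, the dimensions, and the uniform range are assumptions for illustration and are not copied from the repository.

```python
import random


def sample_pair_latents(d_identity=50, d_observation=50, seed=None):
    """Return two latent codes that share one identity part.

    Each code is [identity part | observation part]. The identity part is
    sampled once and reused, so the two codes differ only in the
    observation part. Dimensions and range are hypothetical defaults.
    """
    rng = random.Random(seed)
    z_identity = [rng.uniform(-1.0, 1.0) for _ in range(d_identity)]
    z_obs1 = [rng.uniform(-1.0, 1.0) for _ in range(d_observation)]
    z_obs2 = [rng.uniform(-1.0, 1.0) for _ in range(d_observation)]
    return z_identity + z_obs1, z_identity + z_obs2


z1, z2 = sample_pair_latents(seed=0)
# Squared distance between the identity parts of the two codes.
identity_gap = sum((a - b) ** 2 for a, b in zip(z1[:50], z2[:50]))
print(identity_gap)  # 0.0, since both codes reuse the same identity vector
```

Because the identity part is identical by construction, there is nothing for an explicit distance penalty to act on in the generator.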

In the case of the SD-DCGAN model (Fig 2b in the paper), this scalar can be thought of as the discriminator's encoding of both 1) a distance metric of facial similarity, and 2) whether the images are real or fake. The discriminator's weights are tied (shared across the two images) up until the final layers, as in a standard Siamese network.
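A toy sketch of this structure follows, assuming small dense layers in place of the DCGAN convolutions and list concatenation in place of stacking feature maps along the channel axis. The class `ToyPairDiscriminator` and its helpers are hypothetical and only illustrate the weight tying; they are not the repository's code.

```python
import random


def make_weights(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def leaky_relu(x, slope=0.2):
    return [v if v > 0 else slope * v for v in x]


class ToyPairDiscriminator:
    """Shared encoder for both inputs, then one joint head giving a scalar."""

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        # One set of encoder weights, applied to both inputs (the tied part).
        self.encoder_w = make_weights(feat_dim, in_dim, rng)
        # The head sees both feature vectors at once (the untied final part).
        self.head_w = make_weights(1, 2 * feat_dim, rng)

    def encode(self, x):
        return leaky_relu(linear(self.encoder_w, x))

    def __call__(self, x1, x2):
        # Concatenation stands in for stacking along the channel axis.
        joint = self.encode(x1) + self.encode(x2)
        return linear(self.head_w, joint)[0]


disc = ToyPairDiscriminator(in_dim=4, feat_dim=3)
score = disc([0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1])
```

No loss term compares the two encodings directly; any notion of similarity has to be learned by the joint head from the concatenated features.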

We do use distance metrics generated by FaceNet in our evaluation methodology.