How is the siamese architecture implemented?

Thanks for this great work!

I have been studying the original paper and the source code, and I have a conceptual question about the siamese setup.

To my understanding, a typical siamese network generates a feature vector for each of two input images using one network with shared weights, and then constrains the distance between the two feature vectors by applying a "siamese loss". FaceNet follows a similar idea, since its loss also operates on distances between embeddings.
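To make the setup I have in mind concrete, here is a minimal NumPy sketch of a distance-based siamese loss. The encoder `embed`, the weight matrix `W`, the dimensions, and the margin are all hypothetical and chosen only for illustration; the loss is the standard contrastive form, not something taken from this repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared encoder weights: the same W is used for both inputs.
W = rng.standard_normal((64, 16))

def embed(x):
    """Map a flattened image (64 values) to a 16-d feature vector."""
    return np.tanh(x @ W)

def contrastive_loss(x1, x2, same, margin=1.0):
    """Pull matching pairs together, push non-matching pairs past the margin."""
    d = np.linalg.norm(embed(x1) - embed(x2))  # explicit distance term
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2

x1 = rng.standard_normal(64)
x2 = rng.standard_normal(64)

loss_same = contrastive_loss(x1, x1, same=True)   # identical inputs, distance 0
loss_diff = contrastive_loss(x1, x2, same=False)
```

The key point is that the loss is a function of the distance `d` between the two embeddings, which is the kind of term I was looking for.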

However, I did not see any distance-like constraint in SD-GAN. Instead, according to the paper: "To adapt DCGAN, we stack the feature maps De(x1) and De(x2) along the channel axis, applying one additional strided convolution. This allows the network to further aggregate information from 2 images..."
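As far as I can tell, the quoted operation amounts to something like the NumPy sketch below. The feature-map shapes, kernel size, stride, and the helper `strided_conv2d` are my own assumptions for illustration (the quote does not specify them), and `feat1` / `feat2` are random stand-ins for De(x1) and De(x2).

```python
import numpy as np

rng = np.random.default_rng(0)

def strided_conv2d(x, kernel, stride=2):
    """Unpadded strided convolution (cross-correlation).

    x: (H, W, C_in), kernel: (k, k, C_in, C_out) -> (H_out, W_out, C_out)
    """
    k = kernel.shape[0]
    h_out = (x.shape[0] - k) // stride + 1
    w_out = (x.shape[1] - k) // stride + 1
    out = np.zeros((h_out, w_out, kernel.shape[3]))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            # Sum over the spatial window and all input channels.
            out[i, j] = np.tensordot(patch, kernel, axes=([0, 1, 2], [0, 1, 2]))
    return out

feat1 = rng.standard_normal((4, 4, 8))  # stand-in for De(x1)
feat2 = rng.standard_normal((4, 4, 8))  # stand-in for De(x2)

# Stack along the channel axis: (4, 4, 8) + (4, 4, 8) -> (4, 4, 16).
stacked = np.concatenate([feat1, feat2], axis=-1)

# One extra strided convolution mixes the channels of both images.
kernel = rng.standard_normal((2, 2, 16, 32))
joint = strided_conv2d(stacked, kernel, stride=2)  # shape (2, 2, 32)
```

Every output channel of `joint` depends on both `feat1` and `feat2`, but nothing here computes or penalizes a distance between them.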

When I first saw the word "Siamese" in the paper, I was expecting some explicit constraint, e.g. an L2 loss between De(x1) and De(x2). However, the actual implementation looks to me like it just concatenates the two feature maps and processes them together.

I think this is great work; I just need some clarification from you on a few concepts and implementation details. Please correct me if any of my understanding is wrong.
