VisualComputingInstitute / triplet-reid

Code for reproducing the results of our "In Defense of the Triplet Loss for Person Re-Identification" paper.
https://arxiv.org/abs/1703.07737
MIT License
764 stars · 216 forks

Question about final layer of mobilenet #61

Closed voqtuyen closed 6 years ago

voqtuyen commented 6 years ago

Thanks for providing the source code of the paper. I have a question regarding the final layer of MobileNet. What is the purpose of the reduce_mean operation here? https://github.com/VisualComputingInstitute/triplet-reid/blob/2760af1589f558f0f061855e72646a5c1dffe3db/nets/mobilenet_v1_1_224.py#L16 When I replace it with

endpoints['model_output'] = endpoints['global_pool'] = tf.reshape(endpoints['Conv2d_13_pointwise'], [-1, 8 * 4 * 1024])

the model accuracy seems to decrease a lot. Is it possible to use tf.layers.average_pooling2d instead, and if so, how?
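To make the shape difference between the two variants concrete, here is a minimal NumPy sketch. The 8×4×1024 feature-map shape is assumed from the reshape above (MobileNet on 256×128 person crops); `feats` stands in for the `Conv2d_13_pointwise` endpoint:

```python
import numpy as np

# Stand-in for the Conv2d_13_pointwise feature map: [batch, 8, 4, 1024]
# (batch size and spatial shape assumed for illustration).
feats = np.random.randn(2, 8, 4, 1024).astype(np.float32)

# What the repo's tf.reduce_mean(..., [1, 2]) does: average over the
# spatial axes, giving a 1024-d embedding per image.
pooled = feats.mean(axis=(1, 2))
print(pooled.shape)  # (2, 1024)

# What the proposed tf.reshape does instead: flatten everything into a
# 32768-d embedding that keeps spatial position encoded in the vector.
flat = feats.reshape(-1, 8 * 4 * 1024)
print(flat.shape)  # (2, 32768)
```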

Thanks

Pandoro commented 6 years ago

The reduce_mean results in a form of spatial invariance, which on Market-1501 is probably rather important.

Apart from that, what you do is create a HUGE dimensionality for your embedding. For one, you encode the spatial location of things in your embedding, which is probably not smart. Additionally, you make training a lot easier, because the network can move things around more freely in this huge space, probably resulting in a less general model. Those are just my intuitions, but I'm sure about the spatial location at least.

You could do it with average_pooling2d if you assume your images always have the same size and you use the full feature-map size as the pooling window. But I don't see any advantage over reduce_mean.
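A minimal sketch of that equivalence, assuming a fixed 8×4 feature map; `tf.nn.avg_pool2d` is used here in place of the older `tf.layers.average_pooling2d`, and the tensor is random stand-in data:

```python
import tensorflow as tf

# Stand-in feature map: [batch, 8, 4, 1024] (shape assumed for illustration).
feats = tf.random.normal([2, 8, 4, 1024])

# Global average via reduce_mean, as the repo does.
a = tf.reduce_mean(feats, axis=[1, 2])

# Average pooling with the full spatial extent as the window; this only
# works if every input produces the same fixed 8x4 feature-map size.
b = tf.nn.avg_pool2d(feats, ksize=[8, 4], strides=[8, 4], padding='VALID')
b = tf.reshape(b, [-1, 1024])

# Both paths yield the same 1024-d embedding (up to float rounding).
print(float(tf.reduce_max(tf.abs(a - b))))
```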

I'm closing this for now. If you have further questions feel free to re-open.