bnu-wangxun opened this issue 7 years ago
Hi, thanks for the interesting questions, and also thanks for letting us know that you managed to reproduce the results; that's always good news!
Yes, you can see in Figure 6 of the supplementary material that we have the same experience with the hard margin: the training loss is often zero, and the number of active triplets in the batch is also often zero. The network is "done".
We did not have the time to explicitly search for ways of making training even harder later on, but it is a good idea to investigate and could potentially improve the score even more. I can think of two easy ways to do this: (a) change the batch composition to make really hard samples more likely, by either increasing the batch size if memory allows it, or keeping the batch size constant but reducing P and increasing K, or the other way around; (b) add more or stronger augmentation later in training. IIRC, we only did flip and crop, but you could add squeeze, rotate, color/gamma noise, etc.
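To make option (a) concrete, here is a minimal NumPy sketch of PK batch composition; the helper name and the P/K values are illustrative, not the authors' actual settings. Intuitively, a smaller P with a larger K gives more same-identity pairs per batch (harder positives become more likely), while a larger P gives more candidate negatives.

```python
import numpy as np

def sample_pk_batch(images_by_id, P=32, K=4, rng=None):
    """Draw a PK batch: P identities, K images each (hypothetical helper)."""
    rng = rng or np.random.default_rng()
    chosen_ids = rng.choice(list(images_by_id), size=P, replace=False)
    batch, labels = [], []
    for pid in chosen_ids:
        imgs = images_by_id[pid]
        # Sample with replacement if an identity has fewer than K images.
        picks = rng.choice(imgs, size=K, replace=len(imgs) < K)
        batch.extend(picks)
        labels.extend([pid] * K)
    return batch, labels

# "Reducing P and increasing K" at constant batch size could mean, e.g.,
# switching from (P=32, K=4) to (P=16, K=8) later in training.
```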
Indeed, we also had better performance without that. We had some early experiments where adding normalization worked well, but overall we do not have a good understanding of when exactly it works and when it does not. Intuitively, normalizing makes more sense when using the squared Euclidean distance (in that case it is, up to a factor of two, the cosine distance between the unnormalized vectors), whereas not normalizing makes more sense when using the raw Euclidean distance (since the "units"/scale of distances and vectors stay the same). But I performed some large-scale experiments after the paper and found no combination of norm/no-norm and squared/non-squared Euclidean that consistently worked or was consistently best.
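To make that squared-Euclidean/cosine connection concrete, here is a quick NumPy check (a sketch; the vectors are just random examples): the squared Euclidean distance between L2-normalized vectors equals 2 * (1 - cosine similarity) of the original vectors, so it preserves the same ordering as cosine distance.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=128), rng.normal(size=128)

# Squared Euclidean distance after L2 normalization ...
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
sq_euclidean_normed = np.sum((an - bn) ** 2)

# ... equals 2 * (1 - cosine similarity) of the unnormalized vectors.
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(sq_euclidean_normed, 2 * (1 - cosine_sim))  # the two numbers match
```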
Hello, thank you for your discussion. I have questions about the margin and the embeddings. If you did not normalize the embeddings, how did you determine the margin? I tried not normalizing the embeddings (in my VGG-net), but the distances are too big and obviously I could not select a hard margin like 0.2.
As you can see in the paper, we used a custom train/val split of the MARS training set to determine the margin, although in the end we used the soft margin, which doesn't have a parameter.
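For readers wondering what the parameter-free soft margin looks like: the paper replaces the hinge [margin + d_ap - d_an]_+ with the softplus ln(1 + exp(d_ap - d_an)). A minimal NumPy sketch (function names are mine):

```python
import numpy as np

def hard_margin_triplet(d_ap, d_an, margin=0.2):
    """Classic hinge: becomes zero (inactive) once d_an > d_ap + margin."""
    return np.maximum(margin + d_ap - d_an, 0.0)

def soft_margin_triplet(d_ap, d_an):
    """Softplus version: ln(1 + exp(d_ap - d_an)); decays smoothly,
    never reaches exactly zero, and has no margin parameter to tune."""
    return np.log1p(np.exp(d_ap - d_an))

print(hard_margin_triplet(1.0, 1.5), soft_margin_triplet(1.0, 1.5))
```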
If you're using a reasonable pre-trained backend and initialize the (new) final embedding layer reasonably, the distances should not be too large at the start. In fact, the distances should approach sqrt(2D), where D is the embedding size, IIRC. It is a little difficult to see, but in the appendix of our paper you can see that the distances indeed start around 15 ≈ sqrt(2*128), but quickly drop at the start of training.
Really, without l2-norm, the margin should only roughly dictate the scale of the learned space.
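As a quick sanity check of that sqrt(2D) figure, assuming freshly-initialized embedding coordinates behave roughly like independent unit-variance noise (an assumption for this sketch, not the exact initialization used in the paper):

```python
import numpy as np

D = 128
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, D))     # "untrained" embeddings, unit variance per dim

# E||a - b||^2 = 2D for independent unit-variance coordinates, so pairwise
# distances concentrate around sqrt(2D) = 16 for D = 128.
a, b = emb[:500], emb[500:]
dists = np.linalg.norm(a - b, axis=1)
print(dists.mean(), np.sqrt(2 * D))  # both close to 16
```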
@bnulihaixia Hello, could you share your Caffe code with me? I am a beginner in deep learning and have no idea how to implement it. Thanks a lot.
@HuaZheLei Sorry, this work was done during my internship, so the code is not in my hands. If you want to implement it in Caffe, you can refer to the lifted-structure repository [https://github.com/rksltnl/Deep-Metric-Learning-CVPR16]; my Caffe implementation is modified from it.
@bnulihaixia Thanks for the help. I will have a look.
How did you do the hard triplet mining? How do you get access to the net weights while training?
I have replicated your paper, << In Defense of the Triplet Loss >>, on the Market-1501 dataset in Caffe. It has good performance, just as your paper said.

As far as I can see, the batch-hard loss without the softplus function is mostly 0 in the last iterations. So I want to ask: have you tried any other type of hard mining (as in your discussion section, "Notes on network training")? If you have, I would like to hear more detail about how those experiments performed.

Secondly, I also tried adding an L2-norm layer for the embedding; the training is not stable and the result is very poor. I read your explanation about that, but I don't think it explains this phenomenon, because as far as I know, some other metric-learning losses perform well with L2 normalization, for example << DarkRank: Accelerating Deep Metric Learning via Cross Sample Similarities Transfer >>. I want to ask whether you have any deeper thoughts on this phenomenon.
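For reference (relating to the hard-mining questions above), the paper's batch-hard mining picks, for each anchor in a PK batch, the farthest same-identity sample and the closest different-identity sample within that batch. A minimal NumPy sketch with an optional soft margin; function and variable names are mine, and it assumes each identity appears at least twice in the batch (PK sampling with K >= 2):

```python
import numpy as np

def batch_hard_triplet_loss(dist, labels, margin=0.2, soft=False):
    """dist: (N, N) pairwise embedding distances; labels: (N,) identity ids."""
    labels = np.asarray(labels)
    same_id = labels[:, None] == labels[None, :]
    eye = np.eye(len(labels), dtype=bool)

    # Hardest positive: farthest same-identity sample (excluding the anchor itself).
    d_ap = np.where(same_id & ~eye, dist, -np.inf).max(axis=1)
    # Hardest negative: closest different-identity sample.
    d_an = np.where(~same_id, dist, np.inf).min(axis=1)

    if soft:
        return np.log1p(np.exp(d_ap - d_an)).mean()      # parameter-free soft margin
    return np.maximum(margin + d_ap - d_an, 0.0).mean()  # hard margin hinge
```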