dnaq opened this issue 7 years ago
Hi @dnaq, thanks for your feedback. We haven't tried ms-celeb-1m or casia-webface yet, but I have since gained some more experience applying batch-hard to other datasets (mainly DukeMTMC and CelebA), and had a similar experience to what you describe, as those are significantly harder. The effect you observe is what we describe in Appendix F, the initial "difficult" phase. When the network is not able to leave that difficult phase, it means the task is too hard for the network. To fix that and make the network converge, you need to make the task easier in the beginning. You can achieve this in several ways:
1. Make the network more powerful. For example, on one dataset where the network often got stuck, changing from a ResNet to a ResNeXt architecture made the training very stable and smooth.
2. Make the task easier by starting with smaller batches. In a PK-batch of, for example, 4 persons with 2 instances each, it is unlikely that there will be difficult examples, so the mining effectively becomes semi-hard instead. You can then slowly grow the batch size over time. On a difficult dataset, I was able to train a network reliably by growing the batch size linearly from 24 to 432 over the course of training.
3. If the above two don't work or are not possible for you, another way to help the network overcome this initial phase of collapsing embeddings is to add a regularization loss on the distances between embeddings, such as 1/d, 1/d², or -d, or anything else that has the effect of countering collapse to a single point (see the sketch below). You can then turn this loss down or disable it once the network has gotten past the difficult phase. Note that normalizing the embeddings won't help here, as they can (and will) still collapse to a single point on the hypersphere.
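For illustration, a minimal sketch of such a regularizer in TF1-style Python, assuming a `(B, D)` embeddings tensor; the function name, the choice of the 1/d penalty, and the weighting scheme are illustrative, not code from this repo:

```python
import tensorflow as tf

def anti_collapse_loss(embeddings, eps=1e-6):
    """Penalize small pairwise distances (a 1/d term) to counter collapse
    of all embeddings to a single point. `embeddings` has shape (B, D)."""
    sq_norms = tf.reduce_sum(tf.square(embeddings), axis=1)
    # Pairwise squared Euclidean distances via ||a||^2 - 2 a.b + ||b||^2.
    sq_dists = (sq_norms[:, None]
                - 2.0 * tf.matmul(embeddings, embeddings, transpose_b=True)
                + sq_norms[None, :])
    dists = tf.sqrt(tf.maximum(sq_dists, eps))
    # Exclude the diagonal (distance of each embedding to itself).
    off_diag = 1.0 - tf.eye(tf.shape(embeddings)[0])
    return tf.reduce_sum(off_diag / dists) / tf.reduce_sum(off_diag)

# Hypothetical usage: anneal `reg_weight` towards 0 once training has left
# the difficult phase.
# total_loss = batch_hard_loss + reg_weight * anti_collapse_loss(embeddings)
```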
Finally, I want to mention that I believe I have found another simple-to-implement trick to make it work reliably on hard/unclean datasets, but I don't want to reveal it publicly yet as I might want to write a paper about it. Let me know if you are potentially interested in a collaboration, or, if you are not into publications, we can discuss privately.
But in the end, use whatever works for your use-case. If center-loss works great for you, use it :smile:
Hi, and thanks for your thorough reply. I'll try some of your suggestions when I have the time and see whether I get better results. Some notes on your suggestions below:
1. I tried using both ResNet and inception-resnet-v2, but I haven't tried ResNeXt. I also did an experiment using Xception, with the same results.
2. This seems interesting. I haven't tried adjusting the batch size yet, so I will do so when I get the opportunity.
3. I haven't tried adding a regularization loss, but I did add a classifier layer after the embeddings and mixed in the cross-entropy loss of that layer with different weights to see what would happen (see the sketch after this list). It kept the embeddings from collapsing, but the network never converged, and after removing the classifier layer the embeddings collapsed again.
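For reference, a rough sketch of what that auxiliary-classifier experiment could look like in TF1-style code; `ce_weight` and the layer name are made-up illustrations, not the actual code used:

```python
import tensorflow as tf

def triplet_plus_aux_classifier(embeddings, labels, num_classes,
                                triplet_loss, ce_weight=0.1):
    """Add a softmax classifier on top of the embeddings and mix its
    cross-entropy loss into the triplet loss with a tunable weight."""
    logits = tf.layers.dense(embeddings, num_classes, name='aux_classifier')
    ce = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits))
    return triplet_loss + ce_weight * ce
```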
Center-loss unfortunately isn't a good match for my use case; I just implemented it to sanity-check the rest of my implementation.
I would be very interested in being kept in the loop on your further work. I'm not into doing publications, but would be interested in continuing this discussion.
I'll send you an email so that you have my email address.
Another option, similar to your point 3: I've seen people successfully train using classification first and then fine-tune using triplets alone. I haven't tried it myself, but I've seen success stories with that combo.
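A minimal sketch of that two-stage schedule in a plain training loop; the `warmup_steps` split is an assumed hyperparameter, not something from this repo:

```python
def staged_loss(step, warmup_steps, cross_entropy_loss, triplet_loss):
    """Stage 1: train with a classification (cross-entropy) objective.
    Stage 2: drop the classifier and fine-tune with the triplet loss alone."""
    return cross_entropy_loss if step < warmup_steps else triplet_loss
```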
Closing this issue for now, because this is as much as I can help here. Continuing some of this via e-mail.
PS: I decided it would be good to keep issues open for visibility, and just tag them with "discussion" if it's something that can help others. Re-opening.
@ergysr Thanks for the recommendation. I might try that as well.
What would be an easy way to implement a dynamic batch_size in TensorFlow?
@filipetrocadoferreira that is off-topic for this issue. In general there are many ways, and a Google search will help you, but this specific code-base was never intended for that, so it might need many changes.
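For completeness, one possible sketch in TF1 style (not from this code-base): leave the batch dimension of the placeholders unspecified and grow the number of persons P per PK-batch over training, e.g. matching the 24→432 growth mentioned earlier with K=4; the image shape and schedule constants are assumptions:

```python
import tensorflow as tf

# Input placeholders with an unspecified batch dimension, so batches of
# different sizes can be fed at every step.
images = tf.placeholder(tf.float32, [None, 256, 128, 3])
pids = tf.placeholder(tf.int32, [None])

def pk_schedule(step, k=4, p_start=6, p_end=108, growth_steps=20000):
    """Linearly grow P (persons per batch) so the batch size goes from
    p_start*k (24) to p_end*k (432) over `growth_steps` steps."""
    frac = min(step / float(growth_steps), 1.0)
    return int(round(p_start + frac * (p_end - p_start))), k
```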
@lucasb-eyer:
I haven't seen a paper regarding "another simple-to-implement trick to make it work reliably on hard/unclean datasets" in your list of publications, though I am very interested in a method like that. Are you planning on publishing it soon?
If not, can we talk offline?
@spott You might want to check Section 3.1 in this paper: https://arxiv.org/pdf/1803.10859.pdf
Hi @spott, the trick in @ergysr's paper is good. Mine is very similar, but I don't think I'm gonna publish it anytime soon, so here we go:
Instead of taking the max/min as the positive/negative for an anchor, compute the softmax over the distances (exactly as in Ergys's paper) and use that as a distribution from which to sample the positive/negative elements to use. This way, early in training, one doesn't always take the hard ones, and as training goes on, it converges towards batch-hard. In general, with this trick I've been able to consistently get convergence even on the hardest tasks. It reaches about the same performance as batch-hard, sometimes a little worse, sometimes a little better. The code is a minor two-line change from the current code here; I might make it public at some point.
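A sketch of what this could look like, roughly in the style of the TF1 loss code here; the mask and distance-matrix names are assumptions, and the actual two-line change in the repo may differ:

```python
import tensorflow as tf

def sampled_hard(dists, positive_mask, negative_mask):
    """Instead of the hardest positive (max) / negative (min) per anchor,
    sample them from a softmax over the distances. Early in training the
    distances are all similar, so easy pairs get picked too; as training
    separates the distances, the sampling concentrates on the hard ones.

    `dists` is the (B, B) pairwise distance matrix; the masks are boolean
    (B, B) tensors marking valid positives/negatives per anchor row."""
    neg_inf = tf.fill(tf.shape(dists), -1e9)  # masked entries get ~0 probability
    pos_logits = tf.where(positive_mask, dists, neg_inf)   # far positives are "harder"
    neg_logits = tf.where(negative_mask, -dists, neg_inf)  # close negatives are "harder"
    pos_idx = tf.multinomial(pos_logits, 1)[:, 0]
    neg_idx = tf.multinomial(neg_logits, 1)[:, 0]
    rows = tf.cast(tf.range(tf.shape(dists)[0]), tf.int64)
    sampled_positive = tf.gather_nd(dists, tf.stack([rows, pos_idx], axis=1))
    sampled_negative = tf.gather_nd(dists, tf.stack([rows, neg_idx], axis=1))
    return sampled_positive, sampled_negative

# Usage sketch with the soft-margin formulation:
# loss = tf.nn.softplus(sampled_positive - sampled_negative)
```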
I'll add it to my thesis, so if you (or someone else reading this) use it, I would be happy about a citation of my thesis once it's out :smile: Actually, one of the very first triplet papers already did something similar, but that paper seems to have been forgotten.
PS: nice paper, @ergysr!
I've been doing some experiments with your batch-hard triplet loss function and different architectures/datasets. On MARS I manage to reproduce the results from your paper (the network seems to converge), but with many other datasets I get stuck at a loss of ~0.6931, which is softplus(0). Looking at the embeddings, it seems like the network starts to yield the same embedding for all classes.
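For reference, a quick check that this plateau is exactly what fully collapsed embeddings produce under the soft-margin (softplus) formulation:

```python
import numpy as np

# If all embeddings are identical, every pairwise distance is 0, so per anchor
# d(anchor, hardest_pos) - d(anchor, hardest_neg) = 0 and the soft-margin loss
# is softplus(0) = ln(2).
print(np.log(1.0 + np.exp(0.0)))  # 0.6931471805599453
```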
Worth knowing is that a center-loss formulation works quite well for generating usable embeddings for these datasets; I've tried with ms-celeb-1m (after cleaning it up) and with casia-webface.
My interpretation of these results is that the batch-hard triplet loss function is really sensitive to mislabeled datasets and might get stuck in a local minimum if the dataset contains mislabeled images. I've tried some hyperparameter tuning (e.g. changing the learning rate and the optimizer), but I haven't managed to avoid the local minimum.
Have you seen similar results in your work when experimenting with different datasets?