NickleDave opened 4 years ago
Here's a PyTorch-specific implementation of WARP for neural nets: https://medium.com/@gabrieltseng/intro-to-warp-loss-automatic-differentiation-and-pytorch-b6aa5083187a
In terms of the paper, it seems valid and useful to point out that it's not known what it means to "task optimize" when the task involves "detecting a target among multiple objects", i.e., visual search, and so we should investigate multiple types of "task optimizing".
at least one paper finds that training with just single labels using cross-entropy can be surprisingly competitive for Pascal VOC, which is specifically the dataset we are using; see #41. This is the paper: https://arxiv.org/pdf/1612.03663.pdf
at least one paper has extended this to deep nets: they develop "smooth loss functions for deep top-k classification" (https://arxiv.org/pdf/1802.07595.pdf) and they have a PyTorch implementation: https://github.com/oval-group/smooth-topk
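To make the "smooth" part concrete, here is a minimal sketch of the top-1 special case: a temperature-smoothed multiclass SVM loss. This is my own didactic version based on the paper, not the repo's actual top-k implementation, so the function name and arguments are just illustrative; the full top-k case needs the more involved computation they implement in the repo.

```python
import torch

def smooth_top1_svm_loss(scores, targets, tau=1.0, margin=1.0):
    """Smoothed top-1 multiclass SVM loss (illustrative sketch only).

    scores:  (batch, n_classes) raw logits
    targets: (batch,) integer class labels
    """
    # add the margin to every wrong class, 0 to the correct class
    delta = torch.full_like(scores, margin)
    delta.scatter_(1, targets.unsqueeze(1), 0.0)
    # smooth max over classes: tau * logsumexp((s_j + delta_j) / tau)
    smooth_max = tau * torch.logsumexp((scores + delta) / tau, dim=1)
    # subtract the score of the correct class; recovers the usual hinge as tau -> 0
    correct = scores.gather(1, targets.unsqueeze(1)).squeeze(1)
    return (smooth_max - correct).mean()
```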
trying to figure out how to 'task optimize' a network for object recognition when multiple objects can be in an image (which, let's face it, is always the case)
seems like it does not make sense to just have a softmax and train with cross-entropy, because this enforces exactly one winning prediction per image. One can argue about whether the brain does this, but for our model, we don't want to disadvantage it and then have it look like its lower accuracy correlates with Visual Search Difficulty scores just because it can never be perfect. Better to train the "best" way and still see an effect of difficulty.
also doesn't make sense to have the output be "present / absent", because then we would be training it to output "present" only for labels that are in the training set (ignoring a class of object / target that might actually be in the image but that we just haven't labeled)
Hence it seems like we still want multi-label classification; see the sketch below for what that training setup looks like.
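To make the contrast concrete, here is a minimal sketch of multi-label training with a vanilla CNN: targets are multi-hot vectors instead of single class indices, and the loss is a per-class sigmoid + binary cross-entropy instead of a softmax over classes. The backbone, number of classes, and label indices below are just placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

N_CLASSES = 20  # e.g., the 20 Pascal VOC object categories

# any off-the-shelf CNN backbone works; resnet18 is just a placeholder
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, N_CLASSES)

# multi-label: one sigmoid per class, no softmax over classes
criterion = nn.BCEWithLogitsLoss()

# dummy batch; targets are multi-hot, so multiple positives per image is fine
images = torch.randn(4, 3, 224, 224)
targets = torch.zeros(4, N_CLASSES)
targets[0, [11, 14]] = 1.0  # e.g., image 0 contains a dog and a person (indices illustrative)

logits = model(images)
loss = criterion(logits, targets)
loss.backward()

# at test time, threshold each class independently instead of taking an argmax
preds = (torch.sigmoid(logits) > 0.5).float()
```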
Can't tell what SoTA is, though, for training out-of-the-box CNNs for multi-label image classification: there are just a bunch of fancy methods, without anyone showing directly how bad a vanilla CNN is.
BinaryCrossEntropy seems to be the standard. WARP is one method that papers proposing fancy methods point to. The key idea is to learn a ranking using a sampling strategy, so that we efficiently learn to rank positives higher than negatives: https://www.aaai.org/ocs/index.php/IJCAI/IJCAI11/paper/view/2926/3666 This paper applied a WARP-like loss to CNNs: https://arxiv.org/pdf/1312.4894.pdf (see the sketch below). An alternative is a pairwise ranking loss: http://openaccess.thecvf.com/content_cvpr_2017/papers/Li_Improving_Pairwise_Ranking_CVPR_2017_paper.pdf but this looks more involved: it is not trained end-to-end and uses a separate classifier.
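For reference, here is a hedged sketch of the WARP idea applied to multi-label outputs: for each positive label, sample negative labels until one violates the margin, use the number of trials to estimate that positive's rank, and weight the hinge by a rank-dependent factor. This is a didactic, loop-heavy version, not an exact reproduction of the papers' implementations.

```python
import torch

def warp_loss(logits, targets, margin=1.0, max_trials=None):
    """Sketch of a WARP-style loss for multi-label outputs.

    logits:  (batch, n_classes) raw scores from the network
    targets: (batch, n_classes) multi-hot {0, 1} labels
    """
    batch_size, n_classes = logits.shape
    if max_trials is None:
        max_trials = n_classes - 1
    # rank -> weight table: L(k) = 1 + 1/2 + ... + 1/k, so low-ranked positives cost more
    rank_weights = torch.cumsum(1.0 / torch.arange(1, n_classes + 1), dim=0).to(logits.device)

    total = logits.new_zeros(())
    n_terms = 0
    for i in range(batch_size):
        pos_idx = targets[i].nonzero(as_tuple=True)[0]
        neg_idx = (targets[i] == 0).nonzero(as_tuple=True)[0]
        if len(pos_idx) == 0 or len(neg_idx) == 0:
            continue
        for p in pos_idx:
            # sample negatives until one comes within the margin of this positive
            for trial in range(1, max_trials + 1):
                n = neg_idx[torch.randint(len(neg_idx), (1,)).item()]
                violation = margin - logits[i, p] + logits[i, n]
                if violation > 0:
                    # fewer trials needed -> the positive is ranked lower -> larger weight
                    est_rank = max(len(neg_idx) // trial, 1)
                    total = total + rank_weights[est_rank - 1] * violation
                    n_terms += 1
                    break
    return total / max(n_terms, 1)
```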
Some potentially useful implementations of different losses for recommender systems are here: https://github.com/maciejkula/spotlight/blob/master/spotlight/losses.py They call WARP "adaptive hinge loss" but cite the paper: https://github.com/maciejkula/spotlight/blob/75f4c8c55090771b52b88ef1a00f75bb39f9f2a9/spotlight/losses.py#L127
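For what it's worth, the "adaptive hinge" variant is simpler than full WARP: sample a handful of negatives, take the hinge against only the hardest one, and skip the rank-based weighting. Below is a hedged sketch adapted to our multi-label image setting; spotlight's actual function operates on recommender-style positive/negative prediction tensors, so this is an adaptation of the idea, not their API.

```python
import torch

def adaptive_hinge_loss(logits, targets, n_samples=5, margin=1.0):
    """Sketch of an adaptive-hinge-style loss for multi-hot targets.

    For each image, compare the weakest (lowest-scoring) positive label against
    the hardest (highest-scoring) of `n_samples` randomly sampled negative labels.
    """
    batch_size, n_classes = logits.shape
    losses = []
    for i in range(batch_size):
        pos = targets[i] > 0
        neg = ~pos
        if pos.sum() == 0 or neg.sum() == 0:
            continue
        pos_scores = logits[i][pos]
        neg_scores = logits[i][neg]
        # sample a few negatives and keep only the hardest one
        idx = torch.randint(len(neg_scores), (n_samples,))
        hardest_neg = neg_scores[idx].max()
        # hinge on the weakest positive vs. the hardest sampled negative
        losses.append(torch.clamp(margin - pos_scores.min() + hardest_neg, min=0.0))
    return torch.stack(losses).mean() if losses else logits.new_zeros(())
```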