dais-ita / interpretability-papers

Papers on interpretable deep learning, for review

Generating Visual Explanations #38

Open richardtomsett opened 6 years ago

richardtomsett commented 6 years ago

Generating Visual Explanations

Clearly explaining a rationale for a classification decision to an end user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image aspects which justify visual predictions. We propose a new model that focuses on the discriminating properties of the visible object, jointly predicts a class label, and explains why the predicted label is appropriate for the image. Through a novel loss function based on sampling and reinforcement learning, our model learns to generate sentences that realize a global sentence property, such as class specificity. Our results on the CUB dataset show that our model is able to generate explanations which are not only consistent with an image but also more discriminative than descriptions produced by existing captioning methods.

Bibtex:

@Inbook{Hendricks2016,
  author="Hendricks, Lisa Anne and Akata, Zeynep and Rohrbach, Marcus and Donahue, Jeff and Schiele, Bernt and Darrell, Trevor",
  editor="Leibe, Bastian and Matas, Jiri and Sebe, Nicu and Welling, Max",
  title="Generating Visual Explanations",
  bookTitle="Computer Vision -- ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part IV",
  year="2016",
  publisher="Springer International Publishing",
  pages="3--19",
  url="https://doi.org/10.1007/978-3-319-46493-0_1"
}
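The loss the abstract refers to combines a standard relevance (captioning) term with a discriminative term optimized via sampling and REINFORCE, so that sampled sentences that a separate sentence classifier finds class-specific are rewarded. Below is a minimal PyTorch sketch of that idea; `caption_model`, `sentence_classifier`, and their methods are assumed placeholders, not the authors' implementation, and the weighting is an assumed hyperparameter.

```python
# Minimal sketch of a sampling/REINFORCE-style "class specificity" loss.
# caption_model and sentence_classifier are hypothetical objects standing in
# for the caption generator and a pretrained sentence classifier.
import torch
import torch.nn.functional as F

def discriminative_loss(caption_model, sentence_classifier, image_feats, labels):
    # Sample a sentence from the caption model, keeping per-word log-probs.
    tokens, log_probs = caption_model.sample(image_feats)        # (B, T), (B, T)
    # Reward: probability the sentence classifier assigns to the true class
    # for the sampled sentence (computed without gradients).
    with torch.no_grad():
        class_logits = sentence_classifier(tokens)               # (B, C)
        reward = F.softmax(class_logits, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    # REINFORCE: scale the sentence log-likelihood by the reward so that
    # class-discriminative sentences become more likely under the model.
    return -(reward * log_probs.sum(dim=1)).mean()

def total_loss(caption_model, sentence_classifier, image_feats, labels,
               gt_captions, lambda_disc=0.5):
    # Relevance term: ordinary cross-entropy against reference captions,
    # keeping the generated text grounded in the image content.
    relevance = caption_model.cross_entropy(image_feats, gt_captions)
    return relevance + lambda_disc * discriminative_loss(
        caption_model, sentence_classifier, image_feats, labels)
```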

richardtomsett commented 6 years ago

From previous review: Recent work by Hendricks et al. (2016) aims to provide a model that both classifies images and provides accurate text explanations of why an image belongs to a particular category. Their explanation generation method is inspired by recent advances in automatic captioning, which aims to produce appropriate text descriptions of images or videos. Hendricks et al. (2016) build on Donahue et al.'s combined captioning and recognition method (Donahue et al. 2017). Their model consists of a CNN that extracts visual features, combined with two LSTM RNNs that learn to generate a caption. The first LSTM, trained on the image descriptions, generates words conditioned only on the previously generated word; the second LSTM is fed the first LSTM's output, the image features, and the image category predicted by the CNN, and generates the next word conditioned on this combined input. They show that this method generates image- and class-relevant explanations for classification decisions on a difficult bird-species discrimination task. The results are impressive; however, the model does not guarantee that the descriptions it learns correspond to the visual features that humans would interpret them as referring to, and it does not provide a way to check this.
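For concreteness, here is a minimal PyTorch sketch of the stacked two-LSTM decoder described above, assuming precomputed CNN image features and a class label predicted by the CNN; layer names and dimensions are illustrative assumptions, not taken from the paper's code.

```python
# Sketch of a two-LSTM explanation decoder: the first LSTM conditions only
# on the previous word, the second also sees image features and the
# predicted class. Dimensions are illustrative defaults.
import torch
import torch.nn as nn

class ExplanationDecoder(nn.Module):
    def __init__(self, vocab_size, num_classes, embed_dim=256,
                 hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.class_embed = nn.Embedding(num_classes, embed_dim)
        # First LSTM: input is only the previously generated word.
        self.lstm1 = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Second LSTM: input is the first LSTM's output concatenated with
        # the image features and the embedded predicted class.
        self.lstm2 = nn.LSTM(hidden_dim + feat_dim + embed_dim,
                             hidden_dim, batch_first=True)
        self.vocab_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_words, image_feats, pred_class):
        # prev_words: (B, T) word ids; image_feats: (B, feat_dim);
        # pred_class: (B,) class ids predicted by the CNN.
        h1, _ = self.lstm1(self.word_embed(prev_words))           # (B, T, H)
        T = prev_words.size(1)
        context = torch.cat(
            [image_feats.unsqueeze(1).expand(-1, T, -1),
             self.class_embed(pred_class).unsqueeze(1).expand(-1, T, -1)],
            dim=-1)                                               # (B, T, feat+emb)
        h2, _ = self.lstm2(torch.cat([h1, context], dim=-1))      # (B, T, H)
        return self.vocab_out(h2)                                 # next-word logits
```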