This repo contains code, data, and models for the Sherlock corpus. If you find the paper, corpus, and models interesting or helpful for your own work, please consider citing:
@inproceedings{hesselhwang2022abduction,
title={{The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning}},
author={*Hessel, Jack and *Hwang, Jena D and Park, Jae Sung and Zellers, Rowan and Bhagavatula, Chandra and Rohrbach, Anna and Saenko, Kate and Choi, Yejin},
booktitle={ECCV},
year={2022}
}
We do not publicly release the test set labels, but we do have a leaderboard. See the leaderboard section for more details. In our experience, results on the validation and test sets are quite similar.
We collected a large corpus of abductive inferences over images. Abductive reasoning is the act of reasoning about plausible inferences under uncertainty. Our corpus consists of 363K inferences across 103K images. Each inference is grounded in its image via a bounding box. Our model predicts an abductive inference given an image and a bounding box. Example predictions from one of our best-performing models, alongside the human annotations, are given here:
The images for Sherlock are sourced from VisualGenome and VCR: if you find the Sherlock corpus useful, please cite those works as well! To train a new model or get predictions on the validation/test sets, you will have to download these images locally. Please do not download the images from the URLs contained in the data we release; instead, use:
In addition, we release:
We release several pieces of code:
- training_code contains the scripts we used to train the CLIP-style models from the paper.
- demo contains a jupyter notebook that you can use to explore the predictions of a pretrained model.
- leaderboard_eval contains the official evaluation scripts, alongside leaderboard submission details.

We release four pretrained versions of CLIP, fit to the Sherlock corpus. As detailed in the paper, the model is trained using InfoNCE, and it incorporates bounding boxes as input by drawing the box directly on the image in pixel space. The most performant model is RN50x64-multitask; the fastest model is ViT/B-16.
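As a rough illustration of the box-in-pixel-space idea, the sketch below highlights a region with PIL and scores the modified image against a candidate inference using an off-the-shelf openai/CLIP model. The highlight style, model choice, and example strings here are assumptions for illustration, not the exact recipe from the paper or the released training_code.

```python
import clip
import torch
from PIL import Image, ImageDraw

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Draw the evidence region directly on the image in pixel space.
# The box coordinates and drawing style are illustrative assumptions.
image = Image.open("example.jpg").convert("RGB")
left, top, width, height = 40, 60, 200, 150  # hypothetical bounding box
draw = ImageDraw.Draw(image)
draw.rectangle([left, top, left + width, top + height], outline="red", width=4)

# Score the region-highlighted image against a candidate inference.
image_input = preprocess(image).unsqueeze(0).to(device)
text_input = clip.tokenize(["the person is waiting for a bus"]).to(device)
with torch.no_grad():
    image_feat = model.encode_image(image_input)
    text_feat = model.encode_text(text_input)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    similarity = (image_feat @ text_feat.T).item()
print(f"image-inference similarity: {similarity:.3f}")
```

During training, InfoNCE treats matched (region-highlighted image, inference) pairs as positives and in-batch mismatches as negatives.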
The checkpoints we release are:

- ViT/B-16 (571M download)
- RN50x16 (1.1G download)
- RN50x64 (2.3G download)
- RN50x64-multitask (2.3G download)

See the demo jupyter notebook for usage, and leaderboard_eval for the official evaluation scripts.
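If you just want to get a model into memory quickly, here is a minimal sketch, assuming the released checkpoint is a standard PyTorch state_dict compatible with the corresponding openai/CLIP architecture; the filename and state_dict layout below are assumptions, and the demo notebook remains the supported loading path.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Start from the matching CLIP architecture, then load the Sherlock weights.
# The checkpoint filename and its exact contents are assumptions for
# illustration; follow the demo notebook for the officially supported path.
model, preprocess = clip.load("ViT-B/16", device=device, jit=False)
state = torch.load("sherlock_vitb16.pt", map_location=device)
model.load_state_dict(state.get("model_state_dict", state))
model.eval()
```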
Currently, the Sherlock corpus is at version 1.1. Version 1.0 of the train/validation sets can be downloaded here. The models in the paper were mostly trained on the v1.0 corpora, but we observe very little difference in practice. We recommend using version 1.1 in all cases, unless you are specifically interested in exactly replicating the corpora the model checkpoints were trained on.
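Whichever version you download, a minimal sketch of iterating over an annotation file is below; the filename and field names ("inputs", "bboxes", "targets", etc.) are illustrative assumptions rather than the official schema, so inspect the released JSON before relying on them.

```python
import json

# Hypothetical path to a downloaded annotation file; adjust to wherever you
# unpack the Sherlock data. Field names are assumptions for illustration.
with open("sherlock_train_v1_1.json") as f:
    annotations = json.load(f)

for ann in annotations[:5]:
    image_url = ann["inputs"]["image"]["url"]   # source image (VisualGenome/VCR)
    bboxes = ann["inputs"]["bboxes"]            # region(s) the inference is grounded in
    inference = ann["targets"]["inference"]     # the abductive inference text
    print(image_url, bboxes, inference)
```

Each entry can then be paired with its locally downloaded VisualGenome/VCR image for training or evaluation.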
Sherlock (codebase) is licensed under the Apache License 2.0 (see CODE_LICENSE). Sherlock (dataset) is licensed under CC-BY (see DATASET_LICENSE).