TomerRonen34 / mixed-resolution-vit

48 stars 2 forks source link

Vision Transformers with Mixed-Resolution Tokenization

Official repo for https://arxiv.org/abs/2304.00287 (CVPRW 2023 oral).

The Quadformer

Current release: Quadtree implementations and image tokenizers.

To be released: inference code and trained models.

Let me know if you're also interested in the Grad-CAM oracle code, or training code for reproducing the experiments in the paper (based on the timm library).

Setup

Install torch and torchvision, e.g. by following the official instructions.

Examples

See notebooks under examples/:

Patch scorers

We provide implementations for several patch scorers:

Quadtree implementations

We provide 3 different GPU-friendly implementations of the Saliency-Based Quadtree algorithm. They produce identical results. They share the same batchified code for image patchifying and patch scoring, and differ in their implementation of the patch-splitting logic.

  1. Dict-Lookup Quadtree \ When splitting a patch, its children are retrieved from a dictionary by their coordinates. This is perhaps the easiest implementation to read (see mixed_res.quadtree_impl.quadtree_dict_lookup.Quadtree.run), but it's also the slowest one as the actual splitting logic isn't batchified.

  2. Tensor-Lookup Quadtree \ Similar to the dict-lookup Quadtree, but the box indices are kept in a tensor lookup table, where tensor indices correspond to patch location and scale. Much faster than the dict-lookup Quadtree.

  3. Z-Curve Quadtree \ Uses an indexing scheme based on z-order curves, where the children of a patch can be found via a simple arithmetic operation. This is our fastest implementation. Unfortunately, it only works for image sizes that are a power of 2, e.g. 256 pixels.