SunzeY / AlphaCLIP

[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
https://aleafy.github.io/alpha-clip
Apache License 2.0
deep-learning machine-learning vision-and-language vision-language vision-language-model vision-transformer

Alpha-CLIP

This repository is the official implementation of Alpha-CLIP.

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Zeyi Sun*, Ye Fang*, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

*Equal Contribution

Demo of Alpha-CLIP with Stable Diffusion: Hugging Face Spaces, OpenXLab

Demo of Alpha-CLIP with LLaVA: Hugging Face Spaces, OpenXLab

📜 News

🚀 [2024/7/19] We have released the training code as well as the MaskImageNet data!

🚀 [2024/3/4] CLIP-L/14@336px fine-tuned on GRIT-20M is available; check out the model-zoo!

🚀 [2024/2/27] Our paper Alpha-CLIP has been accepted by CVPR'24!

🚀 [2024/1/2] Zero-shot testing code for ImageNet-S Classification and Referring Expression Comprehension is released!

🚀 [2023/12/27] Web demo and local demo of Alpha-CLIP with LLaVA are released!

🚀 [2023/12/7] Web demo and local demo of Alpha-CLIP with Stable Diffusion are released!

🚀 [2023/12/7] The paper and project page are released!

💡 Highlights

👨‍💻 Todo

🛠️ Usage

Installation

Our model is based on CLIP. Please prepare the environment for CLIP first, then install Alpha-CLIP directly:

pip install -e .

Then install loralib:

pip install loralib

How to use

Download a model from the model-zoo and place it under checkpoints.

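import alpha_clip

# load the CLIP backbone together with the Alpha-CLIP vision checkpoint
model, preprocess = alpha_clip.load("ViT-B/16", alpha_vision_ckpt_pth="checkpoints/clip_b16_grit1m_fultune_8xe.pth", device="cpu")
# image: preprocessed image batch; alpha: normalized mask batch of the same spatial size
image_features = model.visual(image, alpha)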

When starting from a binary_mask with values 0 and 1, alpha needs to be normalized via transforms:

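import numpy as np
from torchvision import transforms

mask_transform = transforms.Compose([
    transforms.ToTensor(), 
    transforms.Resize((224, 224)),
    transforms.Normalize(0.5, 0.26)
])
# cast to uint8 so that ToTensor rescales the 0/255 mask to [0, 1]
alpha = mask_transform((binary_mask * 255).astype(np.uint8))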
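
To make the effect of this transform concrete, the sketch below reproduces the per-pixel arithmetic in plain Python, assuming a uint8 mask whose pixels are either 0 or 255. ToTensor rescales uint8 values to [0, 1], and Normalize(0.5, 0.26) then subtracts 0.5 and divides by 0.26. The function name normalize_alpha_value is hypothetical and not part of alpha_clip; resizing is left out because it does not change the values of a uniform region.

```python
def normalize_alpha_value(pixel: int, mean: float = 0.5, std: float = 0.26) -> float:
    """Per-pixel equivalent of ToTensor() followed by Normalize(mean, std) for a uint8 mask."""
    scaled = pixel / 255.0        # ToTensor: uint8 [0, 255] -> float [0.0, 1.0]
    return (scaled - mean) / std  # Normalize: (x - mean) / std


background = normalize_alpha_value(0)    # pixel outside the mask
foreground = normalize_alpha_value(255)  # pixel inside the mask
print(round(background, 3), round(foreground, 3))
```

Under these assumptions, masked-out pixels end up at roughly -1.923 and masked-in pixels at roughly 1.923, so the alpha channel is symmetric around zero.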

Training

Please refer to here

Zero-shot Prediction

import torch
import alpha_clip
from PIL import Image
import numpy as np
from torchvision import transforms

# load model and prepare mask transform
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = alpha_clip.load("ViT-L/14", alpha_vision_ckpt_pth="./checkpoints/clip_l14_grit20m_fultune_2xe.pth", device=device)  # change to your own ckpt path
mask_transform = transforms.Compose([
    transforms.ToTensor(), 
    transforms.Resize((224, 224)), # change to (336,336) when using ViT-L/14@336px
    transforms.Normalize(0.5, 0.26)
])

# prepare image and mask
img_pth = './examples/image.png'
mask_pth = './examples/dress_mask.png' # image-type mask

image = Image.open(img_pth).convert('RGB')
mask = np.array(Image.open(mask_pth)) 
# get `binary_mask` array (2-dimensional bool matrix)
if len(mask.shape) == 2: binary_mask = (mask == 255)
if len(mask.shape) == 3: binary_mask = (mask[:, :, 0] == 255)

alpha = mask_transform((binary_mask * 255).astype(np.uint8))
alpha = alpha.half().to(device).unsqueeze(dim=0)  # use `device` so the CPU fallback also works

# calculate image and text features
image = preprocess(image).unsqueeze(0).half().to(device)
text = alpha_clip.tokenize(["a goegously dressed woman", "a purple sleeveness dress", "bouquet of pink flowers"]).to(device)

with torch.no_grad():
    image_features = model.visual(image, alpha)
    text_features = model.encode_text(text)

# normalize
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

## print the result
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", similarity.cpu().numpy()) # prints: [[9.388e-05 9.995e-01 2.415e-04]]

Note: Use .half() on the input tensors or .float() on the model so that their dtypes match.
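
The "Label probs" line in the example above is a softmax over cosine similarities scaled by 100. The following is a minimal plain-Python sketch of that final step, independent of torch. The function name label_probs and the similarity values are hypothetical, chosen only for illustration; they are not outputs of the model.

```python
import math


def label_probs(similarities, scale=100.0):
    """Softmax over scaled cosine similarities (one image against several texts)."""
    logits = [scale * s for s in similarities]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities between one image and three text prompts.
probs = label_probs([0.18, 0.27, 0.19])
best = max(range(len(probs)), key=probs.__getitem__)
print("best prompt index:", best)
```

Because of the scale factor of 100, even small gaps in cosine similarity turn into a sharply peaked distribution, which is why one label can receive a probability close to 1.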

More usage examples are available:

⭐ Demos

❤️ Acknowledgments

✒️ Citation

If you find our work helpful for your research, please consider giving a star ⭐ and citation 📝

@misc{sun2023alphaclip,
      title={Alpha-CLIP: A CLIP Model Focusing on Wherever You Want}, 
      author={Zeyi Sun and Ye Fang and Tong Wu and Pan Zhang and Yuhang Zang and Shu Kong and Yuanjun Xiong and Dahua Lin and Jiaqi Wang},
      year={2023},
      eprint={2312.03818},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

Usage and License Notices: The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreement of CLIP. The dataset is licensed under CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.