beichenzbc / Long-CLIP

[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"
Apache License 2.0

continuing pre-training #50

Closed: mikelee-dev closed this issue 3 months ago

mikelee-dev commented 4 months ago

Hi, I would like to continue your pre-training.

Is there a README that describes how to load your pre-trained Long-CLIP model checkpoint? I'm looking for a checkpoint based on "ViT-L/14@336px".

Thanks!

beichenzbc commented 4 months ago

Hi, you may refer to this code:

from model import longclip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a released Long-CLIP checkpoint (e.g. longclip-B.pt or longclip-L.pt).
model, preprocess = longclip.load("./checkpoints/longclip-B.pt", device=device)

# Tokenize the candidate captions and preprocess the query image.
text = longclip.tokenize(["A man is crossing the street with a red car parked nearby.", "A man is driving a car in an urban scene."]).to(device)
image = preprocess(Image.open("./img/demo.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Image-text similarity scores, softmaxed into per-caption probabilities.
    logits_per_image = image_features @ text_features.T
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)
mikelee-dev commented 4 months ago

Sorry, I meant training, not zero-shot inference.

beichenzbc commented 4 months ago

Got it. Then you can change the 'load_from_clip' function in train/train.py to 'longclip.load', and rewrite the dataset loader.
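
For example, a minimal sketch of that change (the checkpoint path, MyPairDataset, and the (image_path, caption) annotation format are placeholders; adapt them to your data and to the actual structure of train/train.py):

from model import longclip
from torch.utils.data import Dataset
from PIL import Image

# In train/train.py, replace the CLIP-based initialization, e.g.
#   model, _ = load_from_clip("ViT-L/14", device="cpu")
# with a direct load of the released Long-CLIP checkpoint:
model, preprocess = longclip.load("./checkpoints/longclip-L.pt", device="cpu")

# Hypothetical dataset loader yielding (preprocessed image, tokenized caption)
# pairs; rewrite __getitem__ to match your own annotation format.
class MyPairDataset(Dataset):
    def __init__(self, pairs, preprocess):
        self.pairs = pairs          # list of (image_path, caption) tuples
        self.preprocess = preprocess

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        path, caption = self.pairs[idx]
        image = self.preprocess(Image.open(path).convert("RGB"))
        text = longclip.tokenize([caption]).squeeze(0)
        return image, text

The rest of the training loop can stay as it is, since the model returned by longclip.load exposes the same encode_image/encode_text interface used above.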