Zasder3 / train-CLIP

A PyTorch Lightning solution to training OpenAI's CLIP from scratch.

Problem related to encoding text #11

Closed · styler00dollar closed this issue 3 years ago

styler00dollar commented 3 years ago

I am trying to use a resnet50 model that I created with this repo, but I can't encode text.

with torch.no_grad():
    tmp = clip.tokenize("test")
    tmp = tmp.to(device)
    print(tmp)
    print(tmp.shape)
    text_encoded = model.model.encode_text(tmp)
tensor([[49406,  1628, 49407,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]], device='cuda:0')
torch.Size([1, 77])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-68003eb3bebb> in <module>()
      9     print(tmp)
     10     print(tmp.shape)
---> 11     text_encoded = model.model.encode_text(tmp)
     12 

2 frames
/content/train-CLIP/models/model.py in encode_text(self, text)
    343         x = x + self.positional_embedding.type(self.dtype)
    344         x = x.permute(1, 0, 2)  # NLD -> LND
--> 345         x = self.transformer(x)
    346         x = x.permute(1, 0, 2)  # LND -> NLD
    347         x = self.ln_final(x).type(self.dtype)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    937         elif input_ids is not None:
    938             input_shape = input_ids.size()
--> 939             batch_size, seq_length = input_shape
    940         elif inputs_embeds is not None:
    941             input_shape = inputs_embeds.size()[:-1]

ValueError: too many values to unpack (expected 2)

Printing x before self.transformer(x) results in torch.Size([77, 1, 512]).

The input shape torch.Size([1, 77]) matches the original CLIP code, and a model loaded with clip itself works without major problems.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

image = preprocess(Image.open("/test.png")).unsqueeze(0).to(device)
text = clip.tokenize(["test"]).to(device)
print(text)
print(text.shape)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
tensor([[49406,  1628, 49407,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]], device='cuda:0')
torch.Size([1, 77])

I'm not sure what I am doing wrong, since encoding images seems to work fine with this repo.

with torch.no_grad():
    photos_features = model.model.encode_image(image)
    photos_features /= photos_features.norm(dim=-1, keepdim=True)

print(photos_features.shape)
torch.Size([1, 768])
Zasder3 commented 3 years ago

Sadly, I'm currently unable to reproduce this issue:

[Screenshot from 2021-07-09 14-34-31]

Would you mind sharing which training script you used and how you initialize the model?

Zasder3 commented 3 years ago

After looking further, it seems that your text transformer comes from Hugging Face's transformers library. Here's an example of how to tokenize and predict using that model:

encoded_text = tokenizer(sentence_list, return_tensors='pt')
model.encode_text(encoded_text)
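
For a more complete, self-contained version (the checkpoint name below is only a placeholder; use whichever Hugging Face text model you trained with):

from transformers import AutoTokenizer

# Placeholder checkpoint; substitute whichever Hugging Face text model you trained with.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence_list = ["a photo of a dog", "a photo of a cat"]

# The tokenizer returns a BatchEncoding (input_ids, attention_mask, ...),
# which is what model.encode_text expects here, not a CLIP-style tensor of token ids.
encoded_text = tokenizer(sentence_list, padding=True, return_tensors='pt')

# `model` is the fine-tuned wrapper from train_finetune.py; move the encoding to its
# device first if you are running on GPU, e.g. encoded_text.to(model.device).
text_features = model.encode_text(encoded_text)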

Does this fix your problem?

styler00dollar commented 3 years ago

Here is a Google Colab notebook to replicate the issue. Upload the file to Google Colab, or change the paths to run it locally with Jupyter. I also saved all error messages in that notebook. My assumption is that it may be related to pip package versions. CLIP_bug.zip

The only major thing I added was a checkpoint.py that saves .pth files during training. I use the code from train_finetune.py to create the model and then load a state dict into it. Since the checkpoint contains both "model" and "teacher", I call model.model to reach the actual model.
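
Roughly, checkpoint.py is just a small Lightning callback along these lines (the hook, class name, and paths are illustrative, not the exact file):

import os
import torch
import pytorch_lightning as pl

class CheckpointEveryEpoch(pl.Callback):
    # Illustrative sketch: dump the whole wrapper's state dict to a .pth file each epoch.
    def __init__(self, save_dir="checkpoints"):
        self.save_dir = save_dir

    def on_train_epoch_end(self, trainer, pl_module):
        os.makedirs(self.save_dir, exist_ok=True)
        path = os.path.join(self.save_dir, f"epoch_{trainer.current_epoch}.pth")
        torch.save(pl_module.state_dict(), path)

The loading side then looks like this: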

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

from models import CustomCLIPWrapper
from transformers import AutoTokenizer, AutoModel

# Text encoder and tokenizer, same as in train_finetune.py
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-sci-base")
txt_encoder = AutoModel.from_pretrained("johngiorgi/declutr-sci-base")

# Image encoder: ResNet-50 with its head replaced to output 768-d features
from torchvision.models import resnet50
img_encoder = resnet50(pretrained=True)
img_encoder.fc = torch.nn.Linear(2048, 768)

# Rebuild the wrapper and load the weights saved by checkpoint.py
model = CustomCLIPWrapper(img_encoder, txt_encoder, 0, avg_word_embs=True)
model.load_state_dict(torch.load("/content/test.pth"))
model.to(device)

with torch.no_grad():
    tmp = clip.tokenize(["test"])
    tmp = tmp.to(device)
    print(tmp)
    print(tmp.shape)
    text_encoded = model.model.encode_text(tmp)

I tried the suggested code, but I got this instead.

with torch.no_grad():
    encoded_text = tokenizer(["test"], return_tensors='pt')
    model.model.encode_text(encoded_text)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-ffb3e976c349> in <module>()
      1 with torch.no_grad():
      2     encoded_text = tokenizer("test", return_tensors='pt')
----> 3     model.model.encode_text(encoded_text)

/content/train-CLIP/models/model.py in encode_text(self, text)
    339 
    340     def encode_text(self, text):
--> 341         x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]
    342 
    343         x = x + self.positional_embedding.type(self.dtype)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py in forward(self, input)
    158         return F.embedding(
    159             input, self.weight, self.padding_idx, self.max_norm,
--> 160             self.norm_type, self.scale_grad_by_freq, self.sparse)
    161 
    162     def extra_repr(self) -> str:

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2041         # remove once script supports set_grad_enabled
   2042         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2043     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2044 
   2045 

TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not BatchEncoding
Zasder3 commented 3 years ago

The script train_finetune.py trains the second class in wrapper.py, called CustomCLIPWrapper. This class already has its own text-embedding function attached. You don't need to call model.model.encode_text, since HF models are called differently. The function you are calling is the original CLIP model's forward pass, which fails because it expects a different tokenization protocol. If you call model.encode_text with the right tokenizer, all should be good!
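
Concretely, when avg_word_embs=True the wrapper's encode_text is roughly an attention-mask-weighted average over the HF encoder's token outputs, something like this sketch (a paraphrase, not the literal wrapper.py source):

import torch

def encode_text_sketch(hf_model, inputs):
    # `inputs` is the BatchEncoding produced by the HF tokenizer.
    token_embs = hf_model(**inputs)[0]                     # [batch, seq_len, hidden]
    mask = inputs["attention_mask"].unsqueeze(-1).float()  # [batch, seq_len, 1]
    summed = (token_embs * mask).sum(dim=1)                # zero out padding positions
    counts = mask.sum(dim=1).clamp(min=1e-9)               # real tokens per sentence
    return summed / counts                                 # [batch, hidden]

That is why it expects the tokenizer's dict-like output rather than CLIP's [batch, 77] tensor of token ids.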

styler00dollar commented 3 years ago

I did not notice that there are two functions called encode_text and assumed there was only one.

with torch.no_grad():
    encoded_text = tokenizer(["test"], return_tensors='pt').to(device)
    result = model.encode_text(encoded_text)
    print(result)
tensor([[-7.9948e-01,  3.2338e-01,  1.7573e-01, -4.5223e-01, -2.1422e-01,
          3.6682e-02, -8.9392e-02, -1.0695e+00, -3.5576e-01,  1.2232e+00,
...

It seems to work, thank you.
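
For reference, combining this with the image side (assuming, as in the setup above, that both encoders project into the same 768-dimensional space) looks roughly like this:

with torch.no_grad():
    # `image` is a preprocessed image tensor of shape [1, 3, H, W], as earlier in the thread.
    image_features = model.model.encode_image(image)
    text_features = model.encode_text(tokenizer(["test"], return_tensors='pt').to(device))

    # Normalize and compare via cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    print(image_features @ text_features.T)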

Zasder3 commented 3 years ago

Happy to help!