I'm currently unable to reproduce this issue, sadly. Would you mind sharing which training script you used, and how you initialize `model`?
After looking further, it seems that your text transformer comes from Hugging Face's `transformers` library. Here's an example of how to tokenize and predict using that model:

```python
encoded_text = tokenizer(sentence_list, return_tensors='pt')
model.encode_text(encoded_text)
```
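(Expanded into a self-contained sketch for reference: `sentence_list` is any list of strings, and the tokenizer checkpoint is assumed to be the `declutr-sci-base` one used later in this thread.)

```python
from transformers import AutoTokenizer

# Assumed tokenizer; substitute whichever checkpoint the model was trained with.
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-sci-base")

sentence_list = ["a photo of a cat", "a photo of a dog"]
# padding/truncation are needed so sentences of different lengths batch into one tensor
encoded_text = tokenizer(sentence_list, return_tensors='pt',
                         padding=True, truncation=True)
features = model.encode_text(encoded_text)  # model: the CustomCLIPWrapper built below
```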
Does this fix your problem?
Here is a Google Colab notebook to replicate the issue. Upload the file to Google Colab, or change the paths to run it locally with Jupyter. I also saved all the error messages in that notebook. My assumption is that it may be related to pip package versions: CLIP_bug.zip
The only major thing I added was `checkpoint.py` to save `.pth` files during training. I use the code from `train_finetune.py` to create the model and load a state dict into that model. Since you have "model" and "teacher" in a checkpoint, I do `model.model` to use the actual model.
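(For illustration, a minimal sketch of what such a `checkpoint.py` could look like, assuming the repo's PyTorch Lightning training loop; the callback name and save path are hypothetical:)

```python
import torch
import pytorch_lightning as pl

class CheckpointEveryEpoch(pl.Callback):
    """Hypothetical callback that dumps the wrapper's weights to a .pth file."""

    def __init__(self, save_path="checkpoints/model_{epoch}.pth"):
        self.save_path = save_path

    def on_train_epoch_end(self, trainer, pl_module):
        # pl_module is the CustomCLIPWrapper; its state dict contains the
        # "model.*" and "teacher.*" entries mentioned above.
        torch.save(pl_module.state_dict(),
                   self.save_path.format(epoch=trainer.current_epoch))
```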
```python
import clip
import torch
from models import CustomCLIPWrapper
from transformers import AutoTokenizer, AutoModel
from torchvision.models import resnet50

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text encoder and tokenizer from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-sci-base")
txt_encoder = AutoModel.from_pretrained("johngiorgi/declutr-sci-base")

# Image encoder: ResNet-50 with the classifier head replaced by a projection
img_encoder = resnet50(pretrained=True)
img_encoder.fc = torch.nn.Linear(2048, 768)

model = CustomCLIPWrapper(img_encoder, txt_encoder, 0, avg_word_embs=True)
model.load_state_dict(torch.load("/content/test.pth"))
model.to(device)

with torch.no_grad():
    tmp = clip.tokenize(["test"])
    tmp = tmp.to(device)
    print(tmp)
    print(tmp.shape)
    text_encoded = model.model.encode_text(tmp)
```
I tried the suggested code, but I got this instead.
```python
with torch.no_grad():
    encoded_text = tokenizer(["test"], return_tensors='pt')
    model.model.encode_text(encoded_text)
```
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-ffb3e976c349> in <module>()
      1 with torch.no_grad():
      2     encoded_text = tokenizer("test", return_tensors='pt')
----> 3     model.model.encode_text(encoded_text)

/content/train-CLIP/models/model.py in encode_text(self, text)
    339
    340     def encode_text(self, text):
--> 341         x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]
    342
    343         x = x + self.positional_embedding.type(self.dtype)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py in forward(self, input)
    158         return F.embedding(
    159             input, self.weight, self.padding_idx, self.max_norm,
--> 160             self.norm_type, self.scale_grad_by_freq, self.sparse)
    161
    162     def extra_repr(self) -> str:

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2041     # remove once script supports set_grad_enabled
   2042     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2043     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2044
   2045

TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not BatchEncoding
```
The script `train_finetune.py` trains the second class in `wrapper.py`, called `CustomCLIPWrapper`. This class already has a special embedding function attached to it. You don't need to call `model.model.encode_text`, as HF models are called differently. The function you are calling is the original CLIP model's forward, which fails because of a different tokenization protocol. If you call `model.encode_text` with the right tokenizer, all should be good!
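(To make the distinction concrete, a small sketch of the two call paths, using the `tokenizer` and `model` built earlier in the thread:)

```python
import clip

# Path 1: CustomCLIPWrapper.encode_text expects a Hugging Face BatchEncoding
# (a dict of input_ids / attention_mask tensors):
hf_tokens = tokenizer(["test"], return_tensors='pt')
features = model.encode_text(hf_tokens)

# Path 2: model.model.encode_text is the original CLIP text tower; it expects
# a plain LongTensor of token ids from clip.tokenize, shape [1, 77] -- and it
# is not the text encoder that was fine-tuned here:
clip_tokens = clip.tokenize(["test"])
```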
I did not notice that there are two functions called `encode_text` and assumed there was only one.
```python
with torch.no_grad():
    encoded_text = tokenizer(["test"], return_tensors='pt').to(device)
    result = model.encode_text(encoded_text)
    print(result)
```

```
tensor([[-7.9948e-01,  3.2338e-01,  1.7573e-01, -4.5223e-01, -2.1422e-01,
          3.6682e-02, -8.9392e-02, -1.0695e+00, -3.5576e-01,  1.2232e+00,
        ...
```
It seems to work, thank you.
Happy to help!
I am trying to use a resnet50 model that I created with this repo, but I can't encode text. Printing `x` before `self.transformer(x)` results in `torch.Size([77, 1, 512])`. The input shape `torch.Size([1, 77])` does match the original CLIP code, and the model loaded with clip seems to work without major problems. Not sure what I am doing wrong, since encoding images does seem to work fine with this repo.
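(For context, the `[77, 1, 512]` shape before `self.transformer(x)` is expected: the original CLIP `encode_text` embeds tokens to `[batch_size, n_ctx, d_model]` and then permutes to sequence-first (`NLD -> LND`) before the transformer. A minimal sketch of that shape flow:)

```python
import torch

batch_size, n_ctx, d_model = 1, 77, 512

x = torch.randn(batch_size, n_ctx, d_model)  # after token_embedding + positional_embedding
x = x.permute(1, 0, 2)                       # NLD -> LND, as in CLIP's encode_text
print(x.shape)                               # torch.Size([77, 1, 512]) -- the shape printed above
```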