chengchingwen / Transformers.jl

Julia Implementation of Transformer models
MIT License

add HGFCLIPTextEmbeddings #120

Closed harishanand95 closed 3 months ago

harishanand95 commented 1 year ago

Hey, I would like to add CLIPTextEmbeddings, and later CLIPTextModel, to Transformers.jl. While doing so, I want to learn about Transformers.jl's implementation details and also learn how to compare PyTorch model results with Julia results.

Based on https://github.com/huggingface/transformers/blob/f68796bd603ef60173e093f50a1ecbb74bc2ba6b/src/transformers/models/clip/modeling_clip.py#L200

I have added a test_clip2.jl that mirrors the Python code below.

from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

text_input = tokenizer(
    ["a photo of an astronaut riding a horse on mars", ""],
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
# text_encoder(text_input.input_ids)
out = text_encoder.text_model.embeddings(input_ids=text_input.input_ids, position_ids=None)

print(out.shape) # 2, 77, 768
print(out[0, 0, :])

Please suggest corrections, improvements, and better approaches to doing this. Thanks!

chengchingwen commented 1 year ago

There is an example/HuggingFaceValidation that uses PyCall to compare the results, but it only works with a complete model, not a single layer. If you want to make sure the embedding result is correct, you can follow what that code does and compare the results. Personally, I would prefer having both CLIPTextEmbeddings and CLIPTextModel in a single PR so the model can be tested directly.

For now (0.1.x releases), I won't add CI tests for huggingface models. You can open an example/CLIP folder and put test_clip2.jl there.
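In the spirit of example/HuggingFaceValidation, the comparison boils down to running the same input through both implementations and checking the outputs elementwise within a tolerance. A minimal sketch of that idea in Python (the actual example drives PyTorch from Julia via PyCall; the random data and the tolerance values here are assumptions for illustration):

```python
import numpy as np

# Hypothetical reference output from the PyTorch model (random data stands in
# for text_encoder.text_model.embeddings(...) converted to a numpy array).
rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 77, 768)).astype(np.float32)

# Hypothetical output of the Julia reimplementation of the same layer.
# Float32 round-tripping and op-order differences introduce tiny deviations.
candidate = reference + rng.standard_normal(reference.shape).astype(np.float32) * 1e-6

# Elementwise comparison with a tolerance suited to Float32 arithmetic
# (the atol/rtol values are assumptions, not the repo's choice).
assert reference.shape == candidate.shape
max_abs_err = np.abs(reference - candidate).max()
print(f"max abs error: {max_abs_err:.2e}")
assert np.allclose(reference, candidate, atol=1e-4, rtol=1e-4)
```

An exact equality check would fail even for a correct port, so a tolerance comparison like this is the usual way to validate a layer against its PyTorch counterpart.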

harishanand95 commented 1 year ago

Yeah, I will add CLIPTextModel too in this PR, so we can test it completely. Thanks for taking a look!

harishanand95 commented 1 year ago

Hi @chengchingwen, I have marked the places in model.jl with TODO where I'm not sure how to proceed. I have added causal attention masks, but I think they need to be combined with the attention masks that are passed as inputs. The model loads fine, but the results are slightly off; I think it's due to how the layer normalization code in Flux and PyTorch differs. Let me know what you think. Thanks!
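On the TODO about combining masks: in the HuggingFace CLIP code referenced above, both the causal mask and the padding-derived attention mask are expressed as additive biases (0 for keep, a large negative number for drop) and summed onto the raw attention scores before the softmax. A sketch of that combination in NumPy (the example input, the bias value, and the zero scores are illustrative assumptions; the Julia code would mirror this with Transformers.jl's own mask handling):

```python
import numpy as np

seq_len = 5
neg_inf = -1e9  # large negative additive bias (the exact value is an assumption)

# Causal mask: position i may only attend to positions <= i.
causal = np.triu(np.full((seq_len, seq_len), neg_inf), k=1)

# Padding mask from the input: 1 = real token, 0 = padding (example input).
attention_mask = np.array([1, 1, 1, 0, 0])
padding = np.where(attention_mask == 0, neg_inf, 0.0)  # shape (seq_len,)

# Combine by addition, broadcasting the padding bias over the key dimension,
# then add the result to the raw attention scores before softmax.
combined = causal + padding[None, :]           # (seq_len, seq_len)
scores = np.zeros((seq_len, seq_len))          # stand-in for q @ k.T / sqrt(d)
masked = scores + combined
probs = np.exp(masked - masked.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)

print(probs[0])  # row 0 attends only to position 0
```

Because both masks are additive, combining them is just a sum; a query position ends up attending only to keys that are allowed by the causal mask and are not padding.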
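On the slightly-off results: one known source of small discrepancies between the two ecosystems (whether it applies here depends on the versions in use, so this is an assumption) is where epsilon enters the normalization. PyTorch's layer norm computes (x − μ) / sqrt(var + ε), while some Flux versions' normalise adds epsilon to the standard deviation instead, (x − μ) / (std + ε). A small demonstration that the two formulas differ numerically:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 8)  # toy activations
eps = 1e-5

mu = x.mean()
var = x.var()

# PyTorch-style layer norm: epsilon added to the variance, inside the sqrt.
torch_style = (x - mu) / np.sqrt(var + eps)

# Flux.normalise-style (in some 0.x versions -- an assumption about the
# version in use): epsilon added to the standard deviation, outside the sqrt.
flux_style = (x - mu) / (np.sqrt(var) + eps)

diff = np.abs(torch_style - flux_style).max()
print(f"max difference: {diff:.2e}")  # small but nonzero
```

Differences of this magnitude per layer can compound across a deep stack, which matches "slightly off" final outputs; checking the two layer-norm implementations side by side on the same input would confirm or rule this out.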