utterances-bot commented 1 year ago

The Illustrated Image Captioning using transformers - Ankur NLP Enthusiast

https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/

Ankur3107 commented 1 year ago

To train on a large dataset, you can use a torch Dataset that loads and preprocesses each example on the fly.

import torch
from PIL import Image
from transformers import Seq2SeqTrainer, default_data_collator

# ds, tokenizer, feature_extractor, model, training_args and compute_metrics
# are the objects created earlier in the blog post.

class ImageCaptioningDataset(torch.utils.data.Dataset):
    def __init__(self, ds, ds_type, max_target_length):
        self.ds = ds
        self.max_target_length = max_target_length
        self.ds_type = ds_type

    def __getitem__(self, idx):
        # Fetch a single row instead of materializing a whole column on every call.
        example = self.ds[self.ds_type][idx]
        model_inputs = dict()
        model_inputs['labels'] = self.tokenization_fn(example['caption'], self.max_target_length)
        model_inputs['pixel_values'] = self.feature_extraction_fn(example['image_path'])
        return model_inputs

    def __len__(self):
        return len(self.ds[self.ds_type])

    # text preprocessing step
    def tokenization_fn(self, caption, max_target_length):
        """Tokenize one caption, padded and truncated to max_target_length."""
        labels = tokenizer(caption,
                           padding="max_length",
                           truncation=True,
                           max_length=max_target_length).input_ids

        return labels

    # image preprocessing step
    def feature_extraction_fn(self, image_path):
        """
        Run feature extraction on one image.
        An exception is raised if `Image.open()` cannot read the file.
        """
        # Convert to RGB so that grayscale images also have 3 channels.
        image = Image.open(image_path).convert("RGB")

        encoder_inputs = feature_extractor(images=image, return_tensors="np")

        return encoder_inputs.pixel_values[0]

train_ds = ImageCaptioningDataset(ds, 'train', 64)
eval_ds = ImageCaptioningDataset(ds, 'validation', 64)

# instantiate trainer
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=feature_extractor,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=default_data_collator,
)

yesidc commented 1 year ago

Hi, thanks for the tutorial! I am trying to run your code, but it throws an error that maybe you could help me with. The source of the error seems to be the image feature extractor: it occurs when normalizing the images (image_utils.py, line 237). It happens when processing image_id 573223, which strangely has dimensions (224,224), while the rest of the images (processed up to the point where the error occurs) have dimensions (3,224,224). Error:

ValueError: operands could not be broadcast together with shapes (224,224) (3,)

Thanks in advance!

pleomax0730 commented 1 year ago

@yesidc I ran the code in Colab without any problems.

yesidc commented 1 year ago

I finally managed to run it. For one thing, the preprocessing function provided by the transformers API throws an error when processing black and white images (it always expects 3-channel images). I had to override this function and now the code works (I am using transformers 4.25.1).
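
An alternative to overriding the preprocessing function is to force every image into RGB before it reaches the feature extractor. Below is a minimal sketch, assuming Pillow is installed. The helper name load_rgb is made up for illustration and is not part of transformers.

```python
from PIL import Image


def load_rgb(image_path):
    """Open an image and convert it to 3-channel RGB.

    Grayscale ("L") and palette ("P") images become RGB, so a downstream
    per-channel normalization always sees three channels.
    """
    with Image.open(image_path) as image:
        # convert() loads the pixel data and returns a new image, so the
        # result stays usable after the file handle is closed.
        return image.convert("RGB")
```

In the dataset class at the top of this thread, the call to Image.open(image_path) could be swapped for load_rgb(image_path).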

Aaryan562 commented 1 year ago

Unrecognized feature extractor in /content/image-captioning-output. Should have a feature_extractor_type key in its preprocessor_config.json of config.json, or one of the following model_type keys in its config.json: audio-spectrogram-transformer, beit, chinese_clip, clip, clipseg, conditional_detr, convnext, cvt, data2vec-audio, data2vec-vision, deformable_detr, deit, detr, dinat, donut-swin, dpt, flava, glpn, groupvit, hubert, imagegpt, layoutlmv2, layoutlmv3, levit, maskformer, mctct, mobilenet_v1, mobilenet_v2, mobilevit, nat, owlvit, perceiver, poolformer, regnet, resnet, segformer, sew, sew-d, speech_to_text, swin, swinv2, table-transformer, timesformer, unispeech, unispeech-sat, van, videomae, vilt, vit, vit_mae, vit_msn, wav2vec2, wav2vec2-conformer, wavlm, whisper, xclip, yolos

I am getting this error. Why is that?

Aaryan562 commented 1 year ago

The error occurs at the inference stage, when I try to load the pipeline.

DeependraParichha1004 commented 1 year ago

Hi Ankur, what if we want multiple captions for the same image?

newbietuan commented 1 year ago

Hi Ankur, I want to do something between the encoder and the decoder, so I defined the model as follows:

import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel, ViTModel

class caption_model(nn.Module):
    def __init__(self, args):
        super(caption_model, self).__init__()
        self.args = args
        self.gpt2_type = self.args.gpt2_type
        self.config = GPT2Config.from_pretrained('./gpt/' + self.gpt2_type)
        self.config.add_cross_attention = True
        # self.config.is_decoder = True
        self.config.is_encoder_decoder = True
        self.encoder = ViTModel.from_pretrained('./vit', local_files_only=True)
        self.decoder = GPT2LMHeadModel.from_pretrained('./gpt/' + self.gpt2_type, config=self.config)

    def forward(self, pixel_values, input_ids):
        image_feat = self.encoder(pixel_values)
        encoder_outputs = image_feat.last_hidden_state
        # encoder_outputs = do something
        output = self.decoder(input_ids=input_ids, encoder_hidden_states=encoder_outputs)

        return output.logits

However, I ran into some trouble at the inference stage. It seems I should set is_encoder_decoder = True to use "class BeamSearchEncoderDecoderOutput(ModelOutput):" in generation_utils.py, but then I get "torch.nn.modules.module.ModuleAttributeError: 'GPT2LMHeadModel' object has no attribute 'get_encoder'".
Indeed, VisionEncoderDecoderModel implements ViT-GPT2 for image captioning, but it is integrated, so I can't do anything between the encoder and the decoder. When I take it apart, I can't complete the beam search stage, and rewriting beam search myself seems impossible. Do you have any suggestions, or how should I set the parameters so that I can call generate() directly? Thank you very much.

Ankur3107 commented 1 year ago

@newbietuan I think you should ask this at Hugging Face: https://github.com/huggingface/transformers/issues. They will give you a better response. I will also try, and if I find anything I will update you here.

Ankur3107 commented 1 year ago

@Aaryan562 There might be a transformers version issue.

Ankur3107 commented 1 year ago

@DeependraParichha1004

You may have to use a combination of num_return_sequences, num_beams, penalty_alpha, top_k, top_p, etc.

For example:

from transformers import pipeline

image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

generate_kwargs = {
    "num_return_sequences": 3,
    "num_beams": 3,
}
image_to_text("https://ankur3107.github.io/assets/images/image-captioning-example.png", generate_kwargs=generate_kwargs)
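
Beam search and sampling take different arguments, so a small helper can keep the two apart. This is only a sketch: the function name caption_generate_kwargs and its default values are made up for illustration, while num_return_sequences, num_beams, do_sample, top_k and top_p are standard generate() arguments. With beam search, num_return_sequences cannot exceed num_beams, so the helper checks that up front.

```python
def caption_generate_kwargs(num_captions, sample=False, num_beams=None, top_k=50, top_p=0.95):
    """Build generate_kwargs that ask for `num_captions` captions per image."""
    if num_captions < 1:
        raise ValueError("num_captions must be at least 1")
    if sample:
        # Sampling: top_k / top_p restrict the tokens that can be drawn.
        return {
            "do_sample": True,
            "top_k": top_k,
            "top_p": top_p,
            "num_return_sequences": num_captions,
        }
    # Beam search: at most `num_beams` sequences can be returned.
    if num_beams is None:
        num_beams = num_captions
    if num_captions > num_beams:
        raise ValueError("num_return_sequences cannot exceed num_beams")
    return {"num_beams": num_beams, "num_return_sequences": num_captions}


# Hypothetical usage with the pipeline from the snippet above:
# image_to_text(url, generate_kwargs=caption_generate_kwargs(3, sample=True))
```

Sampling tends to give more varied captions than beam search, at the cost of less predictable quality.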

newbietuan commented 1 year ago

> @newbietuan I think you should ask this at Hugging Face: https://github.com/huggingface/transformers/issues. They will give you a better response. I will also try, and if I find anything I will update you here.

Thank you very much!

Aaryan562 commented 1 year ago

> @Aaryan562 There might be a transformers version issue.

Are you sure that copy-pasting each and every line would not give me any errors? Or are there some changes needed?

Aaryan562 commented 1 year ago

> @Aaryan562 There might be a transformers version issue.

I also checked config.json, and it has the model_type key set to 'vit', but it still raises a ValueError.

Aaryan562 commented 1 year ago

> @Aaryan562 There might be a transformers version issue.

Can you also tell me how to resolve the version issue, please?

DeependraParichha1004 commented 1 year ago

Got it, @Ankur3107. Thank you for the explanation.

TheTahaaa commented 1 year ago

> > @Aaryan562 There might be a transformers version issue.
>
> Can you also tell me how to resolve the version issue, please?

Hi, I also have this issue. Have you found a solution?

Aaryan562 commented 1 year ago

> > > @Aaryan562 There might be a transformers version issue.
> >
> > Can you also tell me how to resolve the version issue, please?
>
> Hi, I also have this issue. Have you found a solution?

No, I have not. Are you also getting the error at the inference stage?

AnhaarHussain commented 1 year ago

> @Aaryan562 There might be a transformers version issue.

So, which version of transformers should we use?

mohnish-7 commented 1 year ago

How do I load a custom local dataset using load_data()? I have downloaded the Flickr30k dataset, which has images and captions in separate folders.

katiele47 commented 1 year ago

Hi, I also got the same error during inference stage ValueError: Unrecognized feature extractor in ./instagram-captioning-output. Should have a `feature_extractor_type` key in its preprocessor_config.json of config.json, or one of the following `model_type` keys in its config.json: audio-spectrogram-transformer, beit, chinese_clip, clap, clip, clipseg, conditional_detr, convnext, cvt, data2vec-audio, data2vec-vision, deformable_detr, deit, detr, dinat, donut-swin, dpt, flava, glpn, groupvit, hubert, imagegpt, layoutlmv2, layoutlmv3, levit, maskformer, mctct, mobilenet_v1, mobilenet_v2, mobilevit, nat, owlvit, perceiver, poolformer, regnet, resnet, segformer, sew, sew-d, speech_to_text, speecht5, swin, swinv2, table-transformer, timesformer, tvlt, unispeech, unispeech-sat, van, videomae, vilt, vit, vit_mae, vit_msn, wav2vec2, wav2vec2-conformer, wavlm, whisper, xclip, yolos

katiele47 commented 1 year ago

I resolved it by downgrading transformers (!pip install transformers===4.28.0), using Python 3.9, and manually changing the existing model_type: "vision-encoder-decoder" to model_type: "vit" in the config.json file. Not totally sure if this is correct, but it worked.

nada-dot commented 1 year ago

Hello, I get an error saying that model is not defined:

in <cell line: 4>:5
NameError: name 'model' is not defined

eduardofarina commented 1 year ago

Hi @Ankur, thanks for this amazing work. Is there a way to extract the probability for the predicted tokens in inference? Best,

kohstall commented 1 year ago

@katiele47 this seems to solve it :)

kohstall commented 1 year ago

I receive [{'generated_text': 'A train coming down the tracks in the city.'}] no matter the image. What am I missing? Any parameters for training I need to adjust? Many thanks!

dramab commented 7 months ago

> I resolved it by downgrading transformers (!pip install transformers===4.28.0), using Python 3.9, and manually changing the existing model_type: "vision-encoder-decoder" to model_type: "vit" in the config.json file. Not totally sure if this is correct, but it worked.

I ran into the same problem. With your fix, I instead get the message 'You are using a model of type vit to instantiate a model of type vision-encoder-decoder. This is not supported for all configurations of models and can yield errors.'

dramab commented 7 months ago

I found a solution for the 'ValueError: Unrecognized feature extractor in ./instagram-captioning-output' error quoted above: just add the entry "feature_extractor_type": "ViTFeatureExtractor" to the preprocessor_config.json file.
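
That edit can also be scripted. Below is a minimal sketch, assuming the output directory contains a preprocessor_config.json written by the training run. The function name and its default value are illustrative, and whether the key is needed at all depends on the installed transformers version.

```python
import json
from pathlib import Path


def add_feature_extractor_type(model_dir, extractor_type="ViTFeatureExtractor"):
    """Add a "feature_extractor_type" key to preprocessor_config.json if it is missing."""
    config_path = Path(model_dir) / "preprocessor_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Leave an existing value untouched so that re-running is harmless.
    config.setdefault("feature_extractor_type", extractor_type)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example (directory name taken from the error message in this thread):
# add_feature_extractor_type("./instagram-captioning-output")
```

Unlike changing model_type in config.json, this leaves the model configuration itself unchanged.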

AceMcAwesome77 commented 7 months ago

dramab's solution "you just add the "feature_extractor_type": "ViTFeatureExtractor" sentence into preprocessor_config.json file" worked for me to avoid the error, however when I run image_captioner("sample_image.png") as the last step I just get a warning and no other output. What is the expected output of running this line? I just get "UserWarning: Using the model-agnostic default max_length (=20) to control the generation length. We recommend setting max_new_tokens to control the maximum length of the generation."

technoayan7 commented 5 months ago

@pleomax0730 could you please share your Colab notebook?

rohan9446 commented 5 months ago

@Aaryan562 did you find the solution for the error? I am also getting the same error.

AnhaarHussain commented 5 months ago

Hello Ankur,

Apologies for the delayed response; I couldn't resolve the issue despite attempting various solutions. Ultimately, I resorted to using a different model, though it too didn't achieve 100% accuracy. Nevertheless, we successfully incorporated it into our Final Year Project.

On Sun, Feb 25, 2024, 2:40 PM rohan9446 wrote:

> @Aaryan562 did you find the solution for the error? I am also getting the same error.

arielshaulov commented 3 months ago

https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/

Hi, did you manage to solve that? I am trying to solve the exact same problem.

is124 commented 3 months ago

Hi @Ankur, if I want a certain type of caption, can I provide a prompt to the model? I've been trying, but I am not able to get the desired results.

sswaiting commented 6 days ago

> dramab's solution "you just add the "feature_extractor_type": "ViTFeatureExtractor" sentence into preprocessor_config.json file" worked for me to avoid the error, however when I run image_captioner("sample_image.png") as the last step I just get a warning and no other output. What is the expected output of running this line? I just get "UserWarning: Using the model-agnostic default max_length (=20) to control the generation length. We recommend setting max_new_tokens to control the maximum length of the generation."

You can store the result in a variable and then print it:

result = image_captioner("sample_image.png")
print(result)
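
The pipeline returns a list with one dictionary per generated caption, in the form reported earlier in this thread ([{'generated_text': '...'}]). Below is a small sketch for pulling out just the text. The helper name extract_captions is made up, and passing max_new_tokens through generate_kwargs in the commented call is an assumption about the installed pipeline version.

```python
def extract_captions(pipeline_output):
    """Return the caption strings from an image-to-text pipeline result."""
    return [item["generated_text"].strip() for item in pipeline_output]


# Output format as reported earlier in this thread.
sample_output = [{"generated_text": "A train coming down the tracks in the city."}]
print(extract_captions(sample_output))

# With a real pipeline (assumed to be named image_captioner, as in the comment above):
# result = image_captioner("sample_image.png", generate_kwargs={"max_new_tokens": 32})
# print(extract_captions(result))
```

Setting max_new_tokens explicitly is what the UserWarning quoted above recommends.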

Ankur3107 / ankur3107.github.io

blogs/the-illustrated-image-captioning-using-transformers/ #2
