Open utterances-bot opened 1 year ago
import torch
from PIL import Image

class ImageCapatioingDataset(torch.utils.data.Dataset):
    def __init__(self, ds, ds_type, max_target_length):
        self.ds = ds
        self.max_target_length = max_target_length
        self.ds_type = ds_type

    def __getitem__(self, idx):
        image_path = self.ds[self.ds_type]['image_path'][idx]
        caption = self.ds[self.ds_type]['caption'][idx]
        model_inputs = dict()
        model_inputs['labels'] = self.tokenization_fn(caption, self.max_target_length)
        model_inputs['pixel_values'] = self.feature_extraction_fn(image_path)
        return model_inputs

    def __len__(self):
        return len(self.ds[self.ds_type])

    # text preprocessing step
    def tokenization_fn(self, caption, max_target_length):
        """Run tokenization on caption."""
        labels = tokenizer(caption,
                           padding="max_length",
                           max_length=max_target_length).input_ids
        return labels

    # image preprocessing step
    def feature_extraction_fn(self, image_path):
        """
        Run feature extraction on images
        If `check_image` is `True`, the examples that fails during `Image.open()` will be caught and discarded.
        Otherwise, an exception will be thrown.
        """
        image = Image.open(image_path)
        encoder_inputs = feature_extractor(images=image, return_tensors="np")
        return encoder_inputs.pixel_values[0]

train_ds = ImageCapatioingDataset(ds, 'train', 64)
eval_ds = ImageCapatioingDataset(ds, 'validation', 64)

# instantiate trainer
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=feature_extractor,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=default_data_collator,
)
Hi, thanks for the tutorial!
I am trying to run your code, but it throws an error that maybe you could help me with. The source of the error seems to be the image feature extractor: it occurs when normalizing the images (image_utils.py, line 237). The error occurs when processing image_id 573223, which strangely has dimensions (224, 224), while the rest of the images (processed up to the point where the error occurs) have dimensions (3, 224, 224).
Error:
ValueError: operands could not be broadcast together with shapes (224,224) (3,)
Thanks in advance!
@yesidc I ran the code in colab without problem.
I finally managed to run it. For one thing, the preprocessing function provided by the transformers API throws an error when processing black-and-white images (it always expects 3-channel images). I had to override this function, and now the code works (I am using transformers 4.25.1).
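An alternative to overriding the preprocessing function is to convert every image to RGB before handing it to the feature extractor, so that single-channel images gain the three channels the normalization step expects. A minimal sketch using Pillow; `load_rgb_image` is a hypothetical helper name, not part of the tutorial or of transformers:

```python
from PIL import Image


def load_rgb_image(image_or_path):
    """Return a 3-channel RGB PIL image.

    Accepts a file path or an already opened PIL image.
    `load_rgb_image` is a hypothetical helper, not a transformers API.
    """
    if isinstance(image_or_path, Image.Image):
        image = image_or_path
    else:
        image = Image.open(image_or_path)
    # Grayscale ("L"), palette ("P") and RGBA images are all mapped to RGB,
    # so per-channel mean/std normalization sees three channels.
    if image.mode != "RGB":
        image = image.convert("RGB")
    return image


# Demonstration with an in-memory grayscale image (no file needed).
gray = Image.new("L", (224, 224), color=128)
rgb = load_rgb_image(gray)
print(gray.mode, len(gray.getbands()))
print(rgb.mode, len(rgb.getbands()))
```

In the dataset class posted above, a call like this would take the place of the plain `Image.open(image_path)` in `feature_extraction_fn`.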
Unrecognized feature extractor in /content/image-captioning-output. Should have a feature_extractor_type key in its preprocessor_config.json of config.json, or one of the following model_type keys in its config.json: audio-spectrogram-transformer, beit, chinese_clip, clip, clipseg, conditional_detr, convnext, cvt, data2vec-audio, data2vec-vision, deformable_detr, deit, detr, dinat, donut-swin, dpt, flava, glpn, groupvit, hubert, imagegpt, layoutlmv2, layoutlmv3, levit, maskformer, mctct, mobilenet_v1, mobilenet_v2, mobilevit, nat, owlvit, perceiver, poolformer, regnet, resnet, segformer, sew, sew-d, speech_to_text, swin, swinv2, table-transformer, timesformer, unispeech, unispeech-sat, van, videomae, vilt, vit, vit_mae, vit_msn, wav2vec2, wav2vec2-conformer, wavlm, whisper, xclip, yolos
I am getting this error. Why is that?
It occurs at the inference stage, when I try to load the pipeline.
Hi Ankur, what if we want multiple captions of the same image?
Hi Ankur, I want to do something between the encoder and decoder, so I defined the model as follows:
class caption_model(nn.Module):
    def __init__(self, args):
        super(caption_model, self).__init__()
        self.args = args
        self.gpt2_type = self.args.gpt2_type
        self.config = GPT2Config.from_pretrained('./gpt/' + self.gpt2_type)
        self.config.add_cross_attention = True
        # self.config.is_decoder = True
        self.config.is_encoder_decoder = True
        self.encoder = ViTModel.from_pretrained('./vit', local_files_only=True)
        self.decoder = GPT2LMHeadModel.from_pretrained('./gpt/'+self.gpt2_type, config=self.config)

    def forward(self, pixel_values, input_ids):
        image_feat = self.encoder(pixel_values)
        encoder_outputs = image_feat.last_hidden_state
        # encoder_outputs = do something
        output = self.decoder(input_ids=input_ids, encoder_hidden_states=encoder_outputs)
        return output.logits
However, I run into trouble at the inference stage. It seems I should set is_encoder_decoder = True in order to use "class BeamSearchEncoderDecoderOutput(ModelOutput):" from generation_utils.py, but then I get "torch.nn.modules.module.ModuleAttributeError: 'GPT2LMHeadModel' object has no attribute 'get_encoder'".
Indeed, VisionEncoderDecoderModel implements ViT-GPT2 for image captioning, but it is integrated, so I can't do anything between the encoder and decoder. When I take it apart, I can't complete the beam_search stage, and rewriting beam_search seems impossible. Do you have any suggestions, or how should I set the parameters so that I can call generate() directly?
Thank you very much.
@newbietuan I think you should ask this on the Hugging Face issue tracker: https://github.com/huggingface/transformers/issues. They will give you a better response. I will look into it too, and if I find anything I will update you here.
@Aaryan562 This might be a transformers version issue.
@DeependraParichha1004 You may have to use a combination of num_return_sequences, num_beams, penalty_alpha, top_k, top_p, etc. For example:
from transformers import pipeline
image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
generate_kwargs = {
    "num_return_sequences": 3,
    "num_beams": 3,
}
image_to_text("https://ankur3107.github.io/assets/images/image-captioning-example.png", generate_kwargs=generate_kwargs)
Thank you very much!
Are you sure that copy-pasting each and every line would not give me any errors? Or are there some changes?
I also checked config.json, and it has the model_type key set to 'vit', yet it still gives the ValueError.
Can you also tell me how to resolve the version issue, please?
Got it @Ankur3107, thank you for the explanation.
Hi, I also have this issue. Have you found a solution?
No, I have not. Are you also getting the error at the inference stage?
So, which version of transformers should we use?
How do I load a custom local dataset using load_data()? I have downloaded the flickr30k dataset, which has images and captions in separate folders.
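One possible starting point, sketched under stated assumptions: the dataset class earlier in this thread only needs an `image_path` and a `caption` per example, so a local dataset can be turned into a list of such records first. The folder layout here is hypothetical (one `.txt` caption file per image, sharing the image's file stem, one caption per line) and may not match how your copy of flickr30k is organized; `build_records` is an illustrative helper, not a library function.

```python
import tempfile
from pathlib import Path


def build_records(image_dir, caption_dir, image_suffix=".jpg"):
    """Pair each image with the captions stored in a same-named .txt file.

    Assumed (hypothetical) layout:
        image_dir/0001.jpg   <->   caption_dir/0001.txt (one caption per line)
    Images without a caption file are skipped.
    """
    records = []
    for image_path in sorted(Path(image_dir).glob(f"*{image_suffix}")):
        caption_path = Path(caption_dir) / f"{image_path.stem}.txt"
        if not caption_path.exists():
            continue
        for line in caption_path.read_text(encoding="utf-8").splitlines():
            caption = line.strip()
            if caption:
                records.append({"image_path": str(image_path), "caption": caption})
    return records


# Tiny self-contained demonstration with placeholder files.
with tempfile.TemporaryDirectory() as root:
    image_dir = Path(root) / "images"
    caption_dir = Path(root) / "captions"
    image_dir.mkdir()
    caption_dir.mkdir()
    (image_dir / "0001.jpg").write_bytes(b"")  # empty placeholder, not a real image
    (image_dir / "0002.jpg").write_bytes(b"")  # has no caption file, so it is skipped
    (caption_dir / "0001.txt").write_text(
        "A dog runs.\nA brown dog on grass.\n", encoding="utf-8"
    )
    records = build_records(image_dir, caption_dir)
    print(len(records))
```

A list of records like this can then be split into train and validation parts and wrapped in whatever dataset object the training code expects.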
Hi, I also got the same error during inference stage ValueError: Unrecognized feature extractor in ./instagram-captioning-output. Should have a `feature_extractor_type` key in its preprocessor_config.json of config.json, or one of the following `model_type` keys in its config.json: audio-spectrogram-transformer, beit, chinese_clip, clap, clip, clipseg, conditional_detr, convnext, cvt, data2vec-audio, data2vec-vision, deformable_detr, deit, detr, dinat, donut-swin, dpt, flava, glpn, groupvit, hubert, imagegpt, layoutlmv2, layoutlmv3, levit, maskformer, mctct, mobilenet_v1, mobilenet_v2, mobilevit, nat, owlvit, perceiver, poolformer, regnet, resnet, segformer, sew, sew-d, speech_to_text, speecht5, swin, swinv2, table-transformer, timesformer, tvlt, unispeech, unispeech-sat, van, videomae, vilt, vit, vit_mae, vit_msn, wav2vec2, wav2vec2-conformer, wavlm, whisper, xclip, yolos
I resolved it by downgrading transformers with !pip install transformers===4.28.0, using Python 3.9, and manually changing the existing model_type: "vision-encoder-decoder" to model_type: "vit" in the config.json file. Not totally sure if this is correct, but it worked.
Hello, I have an error: model is not defined.
Traceback (most recent call last):
in <cell line: 4>:5
NameError: name 'model' is not defined
Hi @Ankur, thanks for this amazing work. Is there a way to extract the probability for the predicted tokens in inference? Best,
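One possible approach, sketched under assumptions: `generate()` in transformers accepts `output_scores=True` together with `return_dict_in_generate=True`, in which case the returned object carries per-step scores over the vocabulary alongside the generated sequences. Turning those scores into a probability for each chosen token is a softmax followed by a lookup. The pure-Python sketch below shows only that arithmetic on made-up numbers; `token_probabilities` is a hypothetical helper, and the sketch assumes a single sequence decoded greedily or by sampling (beam search needs extra bookkeeping).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_probabilities(step_scores, token_ids):
    """Probability of the chosen token at each generation step.

    step_scores: one list of vocabulary scores per generated token.
    token_ids:   the token id that was actually generated at each step.
    """
    if len(step_scores) != len(token_ids):
        raise ValueError("need exactly one score vector per generated token")
    return [softmax(scores)[tid] for scores, tid in zip(step_scores, token_ids)]


# Toy example: a 4-token vocabulary and 3 generated tokens (made-up values).
step_scores = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 3.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
token_ids = [0, 2, 3]
probs = token_probabilities(step_scores, token_ids)
print([round(p, 3) for p in probs])
```

With real model outputs, each entry of `step_scores` would come from one element of the returned scores, and `token_ids` from the generated sequence with any prompt or start tokens stripped.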
@katiele47 this seems to solve it :)
I receive [{'generated_text': 'A train coming down the tracks in the city.'}] no matter the image. What am I missing? Any parameters for training I need to adjust? Many thanks!
I ran into the same problem. After applying your fix (changing model_type to "vit" in config.json), I now get this message instead: 'You are using a model of type vit to instantiate a model of type vision-encoder-decoder. This is not supported for all configurations of models and can yield errors.'
I found the solution to 'ValueError: Unrecognized feature extractor in ./instagram-captioning-output. Should have a feature_extractor_type key in its preprocessor_config.json of config.json, or one of the following model_type keys in its config.json: audio-spectrogram-transformer, beit, chinese_clip, clap, clip, clipseg, conditional_detr, convnext, cvt, data2vec-audio, data2vec-vision, deformable_detr, deit, detr, dinat, donut-swin, dpt, flava, glpn, groupvit, hubert, imagegpt, layoutlmv2, layoutlmv3, levit, maskformer, mctct, mobilenet_v1, mobilenet_v2, mobilevit, nat, owlvit, perceiver, poolformer, regnet, resnet, segformer, sew, sew-d, speech_to_text, speecht5, swin, swinv2, table-transformer, timesformer, tvlt, unispeech, unispeech-sat, van, videomae, vilt, vit, vit_mae, vit_msn, wav2vec2, wav2vec2-conformer, wavlm, whisper, xclip, yolos'.
Just add the "feature_extractor_type": "ViTFeatureExtractor" entry to the preprocessor_config.json file.
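The edit described above can also be scripted instead of done by hand. A minimal sketch using only the standard library; `add_feature_extractor_type` is a hypothetical helper, and the demonstration writes to a temporary directory with made-up config contents rather than a real model output folder:

```python
import json
import tempfile
from pathlib import Path


def add_feature_extractor_type(model_dir, extractor_type="ViTFeatureExtractor"):
    """Add a feature_extractor_type entry to preprocessor_config.json.

    An existing value for the key is left untouched.
    """
    config_path = Path(model_dir) / "preprocessor_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config.setdefault("feature_extractor_type", extractor_type)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demonstration on a throwaway directory with illustrative contents.
with tempfile.TemporaryDirectory() as model_dir:
    path = Path(model_dir) / "preprocessor_config.json"
    path.write_text(json.dumps({"do_resize": True, "size": 224}), encoding="utf-8")
    patched = add_feature_extractor_type(model_dir)
    reloaded = json.loads(path.read_text(encoding="utf-8"))
    print(reloaded["feature_extractor_type"])
```

For a real run, `model_dir` would be the training output directory, for example the ./instagram-captioning-output folder named in the error message.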
dramab's solution "you just add the "feature_extractor_type": "ViTFeatureExtractor" sentence into preprocessor_config.json file" worked for me to avoid the error. However, when I run image_captioner("sample_image.png") as the last step, I just get a warning and no other output. What is the expected output of running this line? I just get "UserWarning: Using the model-agnostic default max_length (=20) to control the generation length. We recommend setting max_new_tokens to control the maximum length of the generation."
@pleomax0730 can you provide me your colab please!
@Aaryan562 did you find the solution for the error? I am also getting the same error.
Hello Ankur,
Apologies for the delayed response; I couldn't resolve the issue despite attempting various solutions. Ultimately, I resorted to using a different model, though it too didn't achieve 100% accuracy. Nevertheless, we successfully incorporated it into our Final Year Project.
https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/
Hi, did you succeed in solving that? I am trying to solve the exact same problem.
Hi @Ankur, if I want a certain type of caption, can I provide a prompt to the model? I've been trying, but I'm not able to get the desired results.
You can store the result of image_captioner("sample_image.png") in a variable and then print it:
result = image_captioner("sample_image.png")
print(result)
The Illustrated Image Captioning using transformers - Ankur NLP Enthusiast