OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Apache License 2.0

Custom vis-lan dataset - RuntimeError: stack expects each tensor to be equal size, but got [3, 256, 256] at entry 0 and [3, 320, 390] at entry 2 #125

Closed. taokz closed this issue 2 years ago.

taokz commented 2 years ago

Thank you for your amazing work.

When I use your code to pretrain the model, I get the following runtime error:

Traceback (most recent call last):
  File "../../train.py", line 528, in <module>
    cli_main()
  File "../../train.py", line 521, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/test/ofa/fairseq/fairseq/distributed/utils.py", line 374, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/home/test/ofa/fairseq/fairseq/distributed/utils.py", line 348, in distributed_main
    main(cfg, kwargs)
  File "../../train.py", line 161, in main
    disable_iterator_cache=True,
  File "/home/test/ofa/utils/checkpoint_utils.py", line 288, in load_checkpoint
    epoch=1, load_dataset=True, passthrough_args
  File "/home/test/ofa/trainer.py", line 666, in get_train_iterator
    self.reset_dummy_batch(batch_iterator.first_batch)
  File "/home/test/ofa/fairseq/fairseq/data/iterators.py", line 322, in first_batch
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/home/test/ofa/data/pretrain_data/unify_dataset.py", line 636, in collater
    res_v2 = collate(samples_v2, pad_idx=self.src_dict.pad(), eos_idx=self.eos)
  File "/home/test/ofa/data/pretrain_data/unify_dataset.py", line 70, in collate
    patch_images = torch.stack([sample['patch_image'] for sample in samples], dim=0)
RuntimeError: stack expects each tensor to be equal size, but got [3, 256, 256] at entry 0 and [3, 320, 390] at entry 2

The mismatched size is sometimes [3, 320, 320] as well. I tried different datasets but got the same error, and it only happens when I use my customized visualization_language.tsv. I followed the readme to create the .tsv file with the following columns:

unique_id, image (base64 string), caption, question, answer, ground_truth objects, dataset name, task type

The base64 strings are generated with the code you provided, as shown in other issues.
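
For reference, my conversion looks roughly like this (a minimal sketch assuming PIL; the snippet posted in the other issues may differ in details):

```python
# Minimal sketch of producing the base64 image string for a TSV row.
# Assumes PIL; the snippet posted in other issues may differ in details.
from io import BytesIO
import base64

from PIL import Image


def image_to_base64(path):
    img = Image.open(path).convert("RGB")
    buf = BytesIO()
    img.save(buf, format="JPEG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")
```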

Since the sizes of the images vary, should I resize them before generating the base64 strings? However, according to issue #106, you state that:

The resizing from the raw size to the specified resolution is done on the fly during training and inference in the __getitem__ method of the PyTorch dataset.

I take this to mean that I do not need to resize the images myself and your code will do it for me. Could you clarify? I appreciate your help, thank you!

yangapku commented 2 years ago

Hi, could you please provide more details about your experimental setting? I guess you are trying to perform OFA pretraining. For pretraining, according to the readme, the following data files should be prepared to facilitate our multitask pretraining:

- vision_language_examples.tsv (for multi-modal pretraining tasks)
- text_examples.tsv (for text-only pretraining tasks)
- image_examples.tsv (for image-only pretraining tasks)
- detection_examples.tsv (for detection pretraining tasks)

These files have different schemas, as described in the readme (the schema you listed is the one for vision_language_examples.tsv). Have you prepared all of these files, or just some of them?

The error is most likely related to image_examples.tsv (which should contain images together with their VQ-GAN codes) rather than to vision_language_examples.tsv. For that pretraining task, the images should be pre-resized to 256*256 and fed into the VQ-GAN to obtain the image codes. For the other pretraining tasks and for fine-tuning, as mentioned in #106 (which concerns image-captioning finetuning), the resizing is done on the fly.
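
As a rough illustration, the pre-resizing step looks something like the following (a minimal sketch assuming PIL; the exact preprocessing may differ):

```python
# Sketch only: pre-resize an image to 256x256 before feeding it to the
# VQ-GAN when preparing image_examples.tsv. Assumes PIL; the exact
# preprocessing (interpolation, color handling) may differ.
from PIL import Image


def preresize_for_vqgan(path, size=256):
    img = Image.open(path).convert("RGB")
    return img.resize((size, size), Image.BICUBIC)
```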

taokz commented 2 years ago

Thank you SO MUCH for your reply. I resized the images before feeding them into the VQ-GAN for image_examples.tsv and that solved my previous problem. However, I now encounter another error: RuntimeError: The size of tensor a (1025) must match the size of tensor b (1024) at non-singleton dimension 3

Traceback (most recent call last):
  File "../../train.py", line 528, in <module>
    cli_main()
  File "../../train.py", line 521, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/kaz321/omnimed/fairseq/fairseq/distributed/utils.py", line 374, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/home/kaz321/omnimed/fairseq/fairseq/distributed/utils.py", line 348, in distributed_main
    main(cfg, kwargs)
  File "../../train.py", line 190, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/home/kaz321/anaconda3/envs/omnimed/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, *kwds)
  File "../../train.py", line 301, in train
    log_output = trainer.train_step(samples)
  File "/home/kaz321/anaconda3/envs/omnimed/lib/python3.7/contextlib.py", line 74, in inner
    return func(args, kwds)
  File "/home/kaz321/omnimed/trainer.py", line 806, in train_step
    raise e
  File "/home/kaz321/omnimed/trainer.py", line 780, in train_step
    extra_kwargs,
  File "/home/kaz321/omnimed/tasks/ofa_task.py", line 318, in train_step
    loss, sample_size, logging_output = criterion(model, sample, update_num=update_num)
  File "/home/kaz321/anaconda3/envs/omnimed/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(input, kwargs)
  File "/home/kaz321/omnimed/criterions/label_smoothed_cross_entropy.py", line 180, in forward
    loss_v2, sample_size_v2, logging_output_v2 = self.forward(model, sample[1], update_num, reduce)
  File "/home/kaz321/omnimed/criterions/label_smoothed_cross_entropy.py", line 199, in forward
    net_output = model(sample["net_input"])
  File "/home/kaz321/anaconda3/envs/omnimed/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(input, kwargs)
  File "/home/kaz321/omnimed/models/ofa/ofa.py", line 107, in forward
    return_all_hiddens=return_all_hiddens,
  File "/home/kaz321/anaconda3/envs/omnimed/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kaz321/omnimed/models/ofa/unify_transformer.py", line 1169, in forward
    alignment_heads=alignment_heads,
  File "/home/kaz321/omnimed/models/ofa/unify_transformer.py", line 1193, in extract_features
    alignment_heads,
  File "/home/kaz321/omnimed/models/ofa/unify_transformer.py", line 1316, in extract_features_scriptable
    self_attn_bias[~code_masks] += self.get_rel_pos_bias(all_prev_output_tokens, idx).unsqueeze(0)
RuntimeError: The size of tensor a (1025) must match the size of tensor b (1024) at non-singleton dimension 3

logicwong commented 2 years ago

Thanks for pointing this out! After the image is resized to 256*256, you also need to crop out the middle 128*128 part and feed that crop to the VQ-GAN. Image infilling actually restores the middle part of the image. We will update the readme later.
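
Roughly, the preprocessing looks like this (a minimal sketch assuming PIL; the actual script may differ):

```python
# Sketch only: resize to 256x256, then crop the central 128x128 region and
# feed that crop to the VQ-GAN; the full resized image stays as the base64
# input column. Assumes PIL; the actual preprocessing script may differ.
from PIL import Image


def middle_crop(path, full=256, crop=128):
    img = Image.open(path).convert("RGB").resize((full, full), Image.BICUBIC)
    offset = (full - crop) // 2          # 64 when going from 256 to 128
    return img.crop((offset, offset, offset + crop, offset + crop))
```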


taokz commented 2 years ago

Hi, I've followed your instructions: 1) resized the original images to 256*256 and generated their base64 strings -- image_string.tsv; 2) cropped the middle part and generated its base64 strings -- mid_image_string.tsv; 3) used the base64 strings of the middle parts and the VQ-GAN to generate the image codes -- mid_image_code.tsv; 4) combined the base64 string of the 256*256 image and the code of the 128*128 crop as the input image_examples.tsv.

But I still get the RuntimeError: The size of tensor a (1025) must match the size of tensor b (1024) at non-singleton dimension 3 described above. I believe the input format of my customized data is correct.

Thanks!

logicwong commented 2 years ago

I'm a little confused, because a 128*128 image should produce a code of length 256. Could you check whether the length of your code is 256?

taokz commented 2 years ago

According to the VQ-GAN code, code_image_size is set to 256, and the code I got has length 1024. I am figuring out what's going on.

logicwong commented 2 years ago

@taokz Hi, I think you should set --code_image_size=128, otherwise the image will be resized to 256*256.
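
For context, a rough sanity check on the code lengths, assuming the VQ-GAN downsamples by a factor of 8:

```python
# Rough check (assumption: VQ-GAN downsampling factor of 8).
# code_image_size=128 -> (128 // 8) ** 2 = 256 code tokens (what the model expects)
# code_image_size=256 -> (256 // 8) ** 2 = 1024 code tokens (the mismatch seen above)
for size in (128, 256):
    print(size, (size // 8) ** 2)
```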

taokz commented 2 years ago

Thanks, it works!