Hi, could you please provide more details about your experimental setting? I guess you are trying to perform OFA pretraining. For pretraining, according to the readme, the data files vision_language_examples.tsv (for multi-modal pretraining tasks), text_examples.tsv (for text-only pretraining tasks), image_examples.tsv (for image-only pretraining tasks), and detection_examples.tsv (for detection pretraining tasks) should be prepared to facilitate our multitask pretraining. These files have different schemas, as described in the readme (the schema you provided is the one for vision_language_examples.tsv). Have you prepared all of these files or only some of them? The error is most likely related to image_examples.tsv rather than vision_language_examples.tsv; image_examples.tsv should contain images together with their VQ-GAN codes. For this pretraining task, the images should be pre-resized to 256*256 and fed into the VQ-GAN to obtain the image code. For the other pretraining tasks and for fine-tuning, as mentioned in #106 (which is about image captioning fine-tuning), the resizing is done on the fly.
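For reference, a minimal preprocessing sketch (this is an illustration, not the repo's exact tooling; the file name and the JPEG serialization are assumptions, and the VQ-GAN encoding step itself is left as a placeholder):

```python
import base64
from io import BytesIO
from PIL import Image

def image_to_base64(img: Image.Image) -> str:
    """Serialize a PIL image to a base64 string."""
    buffer = BytesIO()
    img.save(buffer, format="JPEG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

# Pre-resize the image to 256*256 before it goes to the VQ-GAN
# (this is the step required for image_examples.tsv).
img = Image.open("example.jpg").convert("RGB")
img_256 = img.resize((256, 256))
b64_256 = image_to_base64(img_256)

# The VQ-GAN encoding (image -> discrete code) is not shown here;
# it depends on the VQ-GAN checkpoint and config you use.
```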
Thank you SO MUCH for your reply. I resized the images before feeding them into the VQ-GAN for image_examples.tsv, and that solved my previous problem. However, I now run into another error:
Traceback (most recent call last):
  File "../../train.py", line 528, in <module>
RuntimeError: The size of tensor a (1025) must match the size of tensor b (1024) at non-singleton dimension 3
Thanks for pointing this out! After the image is resized to 256*256, you also need to crop the middle 128*128 part and feed it to the VQ-GAN. Image infilling actually restores the middle part of the image. We will update the readme later.
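For illustration, a small sketch of the center crop (assuming PIL; the box coordinates select the middle 128*128 of the already-resized 256*256 image):

```python
from PIL import Image

# Assume the image has already been resized to 256*256.
img_256 = Image.open("example_256.jpg").convert("RGB")

# PIL's crop box is (left, upper, right, lower); take the central 128*128 region.
left = (256 - 128) // 2   # 64
upper = (256 - 128) // 2  # 64
mid_128 = img_256.crop((left, upper, left + 128, upper + 128))

# mid_128 is what gets fed to the VQ-GAN; its code serves as the
# image-infilling target, i.e. restoring the middle of the image.
```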
Hi, I've followed your instructions: 1) resize the original images to 256*256 and generate their base64 strings (image_string.tsv); 2) crop the middle part and generate its base64 string (mid_image_string.tsv); 3) use the base64 string of the middle part and the VQ-GAN to generate the image code (mid_image_code.tsv); 4) combine the base64 string of the 256*256 image and the code of the 128*128 crop into the input image_examples.tsv (see the sketch after this message).
But I still get the same RuntimeError: The size of tensor a (1025) must match the size of tensor b (1024) at non-singleton dimension 3 as described previously. I think the input format of my customized data is correct.
Thanks!
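For comparison with step 4 above, here is a minimal, hedged sketch of how a single row of image_examples.tsv could be assembled. The column layout used here (uniq-id, base64 of the full 256*256 image, VQ-GAN code of the middle 128*128 crop) is one reading of the readme's description, not a confirmed schema, so please double-check it against the readme:

```python
def make_row(uniq_id: str, b64_256: str, code_128: str) -> str:
    """One tab-separated line: id, full-image base64, middle-crop VQ-GAN code."""
    return f"{uniq_id}\t{b64_256}\t{code_128}\n"

# Placeholder values; in practice these come from steps 1 and 3 above.
b64_256 = "<base64 string of the 256*256 image>"
code_128 = "123 456 789"  # e.g. space-separated VQ-GAN token ids (assumed format)

with open("image_examples.tsv", "w") as f:
    f.write(make_row("0", b64_256, code_128))
```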
I'm a little confused, because a 128*128 image should produce a code of length 256. Could you check whether the length of the code is 256?
According to the VQ-GAN code, the code_image_size is 256, but the code I get has length 1024. I am figuring out what's going on.
@taokz Hi, I think you should set --code_image_size=128; otherwise the image will be resized to 256*256.
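As a quick sanity check on the code lengths discussed above (the VQ-GAN downsampling factor of 8 is inferred from the numbers in this thread: 128*128 gives 256 tokens, 256*256 gives 1024 tokens):

```python
# A VQ-GAN that downsamples by a factor of 8 turns an H*W image
# into (H // 8) * (W // 8) discrete code tokens.
factor = 8

print((128 // factor) ** 2)  # 256  -> expected with --code_image_size=128
print((256 // factor) ** 2)  # 1024 -> what you see if the image is handled as 256*256
```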
Thanks, it works!
Thank you for your amazing work.
When I use your code to pretrain the model, I get the following runtime error:
Traceback (most recent call last):
  File "../../train.py", line 528, in <module>
    cli_main()
  File "../../train.py", line 521, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/test/ofa/fairseq/fairseq/distributed/utils.py", line 374, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/home/test/ofa/fairseq/fairseq/distributed/utils.py", line 348, in distributed_main
    main(cfg, **kwargs)
  File "../../train.py", line 161, in main
    disable_iterator_cache=True,
  File "/home/test/ofa/utils/checkpoint_utils.py", line 288, in load_checkpoint
    epoch=1, load_dataset=True, **passthrough_args
  File "/home/test/ofa/trainer.py", line 666, in get_train_iterator
    self.reset_dummy_batch(batch_iterator.first_batch)
  File "/home/test/ofa/fairseq/fairseq/data/iterators.py", line 322, in first_batch
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/home/test/ofa/data/pretrain_data/unify_dataset.py", line 636, in collater
    res_v2 = collate(samples_v2, pad_idx=self.src_dict.pad(), eos_idx=self.eos)
  File "/home/test/ofa/data/pretrain_data/unify_dataset.py", line 70, in collate
    patch_images = torch.stack([sample['patch_image'] for sample in samples], dim=0)
RuntimeError: stack expects each tensor to be equal size, but got [3, 256, 256] at entry 0 and [3, 320, 390] at entry 2
The mismatched size is sometimes also [3, 320, 320]. I tried different datasets but got the same error, and it only happens when I use my customized visualization_language.tsv. I followed the readme to create the .tsv file as follows:
The base64 strings are generated by the code you provided, as shown in other issues.
Since the sizes of the images vary, should I resize the images before generating the base64 strings? However, according to #106, you state that
"The resizing from the raw size to the specified resolution is done on the fly during training and inference in the __getitem__ method of the PyTorch dataset."
I take this to mean that I do not need to resize the images myself and your code will do it for me. Could you clarify this? I appreciate your help, thank you!
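To illustrate what the quoted statement describes, here is a minimal sketch of on-the-fly resizing inside a PyTorch dataset's __getitem__ (not OFA's actual dataset class; the class name and the 480*480 resolution are placeholders):

```python
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class ExampleDataset(Dataset):
    """Illustrative only: resizes each raw image to a fixed resolution on the fly."""

    def __init__(self, image_paths, resolution=480):
        self.image_paths = image_paths
        # Resize from whatever raw size the image has to resolution*resolution,
        # then convert to a tensor, at load time rather than in preprocessing.
        self.transform = transforms.Compose([
            transforms.Resize((resolution, resolution)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        img = Image.open(self.image_paths[index]).convert("RGB")
        return self.transform(img)
```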