Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Question about SD format dataset, multi-batch && variable length frame T (train) #255

Closed xmc-andy closed 12 months ago

xmc-andy commented 12 months ago

Thank you for your great work! I am working on a single-prompt, multiple-image input task. I have read through the existing questions and answers and found similar issues; following those instructions, I tried converting my dataset to the SD format. With batchsize=1 the model trains, but when batchsize is set to >1, the following error occurs:

batch["net_input"]["patch_images"] = torch.stack([sample["patch_images"] for sample in samples], dim=0) RuntimeError: stack expects each tensor to be equal size, but got [1, 1, 3, 224, 224] at entry 0 and [1, 2, 3, 224, 224] at entry 1

For variable-length image input, is there a way to set the batchsize to be greater than 1?

Luodian commented 12 months ago

Hi, are you using your own dataset?

You can only stack tensors of the same shape into a batch.

So if you want to train them together, you may need a separate dataloader for each frame length T. In our SD format, the inputs are all in a two-image format.
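The suggestion above (batches that share one frame length T) can be sketched with a simple bucketing generator. This is a hypothetical illustration, not code from the Otter repo; the `patch_images` key and its `[1, T, 3, 224, 224]` shape are taken from the error message in this thread.

```python
# Sketch: group variable-length samples into same-shape batches so that
# torch.stack succeeds. One bucket per frame count T; a bucket is yielded
# as soon as it reaches batch_size, and partial buckets are flushed at the end.
import torch
from collections import defaultdict

def bucket_by_frames(samples, batch_size):
    """Yield lists of samples whose "patch_images" share the same frame count T.

    Each sample is assumed to hold a [1, T, 3, H, W] tensor under "patch_images"
    (the shape seen in the stack error above).
    """
    buckets = defaultdict(list)
    for sample in samples:
        t = sample["patch_images"].shape[1]  # frame dimension T
        buckets[t].append(sample)
        if len(buckets[t]) == batch_size:
            yield buckets.pop(t)
    for leftover in buckets.values():  # flush incomplete batches
        yield leftover

samples = [
    {"patch_images": torch.zeros(1, 1, 3, 224, 224)},
    {"patch_images": torch.zeros(1, 2, 3, 224, 224)},
    {"patch_images": torch.zeros(1, 1, 3, 224, 224)},
]
for batch in bucket_by_frames(samples, batch_size=2):
    # every batch is now homogeneous, so stacking works
    stacked = torch.stack([s["patch_images"] for s in batch], dim=0)
```

In a real training loop this logic would live in a `torch.utils.data.Sampler`/`BatchSampler` rather than a plain generator, but the grouping idea is the same.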

xmc-andy commented 12 months ago

Thanks for your quick reply! Yes, I am using my own dataset, and I am now trying to write a collate_fn-like process for input images of different lengths so that I can train with a larger batch size. In my case, what impact does batchsize=1 versus batchsize>1 have on the training results?
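A collate_fn along the lines described here would zero-pad the frame dimension T to the batch maximum and return a mask marking real frames. The sketch below is a generic PyTorch pattern, not the repo's actual collate logic; the `patch_images` and `frame_mask` names are illustrative, and the model would still need to respect the mask (e.g. in attention) for the padded frames to be harmless.

```python
# Sketch of a padding collate_fn for variable-length frame inputs.
# Each sample holds a [1, T_i, 3, H, W] tensor; we pad T_i up to max(T_i)
# with zeros and record which frames are real in a boolean mask.
import torch

def collate_pad_frames(samples):
    tensors = [s["patch_images"] for s in samples]   # each [1, T_i, 3, H, W]
    max_t = max(x.shape[1] for x in tensors)
    padded, masks = [], []
    for x in tensors:
        t = x.shape[1]
        if t < max_t:
            pad = x.new_zeros(x.shape[0], max_t - t, *x.shape[2:])
            x = torch.cat([x, pad], dim=1)           # pad along the frame axis
        padded.append(x)
        masks.append(torch.arange(max_t) < t)        # True for real frames
    return {
        "patch_images": torch.stack(padded, dim=0),  # [B, 1, max_T, 3, H, W]
        "frame_mask": torch.stack(masks, dim=0),     # [B, max_T]
    }
```

Passed as `DataLoader(..., collate_fn=collate_pad_frames)`, this lets batchsize>1 work with mixed frame counts; whether padding or per-length bucketing is preferable depends on how the model consumes the frame dimension.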