Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (an open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Which datasets are used to train the image-text model? #254

Open — wjfwjfwjf opened this issue 1 year ago

wjfwjfwjf commented 1 year ago

Before you open an issue, please check if a similar issue already exists or has been closed before.

When you open an issue, please be sure to include the following

Thank you for your contributions!

wjfwjfwjf commented 1 year ago

It seems I cannot use DC and SD to train together?

```
Traceback (most recent call last):
  File "/mnt/disk2/home/wujianfeng/otter/Otter/pipeline/train/instruction_following.py", line 837, in <module>
    main()
  File "/mnt/disk2/home/wujianfeng/otter/Otter/pipeline/train/instruction_following.py", line 731, in main
    train_one_epoch(
  File "/mnt/disk2/home/wujianfeng/otter/Otter/pipeline/train/instruction_following.py", line 76, in train_one_epoch
    for num_steps, (batch_mimicits) in tqdm(
  File "/home/wujianfeng/miniconda3/envs/otter/lib/python3.9/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/home/wujianfeng/miniconda3/envs/otter/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/wujianfeng/miniconda3/envs/otter/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/wujianfeng/miniconda3/envs/otter/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/wujianfeng/miniconda3/envs/otter/lib/python3.9/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/wujianfeng/miniconda3/envs/otter/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/wujianfeng/miniconda3/envs/otter/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/wujianfeng/otter/Otter/pipeline/mimicit_utils/mimicit_dataset.py", line 682, in collate
    res_v1 = collate_fn(
  File "/home/wujianfeng/otter/Otter/pipeline/mimicit_utils/mimicit_dataset.py", line 719, in collate_fn
    batch["net_input"]["patch_images"] = torch.stack([sample["patch_images"] for sample in samples], dim=0)
RuntimeError: stack expects each tensor to be equal size, but got [1, 2, 3, 224, 224] at entry 0 and [1, 32, 3, 224, 224] at entry 4
```

Luodian commented 1 year ago

Yes, unless you set the batch_size to 1, you cannot use image and video datasets together, because their samples have different shapes.

Otter is designed to support multi-modal in-context instruction tuning based on the OpenFlamingo model, which involves conditioning the language model on the corresponding media, such as an image that corresponds to a caption or an instruction-response pair.

More specifically, Otter supports both image and video inputs. We interpret the input images as a six-dimensional tensor [B, N, T, C, W, H], where B is the batch size, and N and T correspond to the in-context images and the video frames, respectively (T=1 for images). The remaining dimensions C, W, H are the image's channels, width, and height. By organizing the N and T dimensions, we can construct in-context input samples tailored for both images and videos.

In your case, the T dimensions of SD and DC differ (2 vs. 32 in your traceback), so their samples cannot be stacked into a single batch; see the sketch below.
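To make the failure concrete, here is a minimal sketch using the shapes from the traceback above. The `pad_and_stack` helper is a hypothetical workaround, not Otter's actual `collate_fn`:

```python
import torch

# Per-sample "patch_images" tensors, shaped [N, T, C, W, H] as described above.
image_like_sample = torch.randn(1, 2, 3, 224, 224)   # e.g. an SD sample with T=2
video_like_sample = torch.randn(1, 32, 3, 224, 224)  # e.g. a DC sample with T=32

# The default collate stacks samples directly; torch.stack requires
# every tensor to have an identical shape, hence the RuntimeError.
try:
    torch.stack([image_like_sample, video_like_sample], dim=0)
except RuntimeError as e:
    print(e)  # "stack expects each tensor to be equal size..."

# One possible workaround (an assumption, not Otter's implementation):
# zero-pad every sample along T to the batch maximum before stacking.
def pad_and_stack(samples):
    max_t = max(s.shape[1] for s in samples)
    padded = [
        torch.cat([s, s.new_zeros(s.shape[0], max_t - s.shape[1], *s.shape[2:])], dim=1)
        for s in samples
    ]
    return torch.stack(padded, dim=0)  # [B, N, max_T, C, W, H]

batch = pad_and_stack([image_like_sample, video_like_sample])
print(batch.shape)  # torch.Size([2, 1, 32, 3, 224, 224])
```

Note that zero-padding feeds blank frames to the vision encoder, so grouping same-shaped samples into separate batches, or using batch_size 1 as suggested above, is the safer route.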

wjfwjfwjf commented 1 year ago

Thank you. If I want to reproduce the results of the image-text model, which datasets should I use?

Luodian commented 11 months ago

> Thank you. If I want to reproduce the results of the image-text model, which datasets should I use?

Hi, which result do you mean? The short answer is that you can use only the LA subset to obtain good results, but that is not the best version of Otter.

We will release a revised Otter paper that shows which datasets (we have tried more than 20) we used for instruction tuning, and how they help the model gain knowledge and improve on different benchmarks.

gordonhu608 commented 6 months ago

So, is there any update on this issue? What would be the best data for Otter, even using solely the current MIMIC-IT data? I have been trying for so long and am very confused about which data are IMAGE_TEXT, IMAGE_TEXT_IN_CONTEXT, and VIDEO_TEXT. Could a training-data YAML just for the Otter model be provided, like the one in Demo_Data.yaml? Thank you!
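For readers hitting the same confusion, a grouping in the spirit of Demo_Data.yaml might look like the sketch below. The three group names come from this thread; the dataset keys (LA, DC) follow MIMIC-IT naming mentioned above, but the per-dataset fields (`mimicit_path`, `images_path`) and file paths are illustrative assumptions, not the repository's verified schema:

```python
import yaml  # requires PyYAML

# Hypothetical training-data config in the style of Demo_Data.yaml.
# Treat the keys and paths as placeholders to be checked against the repo.
config_text = """
IMAGE_TEXT:
  LA:
    mimicit_path: path/to/LA/LA_instructions.json
    images_path: path/to/LA/LA.json
IMAGE_TEXT_IN_CONTEXT:
  LA_IN_CONTEXT:
    mimicit_path: path/to/LA/LA_instructions.json
    images_path: path/to/LA/LA.json
VIDEO_TEXT:
  DC:
    mimicit_path: path/to/DC/DC_instructions.json
    images_path: path/to/DC/DC.json
"""

config = yaml.safe_load(config_text)
print(sorted(config.keys()))  # ['IMAGE_TEXT', 'IMAGE_TEXT_IN_CONTEXT', 'VIDEO_TEXT']
```

Given the shape discussion earlier in this thread, image-style and video-style groups would still need to be batched separately (or with batch_size 1) unless the collate pads frames to a common length.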