How to calculate the number of data in the cc_sbu and laion respectively?

Vision-CAIR / MiniGPT-4

Open-sourced codes for MiniGPT-4 and MiniGPT-v2 (https://minigpt-4.github.io, https://minigpt-v2.github.io/)

BSD 3-Clause "New" or "Revised" License

How to calculate the number of data in the cc_sbu and laion respectively? #176

Open Richar-Du opened 1 year ago

Richar-Du commented 1 year ago

I downloaded the cc_sbu dataset and counted the samples. The total is 12M, and more than 6M downloaded successfully. That seems impossible, since cc_sbu + laion is only about 5M according to your paper. Because webdataset provides an iterable dataloader, len is not implemented. How can I count the number of samples in the downloaded cc_sbu and laion datasets?
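One way to get a count without len is to read the tar shards directly. The snippet below is a minimal sketch, not something from the MiniGPT-4 repo: it assumes the download produced standard WebDataset-style .tar shards, where files sharing the same name up to the first dot (for example 000000001.jpg and 000000001.txt) belong to one sample. The function names and the shard path in the usage note are hypothetical.

```python
import glob
import os
import tarfile


def count_samples_in_shard(shard_path: str) -> int:
    """Count distinct sample keys in one WebDataset-style tar shard."""
    keys = set()
    with tarfile.open(shard_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            directory, filename = os.path.split(member.name)
            # WebDataset convention: the sample key is the path up to the
            # first dot of the file name; the rest is the extension.
            key = os.path.join(directory, filename.split(".", 1)[0])
            keys.add(key)
    return len(keys)


def count_samples(shard_glob: str) -> int:
    """Sum the sample counts over every shard matching a glob pattern."""
    return sum(count_samples_in_shard(path) for path in sorted(glob.glob(shard_glob)))
```

For example, count_samples("/path/to/cc_sbu_dataset/*.tar") would return the number of samples actually stored on disk for that dataset (the path is a placeholder). This only reads tar headers and file names, so it does not decode any images, but it still has to scan every shard once.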

TsuTikgiau commented 1 year ago

Hello! The whole dataset is large, but we only use a small part of it. In our stage 1 training setup, we use 4 A100 80G GPUs, each with a batch size of 64, so the total batch size is 256. We train the model in the first stage for 20k steps, so the total data consumed in the first stage is 20k * 256 = 5.12M samples.
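The arithmetic above can be checked with a few lines of Python. This is only an illustration of the calculation using the numbers given in this reply; the function name is made up for the example.

```python
def samples_consumed(num_gpus: int, per_gpu_batch_size: int, steps: int) -> int:
    """Total samples seen in training = global batch size * number of steps."""
    global_batch_size = num_gpus * per_gpu_batch_size
    return global_batch_size * steps


# Stage 1 settings quoted above: 4 GPUs, batch size 64 each, 20k steps.
total = samples_consumed(num_gpus=4, per_gpu_batch_size=64, steps=20_000)
print(f"{total:,} samples (~{total / 1e6:.2f}M)")
```

So the 5M figure reflects how many samples training consumes, not how many samples exist in the downloaded shards.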

Richar-Du commented 1 year ago

Thanks for your reply! So the first stage randomly samples (image, caption) pairs from the cc_sbu and laion datasets, and the total number is determined by the training steps and batch size. Does WebDataset.Pipeline guarantee that the sampled data are not repeated?