Would you like to share your hardware configuration (memory/GPU) and how long training takes? Thanks.
My server has 4x GTX 1070 (8 GB each) + 64 GB RAM + 2x Intel Xeon E5-2620 v4 (2.1 GHz) + 1 TB PCIe SSD + 1 TB SATA SSD.
However, to train one model with Attention with a tiny data loading time, you only need one PyTorch-compatible GPU, 3 threads, and one 500 GB SATA SSD devoted to storing the data and nothing else (WARNING: not the OS).
With a Pascal GPU on VQA 1.0 (VQA 2.0 will be added soon - it has twice as many questions/answers but the same images):
I guess the training time excludes the process of generating the ResNet-152 features.
Generating the train features takes 30 min.
By the way, the features used in our paper are available at https://github.com/Cadene/vqa.pytorch#features
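For reference, here is a minimal sketch of how such 2048 x 14 x 14 features can be extracted with torchvision's ResNet-152. This is not the repo's extraction script; the 448x448 input size, the image filename, and the use of a GPU are assumptions:

```python
# Minimal sketch (not the repo's extraction script) of pulling 2048 x 14 x 14
# ResNet-152 features with torchvision, assuming 448x448 inputs and a GPU.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

resnet = models.resnet152(pretrained=True)
# keep everything up to (and including) layer4, drop avgpool/fc
extractor = nn.Sequential(*list(resnet.children())[:-2]).eval().cuda()

preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# hypothetical image path
img = preprocess(Image.open('COCO_train2014_000000000009.jpg').convert('RGB'))
with torch.no_grad():
    feat = extractor(img.unsqueeze(0).cuda())  # -> (1, 2048, 14, 14)
```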
I was under the impression that VQA tasks always take weeks of training. OK, I will try it later.
Hey, why does the SSD need to store data only? Thanks for providing your code!
@ahmedmagdiosman I tried to load data from the SSD I use as a boot drive, and got high data loading times when training models with Attention (for instance, the OS may sometimes write to it or do other blocking work). In fact, you need to load data of dim (batch_size x 2048 x 14 x 14), which is really big. The models without Attention (NoAtt), however, only need to load data of dim (batch_size x 2048). So that is ok :)
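For a rough sense of scale, a back-of-the-envelope sketch assuming a batch size of 128 and float32 features:

```python
# Back-of-the-envelope sketch, assuming a batch size of 128 and float32 features
batch_size = 128
att_bytes = batch_size * 2048 * 14 * 14 * 4   # ~205 MB read per Attention batch
noatt_bytes = batch_size * 2048 * 4           # ~1 MB read per NoAtt batch
print(att_bytes / 1e6, noatt_bytes / 1e6)     # roughly 205.5 vs 1.0 (MB)
```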
An SSD is not required to run the models with Attention, but if you have high data loading times, be sure to use monitoring tools such as atop or htop to locate the bottleneck.
Lastly, I suspect that h5py/HDF5 is not well suited for this kind of read-intensive task. In fact, it seemed to work better in my old Torch7 code with torchnet.IndexedDataset. If I had the time, I would compare h5py/HDF5 and LMDB.
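If someone wants to check this on their own setup, a rough micro-benchmark sketch along these lines could help (the file path and dataset key below are made up; adapt them to your HDF5 layout):

```python
# Rough micro-benchmark sketch for timing random reads of single 2048 x 14 x 14
# features from an HDF5 file with h5py (file path and dataset key are made up).
import random
import time

import h5py

with h5py.File('data/vqa/extract/trainset.hdf5', 'r') as f:
    dset = f['att']                                  # assumed shape: (N, 2048, 14, 14)
    indices = random.sample(range(dset.shape[0]), 100)
    start = time.time()
    for i in indices:
        _ = dset[i]                                  # one random read per sample
    elapsed_ms = (time.time() - start) / len(indices) * 1000
    print('avg read: %.1f ms' % elapsed_ms)
```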
@Cadene Thanks a lot!
I actually had some really slow loading times with the Torch7 code from the MLB paper. I suspect it's also the HDF5 format, since I didn't have a problem loading *.npy files in pycaffe in the original MCB code.
In my experience, the best option is to use a pretrained Caffe model, e.g. the MCB code, and store the tensors as compressed numpy arrays. In that case the whole train set only takes 19 GB, so you can cache it in RAM.
For a Torch example, see github.com/ilija139/vqa-soft
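For illustration, a minimal sketch of the compressed-numpy idea in Python (the filenames are made up, and the random array stands in for a real feature from whatever CNN you use):

```python
# Minimal sketch of the compressed-numpy idea (paths are made up; the random
# array stands in for a real 2048 x 14 x 14 feature from your CNN).
import numpy as np

feat = np.random.rand(2048, 14, 14).astype(np.float32)  # stand-in for a real feature
np.savez_compressed('000000000009.npz', att=feat)

# Later: load it back (keeping the arrays in a dict is an easy way to cache them in RAM)
att = np.load('000000000009.npz')['att']
```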
The VQA dataset is the most widely used dataset in the visual question answering community, as it has the largest volume of human-annotated open-ended answers. Other datasets such as DAQUAR, COCO-QA, or Visual7W are limited in terms of size and annotation quality. These limitations make them less relevant than the VQA dataset for evaluating multimodal fusion models, and we do not provide an implementation for them (feel free to contact us if you need those datasets).
VQA 1.0 is made of several splits: train, val, and test-std (which includes test-dev). The biggest models are trained on the train + val splits as the training set, and the test-dev split is used for validation (on the evaluation server). Thus, for study purposes, the smallest setup provided in this repo uses the train split as trainset and the val split as valset. You can train/eval a model on this setup using the
trainsplit: train
option. See mutan_noatt_train.yaml.
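As a quick sanity check, you can load the options file and look for that entry before launching a run (a sketch only; the path under options/ and the exact key layout are assumptions, so adapt them to the actual yaml):

```python
# Sketch: inspect an options file to see which split it will train on.
# The path and key layout are guesses; adjust them to the real yaml.
import yaml

with open('options/vqa/mutan_noatt_train.yaml') as f:
    options = yaml.safe_load(f)

print(options)  # look for the trainsplit entry, e.g. 'train' or 'trainval'
```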