Thanks.
It is because the data loader would load the image features into memory 4 times (i.e., 4 copies of the image features are in memory). The `num_workers` setting in the PyTorch data loader only speeds up data loading by working around the GIL (each worker is a separate process). Since the image features are already in memory, `num_workers > 1` is not needed; setting `num_workers=1` is efficient.
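As a minimal illustration (a toy sketch, not this repo's loader): each worker process gets its own copy of whatever the `Dataset` object keeps in memory, which is why the in-memory feature table can be duplicated per worker.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class InMemoryFeatures(Dataset):
    """Toy dataset that keeps all image features in RAM, like the tsv-based loader here."""
    def __init__(self, num_images=100, boxes=36, dim=2048):
        # One big tensor held by the dataset object.
        self.feats = torch.randn(num_images, boxes, dim)

    def __len__(self):
        return self.feats.size(0)

    def __getitem__(self, idx):
        return self.feats[idx]

# Each worker process holds its own copy of the dataset object (and its features),
# so num_workers=4 can roughly quadruple the feature memory.
# num_workers=0 loads in the main process and keeps a single copy.
loader = DataLoader(InMemoryFeatures(), batch_size=32, num_workers=0)
```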
Thanks for replying.
It is because you have set `num_workers` to 4, so there will be 4 copies.
The Faster R-CNN is trained only on object detection and is frozen when extracting features.
Could you suggest a better way of loading features? I'm not able to fit them even with `num_workers=1`. Should I use the Faster R-CNN from torchvision (it is also pretrained on COCO) for VQA and obtain features on the fly? But LXMERT + Faster R-CNN won't fit on a single GPU, will it? How did you manage?
May I ask how large your main memory is?
You might use the `fp16` branch to halve the memory usage with the command `git checkout fp16`. If it still exceeds the memory limit, the code might need to load features from disk.
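As a rough illustration of where the halving comes from, assuming the branch stores the extracted Faster R-CNN features in half precision (the actual branch may differ):

```python
import numpy as np

# Hypothetical example: one image's 36 x 2048 RoI features in float32 vs float16.
feats32 = np.random.randn(36, 2048).astype(np.float32)
feats16 = feats32.astype(np.float16)  # half the bytes, small precision loss

print(feats32.nbytes)  # 294912 bytes
print(feats16.nbytes)  # 147456 bytes
```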
I have 8 cores in my GCP instance (around 48 GB of RAM). I just tried with `num_workers=0` (it worked, but it was slow). I am thinking of using pandas with `chunksize`; reading the whole very large tsv file at once seems inefficient. What do you think?
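Something along these lines, where the path and column names are placeholders rather than the repo's actual schema:

```python
import pandas as pd

# Hypothetical sketch of chunked reading: process the feature tsv a few thousand
# rows at a time instead of materializing the whole file in memory at once.
FIELDNAMES = ["img_id", "num_boxes", "boxes", "features"]  # placeholder columns
for chunk in pd.read_csv("path/to/features.tsv", sep="\t",
                         names=FIELDNAMES, chunksize=5000):
    for _, row in chunk.iterrows():
        pass  # decode / filter / convert only the rows that are actually needed
```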
Thanks. With 48 GB of main memory, it should work fine with `fp16`, which takes around 30 GB of memory.
Loading features from disk is definitely possible; multiple workers should be involved to balance the loading. The current choice loads everything from memory, which removes the loading cost entirely, but it is inefficient when the memory is not big enough.
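A rough sketch of the disk-based option, assuming the features were pre-saved as one `.npy` file per image (which is not how the current code works):

```python
import os
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class DiskFeatures(Dataset):
    """Hypothetical lazy loader: read each image's features from disk on demand."""
    def __init__(self, feat_dir, img_ids):
        self.feat_dir = feat_dir
        self.img_ids = img_ids

    def __len__(self):
        return len(self.img_ids)

    def __getitem__(self, idx):
        path = os.path.join(self.feat_dir, f"{self.img_ids[idx]}.npy")
        feats = np.load(path)                   # e.g. (num_boxes, 2048) per image
        return torch.from_numpy(feats).float()

# With the I/O inside __getitem__, several workers overlap disk reads with training:
# loader = DataLoader(DiskFeatures("feats/", img_ids), batch_size=32, num_workers=4)
```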
Thanks for your reply. In your experiment, did you try other configurations (9, 6, 6 for L, X, R)?
I did not try them given the computational resources.
Thanks.
For pretraining, you used all 5 datasets (Visual Genome, MS COCO, VQA 2.0, GQA, VG-QA). I wanted to ask: when fine-tuning on VQA, the model will come across the same sentence and image pairs it encountered during pretraining, right? So during fine-tuning, has the model already been exposed to the dataset being used for fine-tuning?
You don't seem to be using an `lr_scheduler`; is there any reason for that?
May I ask whether you would consider using VQA in pre-training a problematic setup? And could you specify the reason? Actually, using part of the data in pre-training is a common strategy given the limited data available. As long as it does not touch the test data, the improvement on the test data would be considered solid. For example, every work following bottom-up attention (if you are not familiar with that thread of VQA works, it generally includes every recent work on VQA in the last two years) uses an object detector pretrained on Visual Genome. Visual Genome contains half of the VQA training images, which means that the ground-truth object annotations of the training images are used in the pre-training of every VQA paper. However, the test data have never been touched in training the detection system, thus the validity of these works still holds.
It has a triangular lr scheduler inside the optimizer.
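For illustration, a triangular schedule of this kind applies a multiplier on the base learning rate roughly like the sketch below (linear warmup followed by linear decay); this is an approximation, not the optimizer's exact code.

```python
def triangular_lr_multiplier(progress, warmup=0.1):
    """progress: fraction of total training steps completed, in [0, 1].
    warmup:   fraction of steps spent linearly ramping the lr up."""
    if progress < warmup:
        return progress / warmup                         # warmup: 0 -> 1
    return max(0.0, (1.0 - progress) / (1.0 - warmup))   # decay: 1 -> 0
```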
Thanks for replying. Sorry, I just wanted to learn more about the pretraining and I don't consider it problematic. The reason why I asked is that pretraining isn't feasible for me right now.
I am working specifically on the fine-tuning part, and I think the effect of fine-tuning will not be predominant if the model has already seen the data during pretraining. Although the performance will improve, as your paper (and other works, e.g., ViLBERT, VLBERT, UNITER) have shown, I think there would be some upper bound imposed on the performance to some extent (moreover, these datasets have some overlap). For example, one could use non-COCO images and pairs (along with your proposed objectives) for pretraining, and use VQA (which has COCO images) for fine-tuning (similar to what we do in ImageNet training).
Thanks for providing such a wonderful codebase; it's really helpful. I wanted to clarify whether the pretrained model you provide has been trained on all 4 objectives (image QA, cross-modality matching, masked object prediction, masked cross-modality LM)?
ViLBERT is mainly trained with Conceptual Captions, which contains (mostly) out-of-domain images and data. However, another paper, i.e. UNITER, somehow shows that a clean dataset would still be better in handling vision-and-language tasks (see Table 3). On out-of-domain datasets (NLVR2), the COCO + VG setup still wins although Conceptual Captions has 10x more images. I currently do not have a clear answer for it and I am waiting for more results. A clear comparison between clean/in-domain/small datasets and noisy/out-of-domain/large datasets requires too many computational resources.
Yes. The losses are added together.
Hi,
In the paper, you've reported results on the test set, but `predict` (in this repo) provides results on validation. How did you calculate results on test-dev and test-std?
Thanks. The results on test-dev and test-std require using the test servers. The detailed process for each dataset is provided at the end of each section, e.g., https://github.com/airsplay/lxmert#submitted-to-vqa-test-server, https://github.com/airsplay/lxmert#submitted-to-gqa-test-server, https://github.com/airsplay/lxmert#unreleased-test-sets
Hi, could you please share the attention visualization code (as in the appendix of your paper)? It would be really useful from an interpretability point of view.
Currently, I have not found a clean way to fetch the attention graphs, so the code is badly organized. I just gathered all the outputs, saved them in tsv files, and visualized them in an ipynb notebook. So for now, I do not have a plan to release it.
How did you gather the output? If you could point to the line in your current codebase where you got the output from, that'd be really helpful.
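In case it helps as a starting point, here is a rough, generic sketch of collecting attention maps with forward hooks; the module-name filter is a placeholder and this is not the visualization code used for the paper.

```python
import torch

attn_maps = {}

def save_attention(name):
    # Records whatever the hooked module returns; useful when the attention
    # module's forward output is (or contains) the attention probabilities.
    def hook(module, inputs, output):
        attn_maps[name] = output.detach().cpu() if torch.is_tensor(output) else output
    return hook

# Hypothetical usage: hook every module whose name suggests attention, run the
# predict loop once, then dump attn_maps to tsv/npz and plot them in a notebook.
# for name, module in model.named_modules():
#     if "attention" in name:
#         module.register_forward_hook(save_attention(name))
```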
Thanks for sharing this code. When I'm fine-tuning on VQA, my RAM usage blows up. With `num_workers` set to 4, it requires 207 GB. I've also tried different batch sizes. The script runs successfully with the `--tiny` flag, but when I load both `train` and `nominival`, the memory usage blows up and I get a "memory can't be allocated" error. Do you know a workaround for this? I think it is because we are storing all the features from the Faster R-CNN in RAM?