airsplay / lxmert

PyTorch code for EMNLP 2019 paper "LXMERT: Learning Cross-Modality Encoder Representations from Transformers".
MIT License

CPU memory usage is too high and other queries #37

Closed: prajjwal1 closed this issue 4 years ago

prajjwal1 commented 4 years ago

Thanks for sharing this code. When I'm finetuning on VQA, my RAM usage blows up. With num_workers set to 4, it requires 207 GB. I've tried different batch sizes as well. The script runs successfully with the --tiny flag, but when I load both train and nominival, the memory usage blows up and I get a "memory can't be allocated" error. Do you know a workaround for this? I think it is because we are storing all the features from faster_rcnn in RAM?

airsplay commented 4 years ago

Thanks.

It is because the data loader loads the image features into memory 4 times (i.e., 4 copies of the image features end up in memory).

The num_workers setting in the PyTorch data loader only speeds up data loading by working around the GIL. Since the image features are already in memory, num_workers > 1 is not needed; setting num_workers=1 is efficient.
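A minimal sketch of the point above, assuming a hypothetical in-memory dataset (not the repo's actual VQA dataset class): when all features already live in RAM, keeping num_workers at 0 or 1 avoids duplicating them across worker processes.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class InMemoryFeatureDataset(Dataset):      # hypothetical stand-in for the VQA dataset
    def __init__(self, features, targets):
        self.features = features            # e.g. a big tensor of Faster R-CNN features
        self.targets = targets

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]

features = torch.randn(1000, 36, 2048)      # toy data: 1000 images, 36 boxes, 2048-d each
targets = torch.randint(0, 3129, (1000,))   # toy answer labels
loader = DataLoader(InMemoryFeatureDataset(features, targets),
                    batch_size=32, shuffle=True,
                    num_workers=0)          # features already sit in RAM, so no extra workers
```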

prajjwal1 commented 4 years ago

Thanks for replying.

  1. Why 4? Are you using multi-scale features, e.g., 4 feature maps extracted from intermediate layers of the backbone?
  2. Did you train the faster_rcnn on COCO with cross-entropy only, or did you use the five pretraining objectives mentioned with BERT to train it? I'm asking because I want to extend your approach beyond VQA, and I will have to train faster_rcnn myself since precomputed features won't be available.
airsplay commented 4 years ago
  1. It is because you have set num_workers to 4, so there will be 4 copies.

  2. The faster_rcnn is only trained with object detection and is frozen when extracting features.

prajjwal1 commented 4 years ago

Could you suggest a better way of loading features? I'm not able to fit them even with num_workers=1. Should I use the faster_rcnn from torchvision (it is also pretrained on COCO) for VQA and obtain features on the fly? But then LXMERT + faster_rcnn won't fit on a single GPU, will it? How did you manage?

airsplay commented 4 years ago

May I ask how large your main memory is? You might use the fp16 branch to halve the memory usage with the command:

git checkout fp16

If it still exceeds the memory limitation, the code might need to load features from disk.
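To illustrate the fp16 idea (a sketch of the general trick, not the fp16 branch's actual code): storing the precomputed region features as float16 roughly halves their memory footprint, and they can be cast back to float32 right before the forward pass.

```python
import numpy as np

feats32 = np.random.randn(36, 2048).astype(np.float32)  # one image's region features
feats16 = feats32.astype(np.float16)                     # half the memory

print(feats32.nbytes, feats16.nbytes)                    # 294912 vs 147456 bytes

# ... at batch time, restore precision for the model:
batch = feats16.astype(np.float32)
```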

prajjwal1 commented 4 years ago

I have 8 cores on my GCP instance (around 48 GB of RAM). I just tried with num_workers=0 (it worked, but slowly). I am thinking of using pandas with chunksize; the current implementation, which reads the whole very large TSV file at once, seems inefficient. What do you think?
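A rough sketch of the chunked-reading idea mentioned above; the file name, chunk size, and per-row handler are placeholders, not the repo's actual TSV schema or loader.

```python
import pandas as pd

def handle_row(row):
    """Hypothetical per-row handler, e.g. decode and store one image's features."""
    pass

# Stream the TSV in pieces instead of materializing the whole file in memory.
for chunk in pd.read_csv("features.tsv", sep="\t", chunksize=1000):
    for _, row in chunk.iterrows():
        handle_row(row)
```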

airsplay commented 4 years ago

Thanks. With 48 GB of main memory, it should work fine with fp16, which takes around 30 GB of memory.

Loading features from disk is definitely possible; multiple workers should then be involved to overlap the loading. The current design loads everything into memory, which ultimately removes the per-batch loading cost, but it is inefficient when memory is not large enough.
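A sketch of the disk-loading alternative, under the assumption of one .npy feature file per image (this layout is illustrative, not the repo's format): each __getitem__ triggers a disk read, so multiple DataLoader workers help overlap I/O with training.

```python
import os
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class DiskFeatureDataset(Dataset):
    def __init__(self, feat_dir, img_ids):
        self.feat_dir = feat_dir
        self.img_ids = img_ids                  # image ids with saved feature files

    def __len__(self):
        return len(self.img_ids)

    def __getitem__(self, idx):
        img_id = self.img_ids[idx]
        # Lazily load one image's (36, 2048) region features from disk.
        feats = np.load(os.path.join(self.feat_dir, f"{img_id}.npy"))
        return torch.from_numpy(feats), img_id

# With per-item disk reads, several workers now pay off:
# loader = DataLoader(DiskFeatureDataset("feats/", img_ids), batch_size=32, num_workers=4)
```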

prajjwal1 commented 4 years ago

Thanks for your reply. In your experiments, did you try other configurations (e.g., 9, 6, 6 for L, X, R)?

airsplay commented 4 years ago

I did not try them, given the computational resources required.

prajjwal1 commented 4 years ago

Thanks.

  1. For pretraining, you used all 5 datasets (Visual Genome, MS COCO, VQA 2.0, GQA, VG-QA). I wanted to ask: when finetuning on VQA, the model will come across the same sentence-image pairs it encountered during pretraining, right? So during finetuning, has the model already been exposed to the dataset being used for finetuning?

  2. You don't seem to be using an lr_scheduler; is there any reason for that?

airsplay commented 4 years ago
  1. May I ask whether you consider using VQA in pre-training a problematic setup? And could you specify the reason? Actually, using part of the downstream data in pre-training is a common strategy given the limited amount of data. As long as it does not touch the test data, an improvement on the test data is still considered solid. For example, every work following bottom-up attention (if you are not familiar with this thread of VQA work, it generally includes every recent VQA paper of the last two years) takes an object detector pretrained on Visual Genome. Visual Genome contains half of the VQA training images, which means the ground-truth object annotations of those training images are used in the pre-training of every VQA paper. However, the test data have never been touched in training the detection system, thus the validity of these works still holds.

  2. It has a triangular lr scheduler inside the optimizer (a rough sketch of such a schedule is shown below).
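An illustrative sketch of a triangular schedule (linear warm-up, then linear decay) of the kind baked into BERT-style optimizers; the function below is a generic example, not the optimizer's internal code.

```python
def triangular_lr(step, total_steps, base_lr, warmup=0.1):
    progress = step / total_steps
    if progress < warmup:
        return base_lr * progress / warmup   # linear warm-up from 0 towards base_lr
    return base_lr * (1.0 - progress)        # linear decay back towards 0

# e.g. base_lr=5e-5, total_steps=10000, warmup=0.1:
print(triangular_lr(500, 10000, 5e-5))   # 2.5e-05, still warming up
print(triangular_lr(5000, 10000, 5e-5))  # 2.5e-05, halfway through the decay
```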

prajjwal1 commented 4 years ago

Thanks for replying. Sorry, I just wanted to learn more about the pretraining; I don't consider it problematic. The reason I asked is that pretraining isn't feasible for me right now.

  1. I am working specifically on the finetuning part, and I think the effect of finetuning will not be as pronounced if the model has already seen the data during pretraining. Although performance improves, as your paper (and other works, e.g., ViLBERT, VL-BERT, UNITER) have shown, I think this imposes some upper bound on performance to some extent (moreover, these datasets overlap). For example, one could use non-COCO image-sentence pairs (along with your proposed objectives) for pretraining and use VQA (which has COCO images) for finetuning, similar to what we do with ImageNet pretraining.

  2. Thanks for providing such a wonderful codebase; it is really helpful. I wanted to clarify whether the pretrained model you provide was trained with all 4 objectives (image QA, cross-modality matching, masked object prediction, masked cross-modality LM)?

airsplay commented 4 years ago
  1. ViLBERT is mainly trained with Conceptual Captions, which contains (mostly) out-of-domain images and data. However, another paper, UNITER, shows that a clean dataset can still be better for handling vision-and-language tasks (see its Table 3). Even on an out-of-domain dataset (NLVR2), the COCO + VG setup still wins, although Conceptual Captions has 10x more images. I currently do not have a clear answer for this and am waiting for more results. A clean comparison between clean/in-domain/small datasets and noisy/out-of-domain/large datasets would require too many computational resources.

  2. Yes. The losses are simply added together (see the sketch below).
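A minimal sketch of summing multiple objective losses into one backward pass; the loss names and values below are illustrative stand-ins, not the repo's variables.

```python
import torch

# Toy stand-ins for the four pre-training losses; in the real model each comes
# from its own output head.
masked_lm_loss  = torch.tensor(2.3, requires_grad=True)
masked_obj_loss = torch.tensor(1.1, requires_grad=True)
matching_loss   = torch.tensor(0.7, requires_grad=True)
qa_loss         = torch.tensor(1.9, requires_grad=True)

total_loss = masked_lm_loss + masked_obj_loss + matching_loss + qa_loss
total_loss.backward()   # one backward pass over the summed objective
```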

prajjwal1 commented 4 years ago

Hi, in the paper you report results on the test set, but predict in this repo gives results on the validation set. How did you calculate results on test-dev and test-std?

airsplay commented 4 years ago

Thanks. The results on test-dev and test-std require using the test servers. The detailed process for each dataset is provided at the end of each section, e.g., https://github.com/airsplay/lxmert#submitted-to-vqa-test-server, https://github.com/airsplay/lxmert#submitted-to-gqa-test-server, https://github.com/airsplay/lxmert#unreleased-test-sets

prajjwal1 commented 4 years ago

Hi, could you please share the attention visualization code (as in the appendix of your paper)? It seems very useful from an interpretability point of view and would really help.

airsplay commented 4 years ago

Currently, I have not found a clean way to fetch the attention maps, so the code is badly organized. I just gathered all the outputs, saved them to TSV files, and visualized them in a notebook. So for now, I do not plan to release it.

prajjwal1 commented 4 years ago

How did you gather the output? If you could point me to the line in your current codebase where you get the output from, that would be really helpful.

airsplay commented 4 years ago

My way is simple but not elegant: I create a global list and append the outputs to it. The list is cleared before each forward pass and logged after the forward.
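A rough sketch of that pattern with a toy attention module (none of the names below come from the LXMERT codebase): a module-level list is cleared before the forward pass, the attention layer appends its attention matrix to it, and the list can then be dumped for visualization.

```python
import torch
import torch.nn as nn

ATTENTION_LOG = []          # global buffer for attention maps

class LoggingAttention(nn.Module):
    """Toy single-head attention that records its attention matrix."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, x):
        att = torch.softmax(self.q(x) @ self.k(x).transpose(-1, -2) / x.size(-1) ** 0.5, dim=-1)
        ATTENTION_LOG.append(att.detach().cpu())   # stash a copy for later visualization
        return att @ self.v(x)

layer = LoggingAttention(64)
x = torch.randn(2, 20, 64)                         # (batch, tokens, dim)
ATTENTION_LOG.clear()                              # clear before the forward pass
_ = layer(x)
print(len(ATTENTION_LOG), ATTENTION_LOG[0].shape)  # 1 torch.Size([2, 20, 20])
```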