airsplay / lxmert

PyTorch code for EMNLP 2019 paper "LXMERT: Learning Cross-Modality Encoder Representations from Transformers".
MIT License

Pre-training doesn't work #36

Closed: claudiogreco closed this issue 4 years ago

claudiogreco commented 4 years ago

Hello,

I am trying to run the pre-training of the model again. When I run the command: bash run/lxmert_pretrain.bash 1,2 --multiGPU --tiny

I get the following output:

Load 174866 data from mscoco_train,mscoco_nominival,vgnococo
Load an answer table of size 9500.
Start to load Faster-RCNN detected objects from data/mscoco_imgfeat/train2014_obj36.tsv
Loaded 500 images in file data/mscoco_imgfeat/train2014_obj36.tsv in 2 seconds.
Start to load Faster-RCNN detected objects from data/mscoco_imgfeat/val2014_obj36.tsv
Loaded 500 images in file data/mscoco_imgfeat/val2014_obj36.tsv in 2 seconds.
Start to load Faster-RCNN detected objects from data/vg_gqa_imgfeat/vg_gqa_obj36.tsv
Loaded 500 images in file data/vg_gqa_imgfeat/vg_gqa_obj36.tsv in 2 seconds.
Use 33226 data in torch dataset

Load 5000 data from mscoco_minival
Load an answer table of size 9500.
Start to load Faster-RCNN detected objects from data/mscoco_imgfeat/val2014_obj36.tsv
Loaded 500 images in file data/mscoco_imgfeat/val2014_obj36.tsv in 2 seconds.
Use 20707 data in torch dataset

LXRT encoder with 9 l_layers, 5 x_layers, and 5 r_layers.
Train from Scratch: re-initialize all BERT weights.
Batch per epoch: 129
Total Iters: 2580
Warm up Iters: 129
  0%|                                                                                                                                | 0/129 [00:00<?, ?it/s]/mnt/8tera/claudio.greco/bert_foil/lxmert/venv_lxmert/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '

and nothing else happens.

I guess I should see a progress bar or some intermediate information, right? Do you know how I could try to fix this issue?

Thanks, Claudio
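(For context: the UserWarning above comes from torch.nn.DataParallel. When each GPU replica returns a scalar, 0-dimensional loss, the gather step unsqueezes the scalars into a vector and warns; the warning itself is benign and is not the cause of the hang. A minimal sketch that reproduces it on a machine with two GPUs; the toy module below is illustrative, not LXMERT's actual model:)

import torch
import torch.nn as nn

class ToyLoss(nn.Module):
    """Toy module that returns a scalar (0-dim) loss, like a training forward pass."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 1)

    def forward(self, x):
        return self.linear(x).mean()  # 0-dim tensor per replica

model = nn.DataParallel(ToyLoss().cuda(), device_ids=[0, 1])
losses = model(torch.randn(16, 8).cuda())  # triggers the "gather ... scalars" warning
loss = losses.mean()  # reduce the per-replica vector back to a single scalar
print(loss.item())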

airsplay commented 4 years ago

Thanks for your question. I just tested the command provided on the homepage once more; the log (screenshot attached) shows the command running successfully. It might take a few seconds to set up the multi-GPU environment. Please also halve the batch_size, since only 2 GPUs are used (though I don't think that is the cause).

Could you also provide your GPU/CUDA/NVCC versions if the problem persists?
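(If useful, everything asked for here can be collected with a short script; a sketch assuming nvidia-smi and nvcc are on the PATH:)

import subprocess
import torch

print("PyTorch:", torch.__version__)
print("Built against CUDA:", torch.version.cuda)
print("GPUs:", [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
# Driver and toolkit versions from the system tools.
print("Driver:", subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]).decode().strip())
print(subprocess.check_output(["nvcc", "--version"]).decode())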

claudiogreco commented 4 years ago

Thanks for your answer. After executing the command, I waited for about an hour but never saw a progress bar. The server has three GPUs: [screenshot]

However, I ran the command to use only GPUs 1 and 2. The CUDA version is 8.0.

airsplay commented 4 years ago

Thanks. May I ask which version of the PyTorch library you are using? PyTorch > 1.0.0 is compiled against CUDA > 8.0 on PyPI, so I am wondering whether the two libraries are compatible.
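(This check is easy to run: torch.version.cuda reports the CUDA version the installed wheel was built against, and the machine's NVIDIA driver must be new enough to support it. A sketch:)

import torch

print("PyTorch:", torch.__version__)      # e.g. 1.3.1
print("Wheel CUDA:", torch.version.cuda)  # 9.x/10.x for post-1.0.0 PyPI wheels
print("CUDA available:", torch.cuda.is_available())
# If the driver is older than the wheel's CUDA version requires,
# is_available() is typically False or CUDA calls fail/hang at runtime.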

claudiogreco commented 4 years ago

I am using PyTorch 1.3.1. Good point. I will talk with the server administrators and see whether updating CUDA solves the problem. Thanks.

claudiogreco commented 4 years ago

P.S.: I noticed that the code works if I use only one GPU. As you suggested, it is probably an issue with my server's drivers. I will check with the administrators. Thanks for your help!
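(After a CUDA/driver update, a quick way to confirm that multi-GPU communication works at all, independently of LXMERT, is a cross-device copy; a sketch assuming at least two visible GPUs:)

import torch

assert torch.cuda.device_count() >= 2, "needs at least two visible GPUs"
x = torch.ones(4, device="cuda:0")
y = x.to("cuda:1")       # a peer-to-peer (or staged) copy between GPUs
torch.cuda.synchronize() # hangs/fails here if the multi-GPU setup is broken
print("cross-GPU copy OK:", bool((y == 1).all()))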