dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Apache License 2.0

python run.py with data_root=/data/workspace/dataset num_gpus=4 num_nodes=1 task_finetune_irtr_f30k_randaug per_gpu_batchsize=4 load_path="weights/vilt_200k_mlm_itm.ckpt" #4

Closed raojay7 closed 3 years ago

raojay7 commented 3 years ago

Saving latest checkpoint...
INFO - lightning - Saving latest checkpoint...
ERROR - ViLT - Failed after 1:05:38!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 524, in train
    self.train_loop.run_training_epoch()
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 572, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 704, in run_training_batch
    self.trainer.hiddens)
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 818, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 339, in training_step
    training_step_output = self.trainer.accelerator_backend.training_step(args)
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 158, in training_step
    return self._step(args)
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 170, in _step
    output = self.trainer.model(*args)
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 179, in forward
    output = self.module.training_step(*inputs[0], **kwargs[0])
  File "/data/workspace/ViLT/vilt/modules/vilt_module.py", line 219, in training_step
    vilt_utils.set_task(self)
  File "/data/workspace/ViLT/vilt/modules/vilt_utils.py", line 177, in set_task
    picked = all_gather(current_tasks)
  File "/data/workspace/ViLT/vilt/modules/dist_utils.py", line 165, in all_gather
    size_list, tensor = _pad_to_largest_tensor(tensor, group)
  File "/data/workspace/ViLT/vilt/modules/dist_utils.py", line 129, in _pad_to_largest_tensor
    dist.all_gather(size_list, local_size, group=group)
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1870, in all_gather
    work.wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete

During handling of the above exception, another exception occurred:

Traceback (most recent calls WITHOUT Sacred internals):
  File "/data/workspace/ViLT/run.py", line 72, in main
    trainer.fit(model, datamodule=dm)
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
    results = self.accelerator_backend.train()
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 305, in ddp_train
    results = self.train_or_test()
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 69, in train_or_test
    results = self.trainer.train()
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 555, in train
    self.train_loop.on_train_end()
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 200, in on_train_end
    self.check_checkpoint_callback(should_save=True, is_last=True)
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 234, in check_checkpoint_callback
    callback.on_validation_end(self.trainer, model)
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 203, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 238, in save_checkpoint
    self._validate_monitor_key(trainer)
  File "/root/anaconda3/envs/vilt/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 516, in _validate_monitor_key
    raise MisconfigurationException(m)
pytorch_lightning.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='val/the_metric') not found in the returned metrics: ['irtr/train/irtr_loss', 'itm/train/loss', 'itm/train/wpa_loss', 'itm/train/accuracy']. HINT: Did you call self.log('val/the_metric', tensor) in the LightningModule?

Epoch 0: 0%| | 24/9691 [30:14<202:59:08, 75.59s/it, loss=0.579, v_num=0]

dandelin commented 3 years ago

It seems all_gather did not work correctly in your environment. Please verify that all_gather works properly there.

Since it is a timeout error, it was probably raised because the GPUs are shared by multiple processes (i.e., multiple users). Make sure the GPUs are assigned to a single training script.
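To verify, you could run a minimal all_gather smoke test outside ViLT, something like the sketch below (launched with one process per GPU; the script name is a placeholder):

# A minimal all_gather smoke test (a sketch, not part of ViLT).
# Launch one process per GPU, e.g.:
#   python -m torch.distributed.launch --nproc_per_node=4 test_all_gather.py
import torch
import torch.distributed as dist


def main():
    # torch.distributed.launch exports RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT,
    # so the default env:// init works; gloo is the transport in your traceback.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Every process contributes its rank; each should receive [0, ..., world_size-1].
    local = torch.tensor([rank])
    gathered = [torch.zeros_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)
    print(f"rank {rank}: gathered {[int(t.item()) for t in gathered]}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

If this hangs or times out as well, the problem is in your environment rather than in ViLT.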

raojay7 commented 3 years ago

Thank you for your reply. I hope you can add more detailed running instructions to the README.

dandelin commented 3 years ago

Let me give you a more detailed explanation of the error. The second error, pytorch_lightning.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='val/the_metric') not found in the returned metrics: ['irtr/train/irtr_loss', 'itm/train/loss', 'itm/train/wpa_loss', 'itm/train/accuracy'], is caused by the automatic checkpoint saving of PyTorch Lightning (PL): PL saves the last checkpoint when training exits with an error. In this case, the checkpoint-saving callback failed because there was no val/the_metric, which does not exist until the first validation has run.
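For reference, here is a minimal sketch of the PL pattern involved (not the actual ViLT code; everything except the val/the_metric key is a placeholder):

import pytorch_lightning as pl
import torch


class SketchModule(pl.LightningModule):
    def validation_epoch_end(self, outputs):
        # Until this hook runs once, 'val/the_metric' has never been logged,
        # so a ModelCheckpoint monitoring it cannot find its key.
        self.log("val/the_metric", torch.tensor(0.0))  # placeholder aggregation


checkpoint_callback = pl.callbacks.ModelCheckpoint(
    monitor="val/the_metric",  # the key named in the exception above
    mode="max",
)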

So the main cause of the problem is RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete, which is somehow related to the communication between GPUs. I cannot give you a detailed reason since I do not know your running environment.

As I said in the previous comment, my guess is that one GPU hung for some reason, causing the other GPUs to time out. Check which processes are using the GPUs; it is quite possible that other processes co-occupy them.
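For example, assuming GPUs 0-3 are free on your machine, you can pin them to this training script alone via CUDA_VISIBLE_DEVICES:

CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py with data_root=/data/workspace/dataset num_gpus=4 num_nodes=1 task_finetune_irtr_f30k_randaug per_gpu_batchsize=4 load_path="weights/vilt_200k_mlm_itm.ckpt"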

dandelin commented 3 years ago

@raojay7 I found and fixed a potential bug that may have caused your all_gather error. (https://github.com/dandelin/ViLT/commit/557b117b9663764574d5e198c90bc20de300417e)

Please try again and tell me whether the error still occurs.

raojay7 commented 3 years ago

@dandelin It works, thank you! I am now running the finetune_irtr_f30k_randaug fine-tuning experiment on 4x RTX 3090 GPUs, but I have no idea how long training will last or how max_epoch|max_steps should be configured. I would also like to know how you use the generated .ckpt to analyze the training.
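For example, is an override like the following (assuming max_epoch and max_steps are Sacred config keys exposed by run.py) the intended way to set them?

python run.py with data_root=/data/workspace/dataset num_gpus=4 num_nodes=1 task_finetune_irtr_f30k_randaug per_gpu_batchsize=4 load_path="weights/vilt_200k_mlm_itm.ckpt" max_epoch=10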

dandelin commented 3 years ago

@raojay7

raojay7 commented 3 years ago

@dandelin My apologies. I mean that the checkpoint is saved to a file, but the training log is not displayed in the console output or anywhere else I can find. How can I query training-related parameters or visualizations?

dandelin commented 3 years ago

@raojay7 You can visualize the loss and other metrics with TensorBoard; the event files are saved alongside the checkpoints. ex) tensorboard --logdir=result/finetune_irtr_f30k_randaug_seed0_from_vilt_200k_mlm_itm
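If you also want to inspect the saved hyperparameters programmatically, a PL checkpoint is a plain torch pickle. A sketch (the path below is an example; adjust it to wherever your run saved its checkpoint):

import torch

ckpt = torch.load(
    "result/finetune_irtr_f30k_randaug_seed0_from_vilt_200k_mlm_itm/version_0/checkpoints/last.ckpt",
    map_location="cpu",
)
print(ckpt.keys())                   # typically includes 'state_dict', 'hyper_parameters', ...
print(ckpt.get("hyper_parameters"))  # the config the run was launched with, if it was saved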

raojay7 commented 3 years ago

@dandelin Thank you for your patient response!