YiLunLee / missing_aware_prompts

Multimodal Prompting with Missing Modalities for Visual Recognition, CVPR'23
https://yilunlee.github.io/missing_aware_prompts/

integer division or modulo by zero #11

Closed yan9qu closed 1 year ago

yan9qu commented 1 year ago

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run.py with data_root=/data1/yq/004_intention/missing_aware_prompts/datasets/mmimdb num_gpus=8 num_nodes=1 per_gpu_batchsize=64 task_finetune_mmimdb load_path=/data1/yq/004_intention/missing_aware_prompts/vilt/models/vilt_200k_mlm_itm.ckpt exp_name='test1'

ERROR - ViLT - Failed after 0:00:21!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/data1/yq/004_intention/missing_aware_prompts/run.py", line 75, in main
    trainer.fit(model, datamodule=dm)
  File "/home/yangqu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
    results = self.accelerator_backend.train()
  File "/home/yangqu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/yangqu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 286, in ddp_train
    self.setup_optimizers(model)
  File "/home/yangqu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 145, in setup_optimizers
    optimizers, lr_schedulers, optimizer_frequencies = self.trainer.init_optimizers(model)
  File "/home/yangqu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/optimizers.py", line 31, in init_optimizers
    optim_conf = model.configure_optimizers()
  File "/data1/yq/004_intention/missing_aware_prompts/vilt/modules/vilt_missing_aware_prompt_module.py", line 366, in configure_optimizers
    return vilt_utils.set_schedule(self)
  File "/data1/yq/004_intention/missing_aware_prompts/vilt/modules/vilt_utils.py", line 323, in set_schedule
    // pl_module.trainer.accumulate_grad_batches
ZeroDivisionError: integer division or modulo by zero

YiLunLee commented 1 year ago

Thank you for your question.

In this work, I use gradient accumulation to enable training with a large effective batch size. By default, the batch size is set to 256. In your case, you have 8 GPU cards with per_gpu_batchsize=64, giving 512 samples per step in total. This exceeds the configured batch size, so the accumulation factor (computed by integer division of batch_size by the per-step sample count) rounds down to zero, which triggers the ZeroDivisionError. To fix this, you can increase batch_size, or reduce per_gpu_batchsize, num_gpus, or num_nodes.
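The arithmetic behind the error can be sketched as follows (the parameter names mirror those in the run command; the exact formula inside vilt_utils.set_schedule may differ in detail):

```python
def accumulation_steps(batch_size, per_gpu_batchsize, num_gpus, num_nodes):
    """Micro-batches to accumulate before an optimizer step, so that the
    effective batch reaches batch_size. Integer division rounds toward zero,
    so if one step already exceeds batch_size the result is 0."""
    samples_per_step = per_gpu_batchsize * num_gpus * num_nodes
    return batch_size // samples_per_step

# Reported setup: 8 GPUs x 64 samples x 1 node = 512 per step, target 256.
print(accumulation_steps(256, 64, 8, 1))  # 0 -> later division by this raises ZeroDivisionError

# One working alternative: halve per_gpu_batchsize so 8 x 32 = 256 <= 256.
print(accumulation_steps(256, 32, 8, 1))  # 1
```

Any combination where per_gpu_batchsize × num_gpus × num_nodes is no larger than batch_size yields a positive accumulation factor and avoids the crash.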