bishwarup307 / retinanet-lightning

Retinanet implementation in pytorch lightning

IndexError: too many indices for tensor of dimension 0 #3

Open G-UX opened 3 years ago

G-UX commented 3 years ago

Hi, after training with a custom dataset, I got this error in the evaluation stage:

Traceback (most recent call last):
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 636, in run_train
    self.train_loop.run_training_epoch()
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 578, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in run_evaluation
    deprecated_eval_results = self.evaluation_loop.evaluation_epoch_end()
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 187, in evaluation_epoch_end
    deprecated_results = self.run_eval_epoch_end(self.num_dataloaders)
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 225, in run_eval_epoch_end
    eval_results = model.validation_epoch_end(eval_results)
  File "/media/g-ux/Data/ComputerVision/Pytorch/RetinaNet/ForkONNX/retinanet-lightning/retinanet/models.py", line 320, in validation_epoch_end
    ~torch.isnan(avg_reg_loss)
IndexError: too many indices for tensor of dimension 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "train.py", line 77, in main() File "train.py", line 68, in main trainer.fit(model, dm) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 498, in fit self.dispatch() File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in dispatch self.accelerator.start_training(self) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training self.training_type_plugin.start_training(trainer) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training self._results = trainer.run_train() File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 668, in run_train self.train_loop.on_train_end() File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 134, in on_train_end self.check_checkpoint_callback(should_update=True, is_last=True) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 164, in check_checkpoint_callback cb.on_validation_end(self.trainer, model) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 212, in on_validation_end self.save_checkpoint(trainer, pl_module) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 247, in save_checkpoint self._validate_monitor_key(trainer) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 495, in _validate_monitor_key raise MisconfigurationException(m) pytorch_lightning.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='COCO_eval/mAP@0.5:0.95:0.05') not found in the returned metrics: ['train/cls_loss', 'train/reg_loss']. HINT: Did you call self.log('COCO_eval/mAP@0.5:0.95:0.05', tensor) in the LightningModule?


I'm running under Ubuntu 18.04, CUDA 11.2, with the dependencies listed in the requirements.txt file.
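
For context, the IndexError itself is easy to reproduce in isolation whenever the tensor being indexed is 0-dimensional (a minimal sketch with made-up values, not the repo's actual code):

```python
import torch

# made-up shapes, just to show where the message in the traceback comes from:
# indexing a 0-dim (scalar) tensor with a boolean mask raises this IndexError
scalar_loss = torch.tensor([0.4, 0.6]).mean()        # 0-dim tensor after .mean()
per_batch_loss = torch.tensor([0.5, float("nan")])   # 1-dim per-batch losses
scalar_loss[~torch.isnan(per_batch_loss)]            # IndexError: too many indices for tensor of dimension 0
```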

bishwarup307 commented 3 years ago

Hi,

Thanks for reporting the issue. Could you please provide a Colab notebook that reproduces it?

G-UX commented 3 years ago

Sent by email. Thanks for your quick response!

bishwarup307 commented 3 years ago

Could you please share the config.yaml file you are using? Also, try turning off mixed precision with amp=false and see if the problem persists.

G-UX commented 3 years ago

Dataset:
  dataset: coco
  root: "/home/g-ux/COCO/"
  train_name: "train"
  val_name: "val"
  test_name: test
  image_size: [512, 512, 3]
  nsr: null

Model:
  backbone:
    name: "resnet_50"
    pretrained: True
    freeze_bn: True
  anchors:
    scales: [1, 1.2599210498948732, 1.5874010519681994]
    ratios: [0.5, 1, 2]
    sizes: [32, 64, 128, 256, 512]
    strides: [8, 16, 32, 64, 128]
    prior_mean: null
    prior_std: null
  FPN:
    pyramid_levels: [3, 4, 5, 6, 7]
    channels: 256
    upsample: "nearest"
  head:
    classification:
      num_classes: 1
      n_repeat: 4
      use_bn: False
      activation: 'relu'
      loss:
        name: "focalloss"
        params:
          alpha: 0.25
          gamma: 2.0
      bias_prior: 0.01
    regression:
      n_repeat: 4
      use_bn: False
      activation: 'relu'
      loss:
        name: "smooth_l1_loss"
        params:
          beta: 0.1

Trainer:
  logdir: "/home/g-ux/COCO/Log/"
  num_epochs: 10
  batch_size:
    train: 8
    val: 8
    test: 8
  optimizer:
    name: "torch.optim.Adam"
    params:
      betas: [0.9, 0.999]
      weight_decay: 1e-6
  scheduler:
    name: "torch.optim.lr_scheduler.OneCycleLR"
    params:
      max_lr: 1e-5
      anneal_strategy: 'cos'
      pct_start: 1e-5
      div_factor: 10
      final_div_factor: 1e2
  tpus: 0
  gpus: 1
  dist_backend: 'ddp'
  workers: 8
  clip_grad_norm: 0.1
  amp: False
  amp_backend: "native"
  num_sanity_val_steps: 0
  callbacks:
    checkpoint:
      enabled: True
      save_top_k: 3
      monitor: 'COCO_eval/mAP@0.5:0.95:0.05'
      mode: 'max'
      verbose: False
    early_stopping:
      enabled: True
      monitor: 'COCO_eval/mAP@0.5:0.95:0.05'
      mode: 'max'
      patience: 5
      verbose: False
    lr_monitor:
      enabled: True
      logging_interval: null
  save_val_predictions: True
  save_test_predictions: True
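
For what it's worth, I also read the file back with PyYAML to double-check the key that ModelCheckpoint monitors (just a quick sketch assuming the nesting above; the repo's own config loader may parse it differently):

```python
import yaml  # PyYAML

# quick sanity check of the config keys the error message refers to
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["Trainer"]["callbacks"]["checkpoint"]["monitor"])  # COCO_eval/mAP@0.5:0.95:0.05
print(cfg["Trainer"]["amp"])                                 # False
```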

----------------------------------------------------------------------------------

Tried with amp: False in the 'augs' branch and the problem persists. Maybe it has to do with the versions I'm using? Here's my configuration:

bishwarup307 commented 3 years ago

Hi, I tried to reproduce the issue on my side but was unable to. I tried a few other datasets as well and everything runs smoothly.

Is it possible for you to share the training/val dataset you are using with me? My guess is that the issue is related to the particular dataset.

G-UX commented 3 years ago

Sadly I can't share the dataset, but could you list your pip dependencies as I did? Maybe the issue has to do with the package versions. About the dataset: it contains small objects that I need to detect. Is it possible that the bounding boxes are too small? The images are 900x600 and each one contains more than one object.
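
To put a number on the "too small" question (back-of-the-envelope only, assuming the images are simply resized to the configured 512x512; the actual preprocessing may pad or letterbox instead):

```python
# rough scale of a small box after resizing a 900x600 image to 512x512
orig_w, orig_h = 900, 600
target_w, target_h = 512, 512
scale_w, scale_h = target_w / orig_w, target_h / orig_h   # ~0.57, ~0.85

box_w, box_h = 20, 20                                     # hypothetical small object
print(box_w * scale_w, box_h * scale_h)                   # ~11.4 x 17.1 px after resize
# the smallest anchor size in the config is 32 px, so boxes this small would
# overlap anchors only weakly under a typical 0.5 IoU matching threshold
```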

Egorundel commented 10 months ago

@G-UX you can try changing .../retinanet-lightning/retinanet/models.py

at line 319, change this:

avg_reg_loss = avg_cls_loss[~torch.isnan(avg_reg_loss)].mean()  # a batch with no annotation will likely result in nan reg_loss

to this:

avg_reg_loss = avg_reg_loss[~torch.isnan(avg_reg_loss)].mean()  # a batch with no annotation will likely result in nan reg_loss
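
If you also want to guard against avg_reg_loss itself ending up 0-dimensional, here is a slightly more defensive sketch of the same reduction (not the repo's code; it assumes a PyTorch version that has torch.atleast_1d):

```python
import torch

def reduce_reg_loss(avg_reg_loss: torch.Tensor) -> torch.Tensor:
    # promote to at least 1-dim so a 0-dim input cannot raise the IndexError,
    # then drop NaN entries (a batch with no annotations can give a NaN reg loss)
    avg_reg_loss = torch.atleast_1d(avg_reg_loss)
    return avg_reg_loss[~torch.isnan(avg_reg_loss)].mean()

print(reduce_reg_loss(torch.tensor([0.3, float("nan"), 0.5])))  # tensor(0.4000)
print(reduce_reg_loss(torch.tensor(0.7)))                       # tensor(0.7000)
```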