bishwarup307 / retinanet-lightning

RetinaNet implementation in PyTorch Lightning

IndexError: too many indices for tensor of dimension 0 #3

Open G-UX opened 3 years ago

G-UX commented 3 years ago

Hi, after training with a custom dataset, in the evaluation stage I got this error:

Traceback (most recent call last): File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 636, in run_train self.train_loop.run_training_epoch() File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 578, in run_training_epoch self.trainer.run_evaluation(on_epoch=True) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in run_evaluation deprecated_eval_results = self.evaluation_loop.evaluation_epoch_end() File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 187, in evaluation_epoch_end deprecated_results = self.run_eval_epoch_end(self.num_dataloaders) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 225, in run_eval_epoch_end eval_results = model.validation_epoch_end(eval_results) File "/media/g-ux/Data/ComputerVision/Pytorch/RetinaNet/ForkONNX/retinanet-lightning/retinanet/models.py", line 320, in validation_epoch_end ~torch.isnan(avg_reg_loss) IndexError: too many indices for tensor of dimension 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "train.py", line 77, in main() File "train.py", line 68, in main trainer.fit(model, dm) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 498, in fit self.dispatch() File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in dispatch self.accelerator.start_training(self) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training self.training_type_plugin.start_training(trainer) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training self._results = trainer.run_train() File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 668, in run_train self.train_loop.on_train_end() File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 134, in on_train_end self.check_checkpoint_callback(should_update=True, is_last=True) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 164, in check_checkpoint_callback cb.on_validation_end(self.trainer, model) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 212, in on_validation_end self.save_checkpoint(trainer, pl_module) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 247, in save_checkpoint self._validate_monitor_key(trainer) File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 495, in _validate_monitor_key raise MisconfigurationException(m) pytorch_lightning.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='COCO_eval/mAP@0.5:0.95:0.05') not found in the returned metrics: ['train/cls_loss', 'train/reg_loss']. HINT: Did you call self.log('COCO_eval/mAP@0.5:0.95:0.05', tensor) in the LightningModule?


I'm running on Ubuntu 18.04 with CUDA 11.2, and the dependencies are installed as listed in the requirements.txt file.
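The second exception, in turn, looks like a downstream effect: because validation_epoch_end crashed, the COCO metric was never logged, so ModelCheckpoint could not find its monitored key. In PyTorch Lightning of that era, ModelCheckpoint can only monitor a key that the LightningModule actually logs; a hedged sketch of that requirement (class name and value are placeholders, not the repository's code):

import torch
import pytorch_lightning as pl

class LitRetinaNet(pl.LightningModule):  # placeholder name for illustration
    def validation_epoch_end(self, outputs):
        # ModelCheckpoint(monitor="COCO_eval/mAP@0.5:0.95:0.05") only works if the
        # same key is logged somewhere, e.g. here after running the COCO evaluation.
        mAP = torch.tensor(0.0)  # placeholder for the real evaluation result
        self.log("COCO_eval/mAP@0.5:0.95:0.05", mAP)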

bishwarup307 commented 3 years ago

Hi,

Thanks for reporting the issue. Could you please provide a Colab notebook that reproduces it?

G-UX commented 3 years ago

Sent by email. Thanks for your quick response!

bishwarup307 commented 3 years ago

Could you please share the config.yaml file you are using? Also, try turning off mixed precision with amp=false and see if the problem persists.

G-UX commented 3 years ago

Dataset:
  dataset: coco
  root: "/home/g-ux/COCO/"
  train_name: "train"
  val_name: "val"
  test_name: test
  image_size: [512, 512, 3]
  nsr: null
Model:
  backbone:
    name: "resnet_50"
    pretrained: True
    freeze_bn: True
  anchors:
    scales: [1, 1.2599210498948732, 1.5874010519681994]
    ratios: [0.5, 1, 2]
    sizes: [32, 64, 128, 256, 512]
    strides: [8, 16, 32, 64, 128]
    prior_mean: null
    prior_std: null
  FPN:
    pyramid_levels: [3, 4, 5, 6, 7]
    channels: 256
    upsample: "nearest"
  head:
    classification:
      num_classes: 1
      n_repeat: 4
      use_bn: False
      activation: 'relu'
      loss:
        name: "focalloss"
        params:
          alpha: 0.25
          gamma: 2.0
      bias_prior: 0.01
    regression:
      n_repeat: 4
      use_bn: False
      activation: 'relu'
      loss:
        name: "smooth_l1_loss"
        params:
          beta: 0.1
Trainer:
  logdir: "/home/g-ux/COCO/Log/"
  num_epochs: 10
  batch_size:
    train: 8
    val: 8
    test: 8
  optimizer:
    name: "torch.optim.Adam"
    params:
      betas: [0.9, 0.999]
      weight_decay: 1e-6
  scheduler:
    name: "torch.optim.lr_scheduler.OneCycleLR"
    params:
      max_lr: 1e-5
      anneal_strategy: 'cos'
      pct_start: 1e-5
      div_factor: 10
      final_div_factor: 1e2
  tpus: 0
  gpus: 1
  dist_backend: 'ddp'
  workers: 8
  clip_grad_norm: 0.1
  amp: False
  amp_backend: "native"
  num_sanity_val_steps: 0
  callbacks:
    checkpoint:
      enabled: True
      save_top_k: 3
      monitor: 'COCO_eval/mAP@0.5:0.95:0.05'
      mode: 'max'
      verbose: False
    early_stopping:
      enabled: True
      monitor: 'COCO_eval/mAP@0.5:0.95:0.05'
      mode: 'max'
      patience: 5
      verbose: False
    lr_monitor:
      enabled: True
      logging_interval: null
  save_val_predictions: True
  save_test_predictions: True

/----------------------------------------------------------------------------------/

Tried with amp=false on the 'augs' branch and the problem persists. Maybe it has to do with the versions I'm using? Here's my configuration:

bishwarup307 commented 3 years ago

Hi, I tried to reproduce the issue on my side but was unable to. I tried a few other datasets as well, and everything runs smoothly.

Would it be possible for you to share the training/val dataset you are using? My guess is that the problem is related to that particular dataset.

G-UX commented 3 years ago

Sadly I can't share the dataset, but could you list your pip dependencies as I did? Maybe it has to do with that. As for the dataset, it contains small objects that I need to detect. Is it possible that the bounding boxes are too small? The images are 900x600, and each one contains more than one object.
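One quick way to sanity-check whether the boxes are smaller than the smallest anchor size in the config above (32 px, at stride 8) is a sketch like the following; the annotation path is a placeholder and should be adjusted to the actual dataset layout:

import json

ANNOTATION_FILE = "/home/g-ux/COCO/annotations/instances_train.json"  # placeholder path
SMALLEST_ANCHOR = 32  # smallest anchor size from the config above

with open(ANNOTATION_FILE) as f:
    coco = json.load(f)

# COCO bbox format is [x, y, width, height]
tiny = [a for a in coco["annotations"] if min(a["bbox"][2], a["bbox"][3]) < SMALLEST_ANCHOR]
print(f"{len(tiny)} of {len(coco['annotations'])} boxes have a side shorter than {SMALLEST_ANCHOR}px")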

Egorundel commented 9 months ago

@G-UX you can try changing .../retinanet-lightning/retinanet/models.py

line 319:

change this:

avg_reg_loss = avg_cls_loss[~torch.isnan(avg_reg_loss)].mean()  # a batch with no annotation will likely result in nan reg_loss

to this:

avg_reg_loss = avg_reg_loss[~torch.isnan(avg_reg_loss)].mean()  # a batch with no annotation will likely result in nan reg_loss
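A slightly more defensive variant (a sketch, not the repository's exact code) also covers the case where every validation batch lacks annotations, so the filtered tensor is empty and .mean() would still return NaN:

import torch

def nan_safe_mean(losses: torch.Tensor) -> torch.Tensor:
    # Mean over the non-NaN entries; falls back to 0 if every entry is NaN.
    valid = losses[~torch.isnan(losses)]
    return valid.mean() if valid.numel() > 0 else torch.zeros((), device=losses.device)

# assuming avg_reg_loss is a 1-D tensor of per-batch regression losses:
# avg_reg_loss = nan_safe_mean(avg_reg_loss)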