G-UX opened this issue 3 years ago
Hi,
Thanks for the issue. Could you please provide a Colab notebook to reproduce it?
Sent by email. Thanks for your quick response!
Could you please share the config.yaml file you are using? Also, try turning off mixed precision with amp=false and see if the problem persists.
Dataset:
  dataset: coco
  root: "/home/g-ux/COCO/"
  train_name: "train"
  val_name: "val"
  test_name: test
  image_size: [512, 512, 3]
  nsr: null
Model:
  backbone:
    name: "resnet_50"
    pretrained: True
    freeze_bn: True
  anchors:
    scales: [1, 1.2599210498948732, 1.5874010519681994]
    ratios: [0.5, 1, 2]
    sizes: [32, 64, 128, 256, 512]
    strides: [8, 16, 32, 64, 128]
    prior_mean: null
    prior_std: null
  FPN:
    pyramid_levels: [3, 4, 5, 6, 7]
    channels: 256
    upsample: "nearest"
  head:
    classification:
      num_classes: 1
      n_repeat: 4
      use_bn: False
      activation: 'relu'
      loss:
        name: "focalloss"
        params:
          alpha: 0.25
          gamma: 2.0
          bias_prior: 0.01
    regression:
      n_repeat: 4
      use_bn: False
      activation: 'relu'
      loss:
        name: "smooth_l1_loss"
        params:
          beta: 0.1
Trainer:
  logdir: "/home/g-ux/COCO/Log/"
  num_epochs: 10
  batch_size:
    train: 8
    val: 8
    test: 8
  optimizer:
    name: "torch.optim.Adam"
    params:
      betas: [0.9, 0.999]
      weight_decay: 1e-6
  scheduler:
    name: "torch.optim.lr_scheduler.OneCycleLR"
    params:
      max_lr: 1e-5
      anneal_strategy: 'cos'
      pct_start: 1e-5
      div_factor: 10
      final_div_factor: 1e2
  tpus: 0
  gpus: 1
  dist_backend: 'ddp'
  workers: 8
  clip_grad_norm: 0.1
  amp: False
  amp_backend: "native"
  num_sanity_val_steps: 0
  callbacks:
    checkpoint:
      enabled: True
      save_top_k: 3
      monitor: 'COCO_eval/mAP@0.5:0.95:0.05'
      mode: 'max'
      verbose: False
    early_stopping:
      enabled: True
      monitor: 'COCO_eval/mAP@0.5:0.95:0.05'
      mode: 'max'
      patience: 5
      verbose: False
    lr_monitor:
      enabled: True
      logging_interval: null
  save_val_predictions: True
  save_test_predictions: True
Tried with amp=false in the 'augs' branch and the problem persists. Maybe it has to do with the versions I'm using? Here's my configuration:
Hi, I tried to reproduce the issue on my side but was unable to. I tried a few other datasets as well and everything ran smoothly.
Would it be possible for you to share the training/val dataset you are using? My guess is that the issue is related to this particular dataset.
Sadly I can't share the dataset, but could you list your pip dependencies as I did? Maybe it has to do with that. About the dataset: it contains small objects that I need to detect. Is it possible that the bounding boxes are too small? Images are 900x600, and each one contains more than one object.
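To check my own suspicion about the box sizes, here is a rough sketch of what I could run over the annotations (COCO-format JSON assumed; the path is a placeholder, and the 32-pixel threshold is the smallest value in Model.anchors.sizes from my config):

```python
import json

# Placeholder path to a COCO-format annotation file.
with open("annotations/instances_train.json") as f:
    coco = json.load(f)

smallest_anchor = 32  # smallest anchor size in my config
tiny = 0
for ann in coco["annotations"]:
    w, h = ann["bbox"][2], ann["bbox"][3]  # COCO bbox format is [x, y, w, h]
    if max(w, h) < smallest_anchor:
        tiny += 1

# Note: this ignores the resize from 900x600 to 512x512, which shrinks the boxes further.
print(f"{tiny} of {len(coco['annotations'])} boxes are smaller than the smallest anchor")
```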
@G-UX you can try changing .../retinanet-lightning/retinanet/models.py, line 319, from:
avg_reg_loss = avg_cls_loss[~torch.isnan(avg_reg_loss)].mean() # a batch with no annotation will likely result in nan reg_loss
to:
avg_reg_loss = avg_reg_loss[~torch.isnan(avg_reg_loss)].mean() # a batch with no annotation will likely result in nan reg_loss
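For anyone else hitting this, here is a minimal sketch of the nan filtering that line is aiming for, assuming the per-batch validation losses have been stacked into a 1-D tensor first (the variable names are illustrative, not the exact ones in models.py):

```python
import torch

# Per-batch regression losses collected during validation; a batch with
# no annotations can contribute a nan reg_loss.
reg_losses = torch.stack([torch.tensor(1.2), torch.tensor(float("nan")), torch.tensor(0.8)])

finite = reg_losses[~torch.isnan(reg_losses)]  # drop nan entries before averaging
avg_reg_loss = finite.mean() if finite.numel() > 0 else torch.tensor(0.0)
print(avg_reg_loss)  # approximately tensor(1.)
```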
Hi, after training with a custom dataset, I got this error during the evaluation stage:
Traceback (most recent call last):
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 636, in run_train
    self.train_loop.run_training_epoch()
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 578, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in run_evaluation
    deprecated_eval_results = self.evaluation_loop.evaluation_epoch_end()
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 187, in evaluation_epoch_end
    deprecated_results = self.run_eval_epoch_end(self.num_dataloaders)
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 225, in run_eval_epoch_end
    eval_results = model.validation_epoch_end(eval_results)
  File "/media/g-ux/Data/ComputerVision/Pytorch/RetinaNet/ForkONNX/retinanet-lightning/retinanet/models.py", line 320, in validation_epoch_end
    ~torch.isnan(avg_reg_loss)
IndexError: too many indices for tensor of dimension 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "train.py", line 77, in <module>
    main()
  File "train.py", line 68, in main
    trainer.fit(model, dm)
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 498, in fit
    self.dispatch()
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in dispatch
    self.accelerator.start_training(self)
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 668, in run_train
    self.train_loop.on_train_end()
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 134, in on_train_end
    self.check_checkpoint_callback(should_update=True, is_last=True)
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 164, in check_checkpoint_callback
    cb.on_validation_end(self.trainer, model)
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 212, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 247, in save_checkpoint
    self._validate_monitor_key(trainer)
  File "/home/g-ux/.local/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 495, in _validate_monitor_key
    raise MisconfigurationException(m)
pytorch_lightning.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='COCO_eval/mAP@0.5:0.95:0.05') not found in the returned metrics: ['train/cls_loss', 'train/reg_loss']. HINT: Did you call self.log('COCO_eval/mAP@0.5:0.95:0.05', tensor) in the LightningModule?
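If I read the HINT correctly, the monitored metric has to be logged from the LightningModule, and since validation_epoch_end fails with the IndexError above, it never reaches the logging call. A minimal sketch of what I understand Lightning expects (the class name and metric value are placeholders, not the repo's actual COCO evaluation code):

```python
import torch
import pytorch_lightning as pl

class RetinaNetModule(pl.LightningModule):
    def validation_epoch_end(self, outputs):
        # Placeholder: the real code would run the COCO evaluator here.
        map_value = torch.tensor(0.0)
        # Log under the exact key that ModelCheckpoint/EarlyStopping monitor.
        self.log("COCO_eval/mAP@0.5:0.95:0.05", map_value)
```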
I'm running under Ubuntu 18.04 with CUDA 11.2 and the dependencies listed in the requirements.txt file.