TRI-ML / packnet-sfm

TRI-ML Monocular Depth Estimation Repository
https://tri-ml.github.io/packnet-sfm/
MIT License

Getting an error while training during the validation step after 0th epoch! #157

Closed · porwalnaman01 closed this 3 years ago

porwalnaman01 commented 3 years ago

Hello there, loved your work and paper. I am facing an issue during the training process: it trains the model for the first epoch, but during the validation step of that first epoch it throws an error. Hope you can help me here. I am using PyTorch 1.8.0 on Ubuntu 18.04.

```
Epoch 0 | Avg.Loss 0.0849: 100%|###############################################| 5004/5004 [02:07<00:00, 39.24 images/s]
KITTI_tiny-kitti_tiny-velodyne:   0%|          | 0.00/5.00 [00:00<?, ? images/s]
Traceback (most recent call last):
  File "scripts/train.py", line 68, in <module>
    train(args.file)
  File "scripts/train.py", line 63, in train
    trainer.fit(model_wrapper)
  File "/disk1/dan/Naman/packnet-sfm/packnet_sfm/trainers/horovod_trainer.py", line 65, in fit
    validation_output = self.validate(val_dataloaders, module)
  File "/disk1/dan/Naman/packnet-sfm/packnet_sfm/trainers/horovod_trainer.py", line 120, in validate
    output = module.validation_step(batch, i, n)
  File "/disk1/dan/Naman/packnet-sfm/packnet_sfm/models/model_wrapper.py", line 194, in validation_step
    output = self.evaluate_depth(batch)
  File "/disk1/dan/Naman/packnet-sfm/packnet_sfm/models/model_wrapper.py", line 302, in evaluate_depth
    inv_depths[0], inv_depths_flipped[0], method='mean')
  File "/disk1/dan/Naman/packnet-sfm/packnet_sfm/utils/depth.py", line 247, in post_process_inv_depth
    B, C, H, W = inv_depth.shape
ValueError: not enough values to unpack (expected 4, got 3)
```
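For context, the error itself just means that the inverse-depth tensor reaching post_process_inv_depth() has three dimensions instead of the expected four, [B, C, H, W]. A minimal standalone illustration in plain PyTorch (not the repo's code):

```python
import torch

# post_process_inv_depth() unpacks four dimensions from the tensor's shape,
# so a tensor that arrives with only three raises exactly this ValueError.
four_d = torch.rand(1, 1, 192, 640)   # [B, C, H, W] -- what the function expects
B, C, H, W = four_d.shape             # unpacks fine

three_d = torch.rand(1, 192, 640)     # one dimension short
try:
    B, C, H, W = three_d.shape
except ValueError as err:
    print(err)  # not enough values to unpack (expected 4, got 3)
```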

tzanis-anevlavis commented 3 years ago

Hey, did you figure this out by any chance? I also get the same error when trying to overfit KITTI_tiny.

VitorGuizilini-TRI commented 3 years ago

Are you using the latest version of the repository, after the most recent commit?

tzanis-anevlavis commented 3 years ago

Yes, I am working with the latest version and the Docker setup. It seems that both inputs to post_process_inv_depth() have a dimension mismatch.

GaneshAdam commented 3 years ago

Hey, did you guys solve this by any chance? I am facing a similar issue. Any help is appreciated. Thanks!

tzanis-anevlavis commented 3 years ago

It seems that the functions compute_depth_metrics() and post_process_inv_depth(), which are both called in evaluate_depth() (around line 291 of model_wrapper.py), expect their inputs to be of shape [B, C, H, W], but only the first element of the batch is passed. In this example B = 1, but that is not what the functions expect.
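A throwaway check along these lines makes the mismatch visible (the names in the commented-out usage come from the traceback; the helper itself is just a sketch):

```python
def check_bchw(name, tensor):
    """Print a warning if a tensor is not 4D [B, C, H, W] (debugging aid)."""
    shape = tuple(tensor.shape)
    if len(shape) != 4:
        print(f'{name}: shape {shape} -- expected 4 dims [B, C, H, W]')
    return shape

# Drop these lines into evaluate_depth(), just before the call to
# post_process_inv_depth() (names taken from the traceback above):
# check_bchw('inv_depths[0]', inv_depths[0])
# check_bchw('inv_depths_flipped[0]', inv_depths_flipped[0])
```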

So I replaced inv_depths[0] and inv_depths_flipped[0] with inv_depths and inv_depths_flipped at lines 295 and 301 in evaluate_depth(). Now it seems to be working, and I got the following results from overfitting KITTI:

[Screenshot: depth evaluation metrics from the KITTI overfitting run, 2021-06-30]
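For concreteness, the change at one of those call sites looks roughly like this. It is a sketch, not a verified diff: the function name, arguments, and import path follow the traceback, while inv_depth_pp and the surrounding code are an approximation of what evaluate_depth() does.

```python
# packnet_sfm/models/model_wrapper.py, inside evaluate_depth()
# (approximate sketch -- the surrounding code may differ between commits)
from packnet_sfm.utils.depth import post_process_inv_depth

# Before: indexing with [0] hands post_process_inv_depth() a single element,
# so `B, C, H, W = inv_depth.shape` only sees 3 dimensions and fails.
# inv_depth_pp = post_process_inv_depth(
#     inv_depths[0], inv_depths_flipped[0], method='mean')

# After: pass the full tensors so the [B, C, H, W] layout is preserved.
inv_depth_pp = post_process_inv_depth(
    inv_depths, inv_depths_flipped, method='mean')
```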

I did not have a lot of time to dig more into the code and understand if my reasoning above is correct, but maybe @VitorGuizilini-TRI can verify or shed some more light! :)

porwalnaman01 commented 3 years ago

@GaneshAdam Pulling the latest committed version of the code and then replacing the packnet_sfm/datasets/augmentation.py file with its older version worked for me. I don't know why, but without reverting augmentation.py to the older version I still got an error. You may first try using the latest version of the code; if it still gives an error, you can try replacing augmentation.py with its older version. I am using Ubuntu 18.04, CUDA 11.1, and a conda environment.

GaneshAdam commented 3 years ago

@janis10, @porwalnaman01: thank you for the quick response. I tried the solution suggested by @janis10, and it's working now.

stellarpower commented 3 years ago

I am still running into this as of 6e3161f60e7161115813574557761edaffb1b6d1, running the sample command provided in the README (for the KITTI overfit). I have had to modify the build and bump the versions of some dependencies, as our GPU isn't supported by the older version of CUDA (you can see this in my fork here), so I had assumed this was related to breaking changes in PyTorch. However, since this issue has come up again recently, I am guessing it may need to be re-opened.

Happy to provide more details; however, I was trying to help a colleague who couldn't get this to build, so I only have a very high-level idea of what the repo is doing.