Training gets stuck on line 74 'log_dict_train, _ = trainer.train(epoch, train_loader)'

Banconxuan / RTM3D

The official PyTorch Implementation of RTM3D and KM3D for Monocular 3D Object Detection

MIT License

Training gets stuck on line 74 'log_dict_train, _ = trainer.train(epoch, train_loader)' #29

Open mertmerci opened 3 years ago

mertmerci commented 3 years ago

Hello,

I am trying to run main.py for training, but training gets stuck at line 74, log_dict_train, _ = trainer.train(epoch, train_loader). The strange thing is that when I check the GPU utilization, the GPUs are still at 100% use. Also, when I debug, I see that the losses are calculated, but they are printed neither to the console nor to the logger file.
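One way to see where a call like this is spending its time is to have Python dump the stack of every thread at a fixed interval while the call runs. The sketch below uses the standard library's faulthandler module; the helper name hang_watchdog and the 300-second interval are illustrative assumptions, not part of the RTM3D code.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(timeout_s=300, file=None):
    """Dump all thread stack traces every `timeout_s` seconds while the block runs.

    `hang_watchdog` is a hypothetical helper name; it is not part of RTM3D.
    `file` must be a real file object (it needs a file descriptor).
    """
    out = file if file is not None else sys.stderr
    # repeat=True keeps dumping at each interval until the timer is cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)
    try:
        yield
    finally:
        # Stop the timer once the wrapped block returns or raises.
        faulthandler.cancel_dump_traceback_later()


# Assumed usage around the call quoted in this issue:
# with hang_watchdog(timeout_s=300):
#     log_dict_train, _ = trainer.train(epoch, train_loader)
```

If the same frame appears in every dump, that is likely where the loop is blocked; if the frames keep changing, the epoch is probably still progressing and only the output is missing.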

I am using CUDA version 11.2; maybe this is the problem, but I do not think so. Do you have any ideas or suggestions for solving this issue?

Thank you in advance.

sparro12 commented 3 years ago

We are also using CUDA 11.2, so that is not the issue. If you send the error, I could possibly be of some help.

mertmerci commented 3 years ago

I do not get a runtime error. However, I cannot obtain the losses or even see the epochs. I attached the output of main.py below. I inserted some print lines for debugging; as can be seen, the code terminates after line 74 without showing any losses or proceeding to further epochs.

[Screenshot 2021-03-15 at 23 40 32]

sparro12 commented 3 years ago

Before going down the rabbit hole, my best guess is that there is a problem with the torch version. I would reinstall torch for that repo. Better yet, since you're not too far into the setup, I would reclone the repo and make sure you select the correct torch version when you set it up. Maybe even redo the Conda environment.
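Before reinstalling, it may help to record which torch build is actually active in the environment and which CUDA version it was compiled for. This is only a sketch: the function name describe_torch_env is made up for illustration, and it reports nothing more than what torch itself exposes.

```python
import importlib


def describe_torch_env():
    """Collect version info relevant to a possible torch/CUDA mismatch.

    `describe_torch_env` is a hypothetical helper, not part of RTM3D.
    """
    info = {"torch_installed": False}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is not importable in this environment at all.
        return info
    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    # CUDA version the wheel was built against; None for CPU-only builds.
    info["built_for_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


for key, value in describe_torch_env().items():
    print(f"{key}: {value}")
```

Posting this output alongside the version that install.md asks for makes it easier for others to tell whether the environment matches.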

One noteworthy thing: DCNv2 failed for us when we ran it, so we had to reclone the DCNv2 repo from the link provided in install.md. The DCNv2 that comes with the KM3D clone is precompiled and uses CUDA 8.0. If you want to run CUDA 11.2, you'll need to reclone just the DCNv2 part and keep it in the same location as the old one, then continue with the rest of the instructions.

mertmerci commented 3 years ago

Thank you for your kind response. I do not think DCNv2 contributes to the problem, because I am trying to run the training without the models that use DCNv2, just basic resnet-18 or dla-34.

I am using Ubuntu 18; maybe this is causing the problem. Ubuntu 16 is now installed on the machine, so I will try running it there and post the result here.