Shi-Qi-Li / DBDNet

[KBS] DBDNet: Partial-to-Partial Point Cloud Registration with Dual Branches Decoupling
MIT License

Seeking Your Help - The Second Stage of Training Failed #3

Open WYQ0374 opened 1 week ago

WYQ0374 commented 1 week ago

Your work is very creative, and I would like to run your open-source code. During the second stage of training (training the registration model with python train.py --config config/modelnet40.yaml), the following error occurred. I tried to fix it but could not. Could you please help?

Epoch [1/300]: 4%|███▋ | 25/639 [00:22<09:18, 1.10it/s, loss=3.42]
Traceback (most recent call last):
  File "/home/wang/A1WYQ/DBDNet_WYQ/DBDNet_paper0918/train.py", line 181, in <module>
    main()
  File "/home/wang/A1WYQ/DBDNet_WYQ/DBDNet_paper0918/train.py", line 162, in main
    train_results = train_step(train_loader, model, optimizer, loss_func, epoch + 1, cfg.epoch, writer, train_vis_items)
  File "/home/wang/A1WYQ/DBDNet_WYQ/DBDNet_paper0918/train.py", line 58, in train_step
    loss["loss"].backward()
  File "/home/wang/miniconda3/envs/dbd4/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/wang/miniconda3/envs/dbd4/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'LinalgSvdBackward0' returned nan values in its 0th output.

Shi-Qi-Li commented 4 days ago

Hi @WYQ0374, sorry for the late reply. The second training stage is indeed somewhat unstable. You could first raise the learning rate and train for 1 or 2 epochs, then load those weights, restore the learning rate to 1e-4, and continue training.
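
A minimal sketch of that two-phase idea, written as a per-epoch learning-rate switch inside a single run. It is only an illustration, not code from the DBDNet repository: the helper names lr_for_epoch and set_lr are hypothetical, and the warm-up rate of 1e-3 is an assumed value, since the reply says only to "enlarge" the rate. The stand-in object mimics the param_groups list that torch.optim optimizers expose, so the snippet runs without PyTorch.

```python
from types import SimpleNamespace

# Assumed values: the reply names only the final rate (1e-4) and "1 or 2" epochs.
WARMUP_LR = 1e-3       # hypothetical larger rate, not taken from the thread
BASE_LR = 1e-4         # the rate named in the reply
WARMUP_EPOCHS = 2      # upper end of the "1 or 2 epochs" suggestion


def lr_for_epoch(epoch, warmup_epochs=WARMUP_EPOCHS,
                 warmup_lr=WARMUP_LR, base_lr=BASE_LR):
    """Return the learning rate to use for a 0-based epoch index."""
    return warmup_lr if epoch < warmup_epochs else base_lr


def set_lr(optimizer, lr):
    """Overwrite the learning rate of every parameter group."""
    for group in optimizer.param_groups:
        group["lr"] = lr


# Stand-in with the same param_groups layout as a torch.optim optimizer.
optimizer = SimpleNamespace(param_groups=[{"lr": BASE_LR}])

schedule = []
for epoch in range(4):
    set_lr(optimizer, lr_for_epoch(epoch))
    schedule.append(optimizer.param_groups[0]["lr"])
    # the real loop would call the training step for this epoch here

print(schedule)  # [0.001, 0.001, 0.0001, 0.0001]
```

Switching the rate in place keeps the optimizer state across the two phases. To follow the reply literally, one would instead save the model weights after the warm-up epochs and start a fresh run at 1e-4 from that checkpoint, which also resets the optimizer state.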