An error occurred and looking forward to your help

Shi-Qi-Li / DBDNet

[KBS] DBDNet: Partial-to-Partial Point Cloud Registration with Dual Branches Decoupling

MIT License

An error occurred and looking forward to your help #1

Closed. WYQ0374 closed this issue 4 months ago.

WYQ0374 commented 4 months ago

Hello, your work is very innovative and I am very interested in it. However, when I tried to run your publicly available code, an error occurred. I hope you can help. Thank you.

Error location: In the second stage of training (train the registration model), running python train.py --config modelnet40.yaml raises: FileNotFoundError: [Errno 2] No such file or directory: 'ckpt/modelnet40-overlap-noise.py'

Attempted fix: I changed the overlap checkpoint path in DBDNet_paper/config/modelnet40.yaml from ckpt/modelnet40-overlap-noise.py to DBDNet_paper/exp/modelnet40_overlap noise/2024-06-26-22:41:27/checkpoints/epoch_100.pth, but training still failed. The error is raised at loss["loss"].backward(): RuntimeError: Function 'LinalgSvdBackward0' returned nan values in its 0th output.

Looking forward to your early reply.

Shi-Qi-Li commented 4 months ago

Hi @WYQ0374, thanks for your interest. There was a typo in the extension of the checkpoint path: it should be .pth rather than .py. I have fixed this in a new commit; thank you for pointing it out. The second-stage training is indeed a little unstable. You could try slightly adjusting the learning rate in your experiment; I do not think it will make a large difference.

WYQ0374 commented 4 months ago

Thank you for your prompt reply; I greatly appreciate it. My remaining question is that the first stage of training did not generate the file ckpt/modelnet40-overlap-noise.pth, so I cannot proceed with the second stage of training. How is this file generated? Thank you for your help.

Shi-Qi-Li commented 4 months ago

In fact, the first stage does not generate that file. I simply renamed the best checkpoint from epoch-xx.pth to modelnet40-overlap-noise.pth for easier management in the release. So your initial attempt, changing the config file to point at your own checkpoint, was correct.
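
The renaming step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name publish_checkpoint, the temporary directory layout, and the epoch_100.pth file name are hypothetical stand-ins, not part of the DBDNet repository. The alternative, as noted, is to leave the file where it is and edit the path in the config.

```python
import shutil
import tempfile
from pathlib import Path


def publish_checkpoint(src: Path, dst: Path) -> Path:
    """Copy a stage-1 checkpoint to the file name the stage-2 config expects."""
    if not src.is_file():
        raise FileNotFoundError(f"stage-1 checkpoint not found: {src}")
    # Create the destination directory (e.g. ckpt/) if it does not exist yet.
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return dst


# Demo in a throwaway directory so nothing in a real project is touched.
root = Path(tempfile.mkdtemp())
best = root / "exp" / "checkpoints" / "epoch_100.pth"  # hypothetical best epoch
best.parent.mkdir(parents=True)
best.write_bytes(b"placeholder")  # stands in for real checkpoint contents

target = publish_checkpoint(best, root / "ckpt" / "modelnet40-overlap-noise.pth")
print(target)
```

Copying rather than moving keeps the original epoch file in the experiment directory, so the run remains reproducible.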

WYQ0374 commented 4 months ago

Thank you for your help, and I wish you all the best in your research.

WYQ0374 commented 4 months ago

Hello, I have another question.

During the second stage of training, the following error always occurs:

File "/home/wangyongqiang/miniconda3/envs/dbd2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'LinalgSvdBackward0' returned nan values in its 0th output.

Attempted fix: After commenting out torch.autograd.set_detect_anomaly(True) in train.py, the following error occurs instead:

U, S, V = torch.svd(H, some=False)
torch._C._LinAlgError: cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling cusolverDnSgesvd( handle, jobu, jobvt, m, n, A, lda, S, U, ldu, VT, ldvt, work, lwork, rwork, info). This error may appear if the input matrix contains NaN.

I would appreciate your help. Thank you.

Shi-Qi-Li commented 4 months ago

I think it would help to first check whether the tensors for the source and target point clouds contain NaN values after overlap selection.
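
A minimal sketch of such a check, assuming PyTorch tensors of shape (B, N, 3). The helper report_non_finite and the tensors src and tgt are hypothetical stand-ins for whatever the training loop holds after overlap selection; they are not names from the DBDNet code.

```python
import torch


def report_non_finite(name, tensor):
    """Return True if `tensor` holds any NaN or Inf value, printing a short summary."""
    bad = ~torch.isfinite(tensor)
    if bad.any():
        print(f"{name}: {int(bad.sum())} non-finite values out of {tensor.numel()}")
        return True
    return False


# Hypothetical stand-ins for the overlap-selected source and target clouds.
src = torch.rand(2, 128, 3)
tgt = torch.rand(2, 128, 3)
tgt[0, 5, 1] = float("nan")  # inject one NaN so the check visibly fires

src_bad = report_non_finite("src", src)
tgt_bad = report_non_finite("tgt", tgt)
```

Calling the helper just before the SVD step, and again on the matrix passed to it, should narrow down where the first NaN appears.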