Problems with the evaluation of jdacs-ms

LiPan123456789 commented 2 years ago

After my training completed, I get an error when I execute the sh train.sh command for evaluation. I have checked the GPU memory, and that is not the problem. I also checked imp1 and imp2 in the picture below, and their dimensions look correct. I hope to get your reply.

LiPan123456789 commented 2 years ago

Sorry, it should be tmp1 and tmp2, and the test command is sh test.sh.

ToughStoneX commented 2 years ago

Hi, here are several comments:
1. I checked the original code and noticed that the code on line 196 is not the same as yours. Are you using the original JDACS-MS repository?
2. Maybe you can try replacing the function torch.matmul with torch.bmm. You can find it in the following code:

ans = torch.matmul(tmp1, tmp2)
# ans = torch.bmm(tmp1, tmp2)

Comment out the upper line and uncomment the lower one, then run the evaluation code again.
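Before swapping the call, it can help to confirm that the two operands meet the stricter shape rule of torch.bmm: both must be 3-D, share the same batch size, and agree on the inner dimension, whereas torch.matmul also accepts other ranks and broadcasts batch dimensions. The sketch below is plain Python and only checks shapes. The helper name bmm_output_shape and the example shapes are hypothetical and are not taken from the repository.

```python
def bmm_output_shape(shape1, shape2):
    """Return the result shape torch.bmm would produce, or raise ValueError.

    torch.bmm expects (b, n, m) and (b, m, p) and returns (b, n, p).
    """
    if len(shape1) != 3 or len(shape2) != 3:
        raise ValueError(
            f"bmm needs two 3-D tensors, got {len(shape1)}-D and {len(shape2)}-D"
        )
    b1, n, m1 = shape1
    b2, m2, p = shape2
    if b1 != b2:
        raise ValueError(f"batch sizes differ: {b1} vs {b2}")
    if m1 != m2:
        raise ValueError(f"inner dimensions differ: {m1} vs {m2}")
    return (b1, n, p)


# Hypothetical shapes for tmp1 and tmp2, for illustration only.
print(bmm_output_shape((4, 3, 3), (4, 3, 1024)))
```

In a real session the shapes would come from tuple(tmp1.shape) and tuple(tmp2.shape), printed just before the multiplication.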

LiPan123456789 commented 2 years ago

Hi, after checking, I confirm that I did use the original code from GitHub. After making the change you suggested, the following error is reported. The GPU memory is still not exceeded. I look forward to your reply.

LiPan123456789 commented 2 years ago

While trying to solve this problem, I found:
1. When torch.matmul is used, the error is: RuntimeError: CUDA error: invalid configuration argument
2. When torch.bmm is used, the error is: RuntimeError: cuda runtime error (9) : invalid configuration argument at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/generic/THCTensorMathPointwise.cu:73
So I think the problem may be caused by the multiplication. I also checked that the torch version is 1.1.0, which meets the requirements of the readme. I am waiting for your reply.

ToughStoneX commented 2 years ago

Hi, I don't think the current problem is caused by matrix multiplication. From the logs in the figures you provided, when torch.bmm() is used, the error is not on the same line as in the first one you posted. Check the green lines in the following figure (screenshot: Snipaste_2022-03-21_16-39-57). The error is on line 200 and the code is: interval_maps = torch.abs(delta) ....... Hence, the error you posted now is not caused by the mentioned code ans=torch.bmm(tmp1, tmp2).

Here are several comments:
1. Replacing torch.matmul with torch.bmm should solve the first problem. It may be caused by a bug in the older version of PyTorch, whose support for torch.matmul is not good enough, because it cannot automatically handle matrix multiplication with batches. torch.bmm can handle this. You can print the variable ans and check it.
2. Try to split the code on line 200 into several lines. Use print or some other debugging tool to find out which part throws that error.
3. Maybe try upgrading torch to a slightly higher version, such as 1.2.0 or 1.1.x.
4. This part of the code is borrowed from CVP-MVSNet. You can ask its author for suggestions as well.
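The second suggestion, splitting a chained expression into separate steps to see which one fails, can be sketched generically as follows. The steps here are plain-Python stand-ins chosen only so the example runs anywhere. They are not the real expression on line 200, which is elided above, and the helper name run_steps is hypothetical.

```python
def run_steps(value, steps):
    """Apply each (name, fn) pair in order and report the first one that fails.

    Returns (last_good_value, failed_step_name); the name is None on success.
    """
    for name, fn in steps:
        try:
            value = fn(value)
        except Exception as exc:
            print(f"step '{name}' failed: {exc!r}")
            return value, name
        print(f"step '{name}' ok")
    return value, None


# Stand-in steps; in the real code each entry would be one piece of the long line.
steps = [
    ("abs", abs),
    ("reciprocal", lambda v: 1.0 / v),
    ("scale", lambda v: v * 2.0),
]

result, failed = run_steps(0.0, steps)  # the division by zero is caught and reported
```

With CUDA tensors, kernels are launched asynchronously, so an error can be reported at a later line than the operation that caused it. Running the script with the environment variable CUDA_LAUNCH_BLOCKING=1 makes launches synchronous, which lets a step-by-step check like this point at the right operation.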

LiPan123456789 commented 2 years ago

Hello, I solved this problem by upgrading PyTorch to 1.2.0. The problem persisted only because I had been sticking to the 1.1.0 version given in the readme. By the way, the matrix multiplication still uses the matmul method. Finally, thank you for your reply!
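Since the fix came down to the installed version, a small guard at the start of the evaluation script could flag an older PyTorch early. This is a minimal sketch under the assumption that 1.2.0 is the lowest working version, which is only what this thread reports. The names MIN_VERSION, version_tuple and is_new_enough are hypothetical.

```python
import re

# Assumption from this thread: 1.2.0 worked where 1.1.0 did not.
MIN_VERSION = (1, 2, 0)


def version_tuple(version):
    """Turn '1.2.0' or '1.10.0+cu113' into a 3-tuple of ints such as (1, 2, 0)."""
    core = version.split("+")[0]  # drop a local build suffix such as '+cu113'
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad short versions such as '1.2'
        parts.append(0)
    return tuple(parts)


def is_new_enough(version, minimum=MIN_VERSION):
    """Return True when the version string is at least the minimum."""
    return version_tuple(version) >= minimum


print(is_new_enough("1.1.0"), is_new_enough("1.2.0"))
```

In the evaluation script itself, the string to check would be torch.__version__.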

ToughStoneX / Self-Supervised-MVS

Problems with the evaluation of jdacs-ms #13