alexklwong / calibrated-backprojection-network

PyTorch Implementation of Unsupervised Depth Completion with Calibrated Backprojection Layers (ORAL, ICCV 2021)

Question about RuntimeError: inverse_cuda: For batch 0: U( , ) is zero, singular U. #8

Closed yxx623 closed 2 years ago

yxx623 commented 2 years ago

Hi Alex, thank you for your excellent work. I ran into a problem when running the pretrained model and when training the model. I haven't changed the code, but the following error was reported: RuntimeError: inverse_cuda: For batch 0: U( , ) is zero, singular U. (The values in parentheses are different each time I run it.) Have you met this error before, and how can I solve it? Thanks in advance.

alexklwong commented 2 years ago

That sounds like a data setup problem. Can you provide the commands that you ran, starting from the virtual environment, and where you ran them from (e.g. root of the repository, or outside the repository, etc.)?

Specifically, can you tell me which torch version you had (torch 1.7 seems to have something wrong with it, but 1.8 is okay in https://github.com/alexklwong/calibrated-backprojection-network/issues/7) and which dataset setup, training, or inference bash script you ran?

If this has to do with training, can you also list all the training settings you used, the number of GPUs, etc.?

yxx623 commented 2 years ago

Thanks for your prompt reply. The commands I ran are the same as the ones you provided, and I ran them in the root of the repository. I built the virtual environment just like yours; the torch version is 1.3.0. I have tried to run the pretrained model on the KITTI validation set and test set, and to train the model on the KITTI dataset. All of them gave the same error. [screenshot: 1640761154(1)] The storage layout of the KITTI validation set is shown in the figure, and each folder has 1000 files. I haven't changed any settings, and I used 2 GPUs. Thanks!

alexklwong commented 2 years ago

Strange, I've just cloned a fresh copy of the repo, created the virtual environment, ran python setup/setup_dataset_kitti.py and ran bash bash/kitti/run_kbnet_kitti_validation.sh, but I didn't see the error.

In general, the only spot that would use inverse should be the backprojection step https://github.com/alexklwong/calibrated-backprojection-network/blob/master/src/networks.py#L498 but this shouldn't throw an error because the intrinsics matrix is invertible. That's why I think it is a data loading issue.
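To make the invertibility argument concrete, here is a minimal pure-Python sketch, not the repository's code. It assumes a standard pinhole intrinsics matrix with zero skew, K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]; the function names and the numeric values are hypothetical. For that layout det(K) = fx * fy, so the matrix can only be singular if a focal length was loaded as zero, which is the kind of data loading problem a check like this would surface.

```python
def invert_intrinsics(K, eps=1e-8):
    """Closed-form inverse of a 3x3 pinhole intrinsics matrix.

    Assumes K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] (no skew). This layout
    is an assumption of the sketch, not taken from the repository.
    """
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    # For this layout det(K) = fx * fy, so K is singular only if fx or fy is 0.
    if abs(fx * fy) < eps:
        raise ValueError("singular intrinsics: fx=%r, fy=%r" % (fx, fy))
    return [
        [1.0 / fx, 0.0, -cx / fx],
        [0.0, 1.0 / fy, -cy / fy],
        [0.0, 0.0, 1.0],
    ]


def matmul3(A, B):
    """Plain 3x3 matrix product, used here only to verify the inverse."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


# Illustrative values only; they are not read from the dataset.
K = [[721.5, 0.0, 609.6], [0.0, 721.5, 172.9], [0.0, 0.0, 1.0]]
K_inv = invert_intrinsics(K)
product = matmul3(K, K_inv)  # should be the identity up to rounding
```

If a check like this passes on the intrinsics actually loaded from disk and the GPU inverse still fails, the data is unlikely to be the cause.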

Can you provide the full stack trace just in case?

I am not sure how the inference script will work if you are using 2 GPUs since batch size is one regardless, but in general you just need one, so export CUDA_VISIBLE_DEVICES=0

Training should work just fine with multiple GPUs; this user also used two GPUs: https://github.com/alexklwong/calibrated-backprojection-network/issues/5#issuecomment-989638885

s-mostafa-a commented 2 years ago

Hi all, I am facing this problem too, with torch 1.3.0 on a single GPU when training (train_kbnet_kitti.sh).

alexklwong commented 2 years ago

Perhaps meeting over Google Meet or Zoom would make it easier to troubleshoot this. Would you mind sending me an email at alexw@cs.ucla.edu so that I can schedule 30 minutes to troubleshoot?

s-mostafa-a commented 2 years ago

Hi @alexklwong, thank you so much for your quick support. I used clues from #7 and it worked for me. What I did:

install cuda==11.1
pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio==0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
pip install tensorboard==2.3.0
pip install opencv-python scipy scikit-learn scikit-image matplotlib gdown numpy gast Pillow pyyaml

This setup works for Python 3.7 on Ubuntu 20.04 (GPU: RTX 3090).

I think it would be better to update the README.md file. Previously, I was following the exact instructions there.
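For anyone checking their own environment against the setup above, here is a hedged Python sketch. The helper names (parse_torch_version, meets_minimum, report_environment) are hypothetical, and the (1, 8, 0) minimum is an assumption drawn only from this thread and #7, not an official requirement. The first two helpers are pure Python; report_environment needs torch installed and only attempts the GPU inverse when CUDA is available.

```python
import re


def parse_torch_version(version):
    """Split a string such as '1.8.2+cu111' into ((1, 8, 2), 'cu111')."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)(?:\+(\w+))?", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    major, minor, patch, build = match.groups()
    return (int(major), int(minor), int(patch)), build


def meets_minimum(version, minimum=(1, 8, 0)):
    """True if the release part of `version` is at least `minimum`.

    The default minimum is an assumption based on this thread.
    """
    release, _ = parse_torch_version(version)
    return release >= minimum


def report_environment():
    """Print torch/CUDA details and try a small GPU inverse (requires torch)."""
    import torch

    print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
    print("meets assumed minimum:", meets_minimum(torch.__version__))
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0),
              "| capability:", torch.cuda.get_device_capability(0))
        # The identity is trivially invertible, so an error here points at the
        # torch/CUDA/GPU combination rather than at the data.
        identity = torch.eye(3, device="cuda").unsqueeze(0)
        print("inverse ok:", torch.allclose(torch.inverse(identity), identity))
```

Calling report_environment() inside the virtual environment on the target machine gives a quick signal before running any of the bash scripts.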

alexklwong commented 2 years ago

Ah I see, I think it might be because of the CUDA version needed for the new RTX 30 series. The ones we tested on were GTX 1080s. I'll add instructions for those using newer GPUs.