Testing error - Githubissues

Tsumugii24 commented 7 months ago

Hi there. Inference works fine, but I ran into an error while testing. I am not sure what caused it or how to fix it. My GPU is a Tesla V100-SXM2-32GB. Here is the error output:

(sifu) lby@ubuntu:~/code/SIFU$ python -m apps.train -cfg ./configs/train/sifu.yaml -test

ICON:
w/ Global Image Encoder: True
Image Features used by MLP: ['normal_F', 'normal_B']
Geometry Features used by MLP: ['sdf', 'cmap', 'norm', 'vis', 'sample_id']
Dim of Image Features (local): 6
Dim of Geometry Features (ICON): 7
Dim of MLP's first layer: 78

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Resume MLP weights from ./data/ckpt/sifu.ckpt
Resume normal model from ./data/ckpt/normal.ckpt
load from ./data/cape/test.txt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing: 0it [00:00, ?it/s]../aten/src/ATen/native/cuda/MultinomialKernel.cu:109: binarySearchForMultinomial: block: [7,0,0], thread: [96,0,0] Assertion `cumdist[size - 1] > static_cast<scalar_t>(0)` failed.
../aten/src/ATen/native/cuda/MultinomialKernel.cu:109: binarySearchForMultinomial: block: [7,0,0], thread: [97,0,0] Assertion `cumdist[size - 1] > static_cast<scalar_t>(0)` failed.
......(omit)
../aten/src/ATen/native/cuda/MultinomialKernel.cu:109: binarySearchForMultinomial: block: [2,0,0], thread: [95,0,0] Assertion `cumdist[size - 1] > static_cast<scalar_t>(0)` failed.
Traceback (most recent call last):
  File "/home/lby/miniconda3/envs/sifu/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lby/miniconda3/envs/sifu/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lby/code/SIFU/apps/train.py", line 157, in <module>
    trainer.test(model=model, datamodule=datamodule)
  File "/home/lby/miniconda3/envs/sifu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 915, in test
    results = self.__test_given_model(model, test_dataloaders)
  File "/home/lby/miniconda3/envs/sifu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 973, in __test_given_model
    results = self.fit(model)
  File "/home/lby/miniconda3/envs/sifu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/lby/miniconda3/envs/sifu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 540, in dispatch
    self.accelerator.start_testing(self)
  File "/home/lby/miniconda3/envs/sifu/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 76, in start_testing
    self.training_type_plugin.start_testing(trainer)
  File "/home/lby/miniconda3/envs/sifu/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 118, in start_testing
    self._results = trainer.run_test()
  File "/home/lby/miniconda3/envs/sifu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 786, in run_test
    eval_loop_results, _ = self.run_evaluation()
  File "/home/lby/miniconda3/envs/sifu/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 725, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/lby/miniconda3/envs/sifu/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 160, in evaluation_step
    output = self.trainer.accelerator.test_step(args)
  File "/home/lby/miniconda3/envs/sifu/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 195, in test_step
    return self.training_type_plugin.test_step(*args)
  File "/home/lby/miniconda3/envs/sifu/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 134, in test_step
    return self.lightning_module.test_step(*args, **kwargs)
  File "/home/lby/code/SIFU/apps/ICON.py", line 686, in test_step
    chamfer, p2s = self.evaluator.calculate_chamfer_p2s(num_samples=1000)
  File "/home/lby/code/SIFU/lib/dataset/Evaluator.py", line 167, in calculate_chamfer_p2s
    sample_points_from_meshes(self.tgt_mesh, num_samples))
  File "/home/lby/code/pytorch3d/pytorch3d/ops/sample_points_from_meshes.py", line 100, in sample_points_from_meshes
    sample_face_idxs += mesh_to_face[meshes.valid].view(num_valid_meshes, 1)
RuntimeError: numel: integer multiplication overflow
Testing:   0%|          | 0/450 [00:06<?, ?it/s]
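The thread does not establish why the `cumdist[size - 1] > 0` assertion fired; the eventual fix was a version change. As a hedged diagnostic sketch only: the assertion comes from a multinomial draw whose weights do not sum to a positive value, and the traceback ends in `sample_points_from_meshes`, which (as an assumption here) picks faces in proportion to their area. Under that assumption, a target mesh with no faces, only zero-area faces, or NaN coordinates would produce such weights. The helper names below are hypothetical and the check is plain Python, not PyTorch3D code.

```python
import math


def triangle_area(a, b, c):
    """Area of the 3D triangle (a, b, c): half the norm of the edge cross product."""
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    cx = uy * vz - uz * vy
    cy = uz * vx - ux * vz
    cz = ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def can_sample_by_area(verts, faces):
    """True if the face areas sum to a finite, positive value.

    Hypothetical helper: it mirrors the condition the CUDA assertion checks,
    assuming faces are sampled with probability proportional to area.
    """
    areas = [triangle_area(verts[i], verts[j], verts[k]) for i, j, k in faces]
    total = sum(areas)
    return bool(areas) and math.isfinite(total) and total > 0.0


if __name__ == "__main__":
    verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    print(can_sample_by_area(verts, [(0, 1, 2)]))  # one proper triangle
    print(can_sample_by_area(verts, []))           # no faces at all
```

Running a check like this on the vertices and faces of the mesh passed to the sampler would show whether the input itself is degenerate or whether the problem lies elsewhere, for example in a library version mismatch.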

River-Zhang commented 6 months ago

Hi @Tsumugii24. I'm sorry I saw this issue so late. The error seems to come from CUDA and PyTorch3D. Could you please tell me which versions you are using?
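For anyone asked the same question, a minimal sketch of gathering the relevant versions from inside the environment, using only the standard library (Python 3.8+). The function name `collect_versions` is hypothetical, and the package list is just the set of names that matter in this thread; `conda list` and `nvcc --version` give the same information from the shell.

```python
from importlib import metadata


def collect_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    wanted = ["torch", "torchvision", "pytorch3d", "pytorch-lightning"]
    for name, version in collect_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

This reports what pip metadata says is installed; it does not import the packages, so it works even when an import would crash.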

Tsumugii24 commented 6 months ago

Sure, I hope this information helps!

(sifu) lby@ubuntu:~$ conda list

python                    3.8.19               h955ad1f_0    defaults
pytorch3d                 0.7.6                     dev_0    <develop>
torch                     1.13.0+cu117             pypi_0    pypi
torchaudio                0.13.0+cu117             pypi_0    pypi
torchmetrics              1.3.2                    pypi_0    pypi
torchvision               0.14.0+cu117             pypi_0    pypi

(sifu) lby@ubuntu:~$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

River-Zhang commented 6 months ago

I think the error may be caused by PyTorch3D, because you are on the latest version. You could try a lower version such as 0.7.3 or 0.7.2 (I use 0.7.3). I am not entirely sure about this. If the PyTorch3D version turns out to be the cause, I'll update the instructions. Thanks very much!
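A small sketch of how one might flag this situation before running the test. It compares a version string against 0.7.3, the version the maintainer reports using. The helper names are hypothetical, the comparison looks only at the numeric release part and ignores local suffixes such as `+cu117`, and whether versions above 0.7.3 actually fail is something only this thread's single report suggests.

```python
import re


def release_tuple(version):
    """Leading numeric release of a version string, e.g. '1.13.0+cu117' -> (1, 13, 0)."""
    public = version.split("+", 1)[0]
    match = re.match(r"\d+(?:\.\d+)*", public)
    if match is None:
        raise ValueError(f"no numeric release in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_newer_than(installed, known_good):
    """True if `installed` has a strictly higher release than `known_good`."""
    a, b = release_tuple(installed), release_tuple(known_good)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.7' and '0.7.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a > b


if __name__ == "__main__":
    KNOWN_GOOD = "0.7.3"  # version the maintainer reports using
    print(is_newer_than("0.7.6", KNOWN_GOOD))  # True
    print(is_newer_than("0.7.3", KNOWN_GOOD))  # False
```

In practice the installed string could come from `pytorch3d.__version__` or from pip metadata; a warning when `is_newer_than` returns `True` would be enough to prompt a downgrade attempt.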

Tsumugii24 commented 6 months ago

Hi @River-Zhang, you are right! I changed PyTorch3D from 0.7.6 to 0.7.3 and now everything works. I think I ended up on the latest version because I followed the instruction below and pulled the main branch

PyTorch3D (official INSTALL.md, recommend install-from-local-clone)

before running the command that installs from the requirements file, which specifies the appropriate version

pip install -r requirements.txt

Anyway, thanks very much!

River-Zhang commented 6 months ago

Thanks for your feedback! I'll update the installation instructions.

River-Zhang / SIFU

Testing error #19