TRAILab / CaDDN

Categorical Depth Distribution Network for Monocular 3D Object Detection (CVPR 2021 Oral)
Apache License 2.0

PyTorch exceptions when running train.py or test.py #37

dpwolfe closed this issue 3 years ago

dpwolfe commented 3 years ago

Hello,

Thank you for maintaining this project. I am trying out the test.py and train.py scripts with the KITTI dataset on an NVIDIA Jetson AGX Xavier and am running into an error I am having trouble resolving. Shortly after these log lines appear, an exception is thrown:

2021-06-07 22:59:42,230   INFO  **********************Start training kitti_models/CaDDN(default)**********************
epochs:   0%|                                                                                                     | 0/80 [00:02<?, ?it/s]

This is the command: python train.py --cfg_file cfgs/kitti_models/CaDDN.yaml --batch_size 2

The exception is:

Traceback (most recent call last):                                                                              | 0/1856 [00:00<?, ?it/s]
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 170, in main
    merge_all_iters_to_one_epoch=args.merge_all_iters_to_one_epoch
  File "/home/dpwolfe/repo/CaDDN/tools/train_utils/train_utils.py", line 93, in train_model
    dataloader_iter=dataloader_iter
  File "/home/dpwolfe/repo/CaDDN/tools/train_utils/train_utils.py", line 19, in train_one_epoch
    batch = next(dataloader_iter)
  File "/home/dpwolfe/miniforge3/envs/open-pcdet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/dpwolfe/miniforge3/envs/open-pcdet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/home/dpwolfe/miniforge3/envs/open-pcdet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/home/dpwolfe/miniforge3/envs/open-pcdet/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/dpwolfe/miniforge3/envs/open-pcdet/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/dpwolfe/miniforge3/envs/open-pcdet/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dpwolfe/miniforge3/envs/open-pcdet/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dpwolfe/repo/OpenPCDet/pcdet/datasets/kitti/kitti_dataset.py", line 424, in __getitem__
    data_dict = self.prepare_data(data_dict=input_dict)
  File "/home/dpwolfe/repo/OpenPCDet/pcdet/datasets/dataset.py", line 129, in prepare_data
    'gt_boxes_mask': gt_boxes_mask
  File "/home/dpwolfe/repo/OpenPCDet/pcdet/datasets/augmentor/data_augmentor.py", line 112, in forward
    data_dict = cur_augmentor(data_dict=data_dict)
  File "/home/dpwolfe/repo/OpenPCDet/pcdet/datasets/augmentor/data_augmentor.py", line 84, in random_image_flip
    images = data_dict["images"]
KeyError: 'images'

While going through the setup process, I needed to use slightly different versions of some dependencies than those declared in the requirements file. I did this to use the pre-built versions available for aarch64 and avoid building them myself. Those are:

numpy 1.19.5 instead of 1.20.1
scikit-image 0.18.0rc1 instead of 0.18.1
scipy 1.5.4 instead of 1.6.1
tifffile 2020.9.3 instead of 2021.2.26

I also needed to make a couple of major version changes that I'm concerned might be the problem:

kornia 0.5.3 instead of 0.2.2, since 0.2.2 was not readily available for aarch64
torch 1.6.0, since kornia 0.5.3 requires >= 1.6.0
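For anyone comparing their own environment against the pins, the substitutions listed above can be checked mechanically. The following is a minimal sketch, not part of CaDDN: the `PINNED` table is copied from the versions quoted in this post (the torch 1.4.0 entry is inferred from the question later in this post, not read from the requirements file), and `find_deviations` is a hypothetical helper name.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport for older Pythons such as 3.6

# Versions quoted in this post as the expected pins (assumption: torch 1.4.0
# is inferred from the question below, not taken from the requirements file).
PINNED = {
    "numpy": "1.20.1",
    "scikit-image": "0.18.1",
    "scipy": "1.6.1",
    "tifffile": "2021.2.26",
    "kornia": "0.2.2",
    "torch": "1.4.0",
}


def find_deviations(pinned, get_version=metadata.version):
    """Return {package: (pinned, installed)} for every package whose installed
    version differs from the pin. `installed` is None if the package is missing."""
    deviations = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            deviations[name] = (wanted, installed)
    return deviations


if __name__ == "__main__":
    for name, (wanted, installed) in sorted(find_deviations(PINNED).items()):
        print(f"{name}: pinned {wanted}, installed {installed}")
```

This only reports exact mismatches; it does not say whether a given mismatch is harmful.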

Hardware and Environment:
NVIDIA Jetson AGX Xavier
Jetpack 4.5 (Ubuntu 18.04)
Python 3.6 (using miniforge)
CUDA Version: 10.2

My conda list output:

# packages in environment at /home/dpwolfe/miniforge3/envs/open-pcdet:
#
# Name                    Version                   Build  Channel
_openmp_mutex             4.5                       1_gnu    conda-forge
absl-py                   0.12.0                   pypi_0    pypi
ca-certificates           2018.03.07                    0    c4aarch64
cachetools                4.2.1                    pypi_0    pypi
certifi                   2020.12.5                pypi_0    pypi
chardet                   4.0.0                    pypi_0    pypi
cycler                    0.10.0                   pypi_0    pypi
dataclasses               0.8                      pypi_0    pypi
decorator                 4.4.2                    pypi_0    pypi
easydict                  1.9                      pypi_0    pypi
future                    0.18.2                   pypi_0    pypi
google-auth               1.27.1                   pypi_0    pypi
google-auth-oauthlib      0.4.3                    pypi_0    pypi
grpcio                    1.36.1                   pypi_0    pypi
idna                      2.10                     pypi_0    pypi
imageio                   2.9.0                    pypi_0    pypi
importlib-metadata        4.5.0                    pypi_0    pypi
kiwisolver                1.3.1                    pypi_0    pypi
kornia                    0.5.3                    pypi_0    pypi
ld_impl_linux-aarch64     2.35.1               h02ad14f_2    conda-forge
libblas                   3.8.0               17_openblas    conda-forge
libcblas                  3.8.0               17_openblas    conda-forge
libffi                    3.3                  h884eca8_2    conda-forge
libgcc-ng                 9.3.0               he1ea209_19    conda-forge
libgfortran-ng            7.3.0                h6bc79d0_0    c4aarch64
libgomp                   9.3.0               he1ea209_19    conda-forge
liblapack                 3.8.0               17_openblas    conda-forge
libllvm10                 10.0.1               he513fc3_3    conda-forge
libllvm12                 12.0.0               h6293a0b_1    conda-forge
libopenblas               0.3.10          pthreads_hb3c22a3_4    conda-forge
libstdcxx-ng              9.3.0               h1ed1776_19    conda-forge
llvm-tools                12.0.0               h6293a0b_1    conda-forge
llvmdev                   12.0.0               h6293a0b_1    conda-forge
llvmlite                  0.35.0           py36h2826d25_1    conda-forge
markdown                  3.3.4                    pypi_0    pypi
matplotlib                3.3.4                    pypi_0    pypi
ncurses                   6.2                  h7fd3ca4_4    conda-forge
networkx                  2.5                      pypi_0    pypi
numba                     0.52.0           py36ha63b481_0    conda-forge
numpy                     1.19.5           py36hdc1b780_1    conda-forge
oauthlib                  3.1.0                    pypi_0    pypi
openssl                   1.1.1k               hf897c2e_0    conda-forge
pcdet                     0.3.0+150b7ba             dev_0    <develop>
pillow                    8.1.2                    pypi_0    pypi
pip                       21.1.2             pyhd8ed1ab_0    conda-forge
protobuf                  3.15.3                   pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pyparsing                 2.4.7                    pypi_0    pypi
python                    3.6.13          h468538b_0_cpython    conda-forge
python-dateutil           2.8.1                    pypi_0    pypi
python_abi                3.6                     1_cp36m    conda-forge
pywavelets                1.1.1                    pypi_0    pypi
pyyaml                    5.4.1                    pypi_0    pypi
readline                  8.1                  h1a49cc3_0    conda-forge
requests                  2.25.1                   pypi_0    pypi
requests-oauthlib         1.3.0                    pypi_0    pypi
rsa                       4.7.2                    pypi_0    pypi
scikit-image              0.18.0rc1                pypi_0    pypi
scipy                     1.5.4                    pypi_0    pypi
setuptools                49.6.0           py36h704843e_3    conda-forge
six                       1.15.0                   pypi_0    pypi
spconv                    1.2.1                    pypi_0    pypi
sqlite                    3.35.5               h43e6a2a_0    conda-forge
tensorboard               2.4.1                    pypi_0    pypi
tensorboard-plugin-wit    1.8.0                    pypi_0    pypi
tensorboardx              2.1                      pypi_0    pypi
tifffile                  2020.9.3                 pypi_0    pypi
tk                        8.6.10               ha99a2a3_1    conda-forge
torch                     1.6.0                    pypi_0    pypi
torchvision               0.9.1                    pypi_0    pypi
tqdm                      4.58.0                   pypi_0    pypi
typing-extensions         3.7.4.3                  pypi_0    pypi
urllib3                   1.26.3                   pypi_0    pypi
werkzeug                  1.0.1                    pypi_0    pypi
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xz                        5.2.5                h6dd45c4_1    conda-forge
yolk3k                    0.9                      pypi_0    pypi
zipp                      3.4.1                    pypi_0    pypi
zlib                      1.2.11               h7b6447c_2    c4aarch64

Also, if I run python test.py --cfg_file cfgs/kitti_models/CaDDN.yaml --batch_size 2 --ckpt ../checkpoints/caddn.pth I get the following:

2021-06-07 23:16:09,131   INFO  Loading KITTI dataset
2021-06-07 23:16:09,576   INFO  Total samples for KITTI dataset: 3769
2021-06-07 23:16:13,328   INFO  ==> Loading parameters from checkpoint ../checkpoints/caddn.pth to GPU
2021-06-07 23:16:16,395   INFO  ==> Checkpoint trained from version: pcdet+0.3.0+0000000
2021-06-07 23:16:18,141   INFO  ==> Done (loaded 229/229)
2021-06-07 23:16:18,239   INFO  *************** EPOCH no_number EVALUATION *****************
eval:   0%|                                                                                                     | 0/1885 [00:00<?, ?it/s]

Followed by this error:

Traceback (most recent call last):
  File "test.py", line 199, in <module>
    main()
  File "test.py", line 195, in main
    eval_single_ckpt(model, test_loader, args, eval_output_dir, logger, epoch_id, dist_test=dist_test)
  File "test.py", line 63, in eval_single_ckpt
    result_dir=eval_output_dir, save_to_file=args.save_to_file
  File "/home/dpwolfe/repo/CaDDN/tools/eval_utils/eval_utils.py", line 57, in eval_one_epoch
    pred_dicts, ret_dict = model(batch_dict)
  File "/home/dpwolfe/miniforge3/envs/open-pcdet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dpwolfe/repo/OpenPCDet/pcdet/models/detectors/caddn.py", line 11, in forward
    batch_dict = cur_module(batch_dict)
  File "/home/dpwolfe/miniforge3/envs/open-pcdet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dpwolfe/repo/OpenPCDet/pcdet/models/backbones_2d/map_to_bev/conv2d_collapse.py", line 34, in forward
    voxel_features = batch_dict["voxel_features"]
KeyError: 'voxel_features'
Exception ignored in: <bound method tqdm.__del__ of <tqdm.std.tqdm object at 0x7f849ebb70>>
Traceback (most recent call last):
  File "/home/dpwolfe/miniforge3/envs/open-pcdet/lib/python3.6/site-packages/tqdm/std.py", line 1143, in __del__
  File "/home/dpwolfe/miniforge3/envs/open-pcdet/lib/python3.6/site-packages/tqdm/std.py", line 1297, in close
  File "/home/dpwolfe/miniforge3/envs/open-pcdet/lib/python3.6/site-packages/tqdm/std.py", line 1490, in display
  File "/home/dpwolfe/miniforge3/envs/open-pcdet/lib/python3.6/site-packages/tqdm/std.py", line 1146, in __str__
  File "/home/dpwolfe/miniforge3/envs/open-pcdet/lib/python3.6/site-packages/tqdm/std.py", line 1448, in format_dict
TypeError: 'NoneType' object is not iterable

Have you seen these errors before, or do you know whether they are caused by a change in one of the dependencies, such as PyTorch 1.6.0 instead of 1.4.0?

I greatly appreciate your help. Otherwise, my next step is to unwind my setup and rebuild it with torch 1.4.0 (NVIDIA provides it), since the exception originates from torch. I'll also have to see if I can build kornia 0.2.2 for aarch64.

Thank you!

codyreading commented 3 years ago

Hello!

If you look at your error log, you can actually see the issue:

File "/home/dpwolfe/repo/OpenPCDet/pcdet/datasets/augmentor/data_augmentor.py", line 84, in random_image_flip

It looks like your environment is using the source code from OpenPCDet rather than CaDDN; the two are currently slightly different (I am working on updating this repo to be in line with OpenPCDet). You may be using the same conda environment for both and installed OpenPCDet more recently, so its code is the one being imported.

If you are already using the OpenPCDet repo, my recommendation is to stick with the CaDDN implementation over there. However, if you only need CaDDN, I would recommend using the source code in this repo and making sure the two are installed in separate virtual environments.
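One way to confirm which checkout an environment actually imports is to ask Python where it would load pcdet from. The sketch below is a hedged illustration, not code from either repo: `module_location` and `loaded_from` are hypothetical helper names, and the repo path in the usage section is simply the one that appears in the traceback above.

```python
import importlib.util
from pathlib import Path


def module_location(name):
    """Return the path that `import name` would load from (without importing it),
    or None if the module cannot be found or has no file location."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    if spec.submodule_search_locations:  # a package: use its directory
        return Path(list(spec.submodule_search_locations)[0]).resolve()
    if spec.origin in (None, "built-in", "frozen"):
        return None
    return Path(spec.origin).resolve()


def loaded_from(name, repo_root):
    """True if module `name` would be loaded from somewhere under `repo_root`."""
    location = module_location(name)
    if location is None:
        return False
    root = Path(repo_root).resolve()
    return location == root or root in location.parents


if __name__ == "__main__":
    # Path taken from the traceback in this issue; adjust for your own checkout.
    print("pcdet resolves to:", module_location("pcdet"))
    print("inside CaDDN checkout:", loaded_from("pcdet", "/home/dpwolfe/repo/CaDDN"))
```

In the conda list output above, pcdet is listed as a `<develop>` install, which is consistent with the import path pointing at whichever checkout was installed most recently.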

dpwolfe commented 3 years ago

Thank you @codyreading for the fast reply! I'll give this a shot and let you know how it goes here soon.

codyreading commented 3 years ago

Also FYI, refer to https://github.com/TRAILab/CaDDN/issues/23 for running on torch versions > 1.4.0. This has been fixed in OpenPCDet but not in CaDDN.

dpwolfe commented 3 years ago

Thank you again @codyreading! This worked great along with the path mentioned in #23.