Xiangyu-Gao / Radar-multiple-perspective-object-detection

Codes and template data for paper "RAMP-CNN: A Novel Neural Network for Enhanced Automotive Radar Object Recognition"
MIT License

Questions about understanding code #2

Open Xiangyu-Gao opened 1 year ago

Xiangyu-Gao commented 1 year ago

Please post your questions about understanding the code (e.g. the functionality of each .py script) here and also please feel free to answer questions or post a discussion. I will also check this issue occasionally.

If you have bugs when running the original code downloaded from this repo, please raise a new issue so that we can notice it ASAP.

PalgunaGopireddy commented 1 year ago

Hello. There are a number of files, and I am wondering in which order to run them to understand the code. Could you please enlighten me?

Xiangyu-Gao commented 1 year ago

Sure. Run the code in the following sequence:

1) python slice3d.py
2) python prepare_data.py -m train -dd './data/'
3) python prepare_data.py -m test -dd './data/'
4) python train_dop.py -m C3D
5) python test.py -m C3D -md C3D-20200904-001923
6) python evaluate.py -md C3D-20200904-001923

This is given in the Readme with more details.

PalgunaGopireddy commented 1 year ago

Could you explain how to run the statements below?

python prepare_data.py -m train -dd './data/'
python prepare_data.py -m test -dd './data/'

I don't understand what '-m' is. I see it is a mode you have defined in the parse_args() function in prepare_data.py, but I don't know how to run the script with it. I am using Google Colab.

Xiangyu-Gao commented 1 year ago

These (e.g., '-m', '-dd') are user-defined Python command-line arguments. Check it out here.
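
For readers who have not used command-line flags before, here is a minimal sketch of how flags such as `-m` and `-dd` are typically declared with Python's standard `argparse` module. The long option names, defaults, and help strings are assumptions for illustration; the actual definitions live in `parse_args()` in `prepare_data.py`.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line flags; argv=None means 'read them from sys.argv'."""
    parser = argparse.ArgumentParser(description="Prepare data (illustrative sketch)")
    # Hypothetical long names and defaults; only -m and -dd come from the thread.
    parser.add_argument("-m", "--mode", choices=["train", "test"], default="train",
                        help="which split to prepare")
    parser.add_argument("-dd", "--data_dir", default="./data/",
                        help="directory where the prepared data is stored")
    return parser.parse_args(argv)


# Equivalent to running: python prepare_data.py -m train -dd './data/'
args = parse_args(["-m", "train", "-dd", "./data/"])
print(f"mode={args.mode}, data_dir={args.data_dir}")
```

Everything after the script name on the command line is handed to the parser, so the script has to be started from a shell (or a notebook shell escape) rather than imported.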

PalgunaGopireddy commented 1 year ago

Thanks @Xiangyu-Gao. I understood it. I am able to run it in Google Colab instead of Ubuntu. Just one correction is needed for running the commands in Google Colab: use run prepare_data.py -m train -dd './data/' instead of python prepare_data.py -m train -dd './data/'.

Thank you.

PalgunaGopireddy commented 1 year ago

I was able to run all the preceding scripts. Now I am at run train_dop.py -m C3D, and it gives me the following error. How can I resolve it?

`No data augmentation
Number of sequences to train: 1
Training files length: 111
Window size: 16
Number of epoches: 100
Batch size: 3
Number of iterations in each epoch: 37
Cyclic learning rate
epoch 1, iter 1: loss: 11288.51367188 | load time: 17.2824 | backward time: 8.7006
/usr/local/lib/python3.7/dist-packages/numpy/core/fromnumeric.py:1970: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  result = asarray(a).shape

RuntimeError Traceback (most recent call last)
/content/drive/MyDrive/Colab Notebooks/github files/Radar-multiple-perspective-object-detection-main/train_dop.py in
    202 for epoch in range(epoch_start, n_epoch):
    203     tic_load = time.time()
--> 204     for iter, loaded_data in enumerate(dataloader):
    205         if args.model == 'C3D':
    206             data, data_rv, data_va, confmap_gt, obj_info, real_id = loaded_data

7 frames`

Xiangyu-Gao commented 1 year ago

I didn't see where the error is from.

PalgunaGopireddy commented 1 year ago

It is a RuntimeError originating at line 204 in train_dop.py (--> 204 for iter, loaded_data in enumerate(dataloader):). I guess the problem is with the 'torch' or 'numpy' module. I am using Google Colab. When I ran run train_dop.py -m C3D, it reported:

VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  result = asarray(a).shape
RuntimeError: each element in list of batch should be of equal size

I found that instead of using run, we can also use !python or !python3 in Google Colab.

So when I ran !python3 train_dop.py -m C3D, it produced:

VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  result = asarray(a).shape
File train_dop.py, line 204, in
  for iter, loaded_data in enumerate(dataloader):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 681, in next
  data = self._next_data()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 721, in _next_data
  data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
  return self.collate_fn(data)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 175, in default_collate
  return [default_collate(samples) for samples in transposed] # Backwards compatibility.
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 175, in
  return [default_collate(samples) for samples in transposed] # Backwards compatibility.
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 178, in default_collate
  return elem_type([default_collate(samples) for samples in transposed])
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 178, in
  return elem_type([default_collate(samples) for samples in transposed])
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 171, in default_collate
  raise RuntimeError('each element in list of batch should be of equal size')

RuntimeError: each element in list of batch should be of equal size

Google Colab has all the modules listed in requirements.txt. Here are the versions it has:

matplotlib 3.3.2
numpy 1.21.6
pandas 1.3.5
scipy 1.7.3
torch 1.12.1+cu113
pillow 9.3.0 (upgraded from 7.1.2)
tensorboardX 2.5.1 (not preinstalled; I installed it)

What do you think caused this issue? Could you run it the same way in Google Colab and tell me whether you encounter the same issue?

Xiangyu-Gao commented 1 year ago

I guess the error comes from the DataLoader, where the items in a batch (e.g., inputs, ground truths) did not all have the same length. Neither I nor other users have met this issue when running the code locally, so I suggest double-checking the Google Colab configuration and the data location, or trying a smaller batch size.
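
The message `each element in list of batch should be of equal size` is raised by the consistency check that PyTorch's default collate function applies to list-like fields before batching them. The following is a simplified pure-Python imitation of that idea, not the library code; `find_ragged_fields` and the toy samples are hypothetical. It suggests that a per-sample field whose length varies (for example, a list with one entry per annotated object) would trigger the error as soon as two such samples land in the same batch, whereas a batch of one would pass.

```python
from collections.abc import Sequence


def find_ragged_fields(batch):
    """Return indices of fields whose list-like values differ in length across the batch.

    `batch` is a list of samples; each sample is a tuple of fields.
    Only list/tuple fields are checked, mirroring the idea (not the code)
    of the equal-size check in the default collate function.
    """
    ragged = []
    for field_idx, values in enumerate(zip(*batch)):
        if all(isinstance(v, Sequence) and not isinstance(v, str) for v in values):
            lengths = {len(v) for v in values}
            if len(lengths) > 1:
                ragged.append(field_idx)
    return ragged


# Toy samples: (frame_ids, objects). The second field has a variable length.
sample_a = ([0, 1, 2], [("car", 10, 20)])
sample_b = ([3, 4, 5], [("car", 11, 21), ("pedestrian", 40, 7)])

print(find_ragged_fields([sample_a, sample_a]))  # equal lengths everywhere
print(find_ragged_fields([sample_a, sample_b]))  # field 1 differs in length
print(find_ragged_fields([sample_b]))            # a batch of one cannot be ragged
```

Running a check like this over the raw dataset items, before they reach the DataLoader, would show which field is responsible.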

PalgunaGopireddy commented 9 months ago

Hi Gao. Could you explain a few things that I do not understand in the code? How can I create the train and test data for the whole UWCR dataset, in the same form as the data provided for two scenarios in train_test_data.zip, which is to be copied to ./template_files/train_test_data?

Also, how are training, validation, and testing performed, given that only train and test datasets are provided in train_test_data.zip?

Xiangyu-Gao commented 9 months ago

Sure thing.

1) How can I create the train and test data provided in train_test_data.zip? => Run python slice3d.py to get the 3 slices and save them accordingly to form the train_test_data.
2) Could you tell me whether it is already augmented data? => It is not. You can use data_aug.py to generate more data.
3) The train_test_data is just sample data to show that the train/test scripts work. You may re-split the whole dataset for validation purposes.

PalgunaGopireddy commented 9 months ago

Thanks. The RA_slice generated in slice3d.py has shape [128, 128, 255, 2], while the files saved in the RA_NPY folder of train_test_data.zip have shape [128, 128, 2]. Did you save a single chirp, e.g. RA_slice[:, :, 0, :], or the mean over chirps, np.mean(RA_slice, axis=2), in the RA_NPY folder?

Xiangyu-Gao commented 9 months ago

As mentioned in the paper, we randomly chose one chirp from 255 for RA_NPY. I will make a script about how to save the RA_NPY, RV_NPY, and VA_NPY.

PalgunaGopireddy commented 9 months ago

Thanks, Gao. You may add the script if you wish, but I think it won't be necessary. I added an extra line to the script to save a randomly chosen chirp as RA_NPY: np.save(location, RA_slice[:, :, np.random.randint(RA_slice.shape[2]), :])

PalgunaGopireddy commented 9 months ago

Only training and test folders are provided in the repository. So the model is run with training and validation datasets, right? No test dataset is used?

jimvermunt commented 9 months ago

I ran into the same problem; I am running the code on one of my university's HPC servers. https://github.com/Xiangyu-Gao/Radar-multiple-perspective-object-detection/issues/2#issuecomment-1308230418

I was not able to resolve the problem by following https://github.com/Xiangyu-Gao/Radar-multiple-perspective-object-detection/issues/2#issuecomment-1308237078.

What was the solution to the problem?

When shuffle is turned off on the DataLoader and the batch size is set to 2, the error occurs at iteration number 30.

PalgunaGopireddy commented 9 months ago

@jimvermunt, I have not solved the https://github.com/Xiangyu-Gao/Radar-multiple-perspective-object-detection/issues/2#issuecomment-1308230418 problem yet. I guess it is due to the number of available samples and the batch size. A typical CNN training loop simply uses the remaining samples as a smaller last batch. I think this is where this model gives the error.

Please do update if you overcome it.
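
One way to test the last-batch hypothesis is to work out where a short final batch would appear. The training logs in this thread report 111 training files; the sketch below only does the arithmetic, and the helper name is hypothetical. With 111 samples and a batch size of 2, a shorter final batch could only occur at the 56th batch, so a failure at iteration 30 is unlikely to be explained by leftover samples alone.

```python
import math


def batch_layout(n_samples, batch_size):
    """Return (full_batches, leftover, total_batches) when the last partial batch is kept."""
    full_batches, leftover = divmod(n_samples, batch_size)
    total_batches = math.ceil(n_samples / batch_size)
    return full_batches, leftover, total_batches


for bs in (1, 2, 3):
    full, leftover, total = batch_layout(111, bs)
    print(f"batch_size={bs}: {full} full batches, "
          f"{leftover} leftover sample(s), {total} batches in total")
```

For a batch size of 3 there is no leftover at all (111 = 37 x 3), yet the error was reported with that setting too, which points away from the last batch as the cause.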

jimvermunt commented 9 months ago

https://github.com/Xiangyu-Gao/Radar-multiple-perspective-object-detection/issues/2#issuecomment-1740322150 If I set the batch size to 1, the model does not throw an error; it only shows a warning:

~/lib64/python3.6/site-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)

There seems to be an inconsistency in the sizes of the fields within a single loaded sample. Every time I restart the kernel, the script below throws the warning at the 66th sample:

import os
import time
import json
import argparse

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, CyclicLR
from torch.utils.data import DataLoader
# from tensorboardX import SummaryWriter
from model.loss2 import Cont_Loss
from model.loss import FocalLoss

from data_aug import Aug_data
from config import n_class, train_sets, camera_configs, radar_configs, rodnet_configs
from config import mean1, std1, mean2, std2, mean1_rv, std1_rv, mean2_rv, std2_rv, mean1_va, std1_va, mean2_va, std2_va

from model.RODNet_3D import RODNet
from dataLoader.CRDatasets import CRDataset, CRDatasetSM
from dataLoader.CRDataLoader import CRDataLoader

batch_size = 1
win_size = 8

crdata_train = CRDataset(os.path.join('./data/', 'data_details'),
                         os.path.join('./data/', 'confmaps_gt'),
                         win_size=win_size, set_type='train', stride=8)

# shuffle=False so that a failing sample shows up at the same iteration on every run
dataloader = DataLoader(crdata_train, batch_size=batch_size, shuffle=False, num_workers=0)
for iteration, loaded_data in enumerate(dataloader):
    print(f"iteration {iteration}, \n length loaded data: {len(loaded_data)}")
    print(f"loaded data 0 shape {loaded_data[0].shape}")
    print(f"loaded data real_id {loaded_data[5]}, length {len(loaded_data[5])}")

Setting the batch size to 1 makes it possible to train the model; however, using such small batches is not always preferable.

PalgunaGopireddy commented 9 months ago

True. I guess only the author of the code, @Xiangyu-Gao, can explain how he is able to avoid it.

In this issue, which occurred due to an addressing convention, he said he ran the code on Linux, which is why it did not occur for him.

Maybe the reason is the same here, but I don't know. I do not have Linux, but I have installed WSL2 on Windows. I will run the code on it and post an update here. Meanwhile, if you have Linux, please do run it.

Xiangyu-Gao commented 9 months ago

For this issue https://github.com/Xiangyu-Gao/Radar-multiple-perspective-object-detection/issues/2#issuecomment-1740322150, it is caused by your PyTorch DataLoader, which sometimes loads data of irregular length. This happens at least for some PyTorch versions. Can you try PyTorch 1.5.1 and use the suggested package versions in here?

Xiangyu-Gao commented 9 months ago

Only training and test folders are provided in the repository. So the model is run with training and validation datasets, right? No test dataset is used?

train_dop.py uses the training dataset, and test.py uses the testing dataset. I did not set up validation in the script; you can always split the training data or add another dataset for validation purposes.
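
Since the scripts do not set up validation, one option is to hold out part of the training sequences. This is a minimal sketch, assuming the training data can be described as a list of sequence names; the names below and the 80/20 ratio are placeholders, not the dataset's actual layout.

```python
import random


def split_train_val(sequences, val_fraction=0.2, seed=0):
    """Shuffle sequence names reproducibly and hold out a fraction for validation."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = sorted(sequences)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder sequence names for illustration only.
all_sequences = [f"seq_{i:02d}" for i in range(10)]
train_seqs, val_seqs = split_train_val(all_sequences)
print("train:", train_seqs)
print("val:  ", val_seqs)
```

Splitting by whole sequence rather than by individual frame keeps overlapping windows from the same recording out of both sets at once.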

PalgunaGopireddy commented 9 months ago

For this issue #2 (comment), it is caused by your PyTorch DataLoader, which sometimes loads data of irregular length. This happens at least for some PyTorch versions. Can you try PyTorch 1.5.1 and use the suggested package versions in here?

I installed the same versions as in requirements.txt for all packages. For torch, I installed torch==1.5.1+cu101, but I got this error:

No data augmentation
Number of sequences to train: 1
Training files length: 111
Window size: 16
Number of epoches: 100
Batch size: 3
Number of iterations in each epoch: 37
Cyclic learning rate
/home/palguna/.local/lib/python3.8/site-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)
Traceback (most recent call last):
  File "train_dop.py", line 261, in <module>
    confmap_preds, confmap_preds2 = rodnet(data.float().cuda(), data_rv.float().cuda(), data_va.float().cuda())
  File "/home/palguna/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/Palguna/github/MobileNet/Radar-multiple-perspective-object-detection-main/model/RODNet_3D.py", line 33, in forward
    x_rv = self.c3d_encode_rv(x_rv)
  File "/home/palguna/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/Palguna/github/MobileNet/Radar-multiple-perspective-object-detection-main/model/CDC.py", line 454, in forward
    x = self.relu(self.bn1a(self.conv1a(x)))  # (B, 2, W, 128, 128) -> (B, 64, W, 128, 128)
  File "/home/palguna/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/palguna/.local/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 103, in forward
    return F.batch_norm(
  File "/home/palguna/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 1921, in batch_norm
    return torch.batch_norm(
RuntimeError: CUDA error: unknown error

When I installed plain torch==1.5.1 instead, I got this error:

No data augmentation
Number of sequences to train: 1
Training files length: 111
Window size: 16
Number of epoches: 100
Batch size: 3
Number of iterations in each epoch: 37
Cyclic learning rate
Traceback (most recent call last):
  File "train_dop.py", line 261, in <module>
    confmap_preds, confmap_preds2 = rodnet(data.float().cuda(), data_rv.float().cuda(), data_va.float().cuda())
  File "/home/palguna/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/Palguna/github/MobileNet/Radar-multiple-perspective-object-detection-main/model/RODNet_3D.py", line 33, in forward
    x_rv = self.c3d_encode_rv(x_rv)
  File "/home/palguna/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/Palguna/github/MobileNet/Radar-multiple-perspective-object-detection-main/model/CDC.py", line 454, in forward
    x = self.relu(self.bn1a(self.conv1a(x)))  # (B, 2, W, 128, 128) -> (B, 64, W, 128, 128)
  File "/home/palguna/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/palguna/.local/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 94, in forward
    return F.relu(input, inplace=self.inplace)
  File "/home/palguna/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 1063, in relu
    result = torch.relu(input)
RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 4.00 GiB total capacity; 3.13 GiB already allocated; 0 bytes free; 3.16 GiB reserved in total by PyTorch)

Maybe it is because of the Python version. I am running Python 3.8.10; I will change it to 3.6 and try.

Xiangyu-Gao commented 9 months ago

This error means the GPU ran out of memory while running the code.
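
The size of the failed allocation can be checked by hand. The comment in the traceback gives the activation shape after the first convolution as (B, 64, W, 128, 128). Assuming float32 activations and a temporal length of 32 (the value visible in the tensor shapes printed elsewhere in this thread; that it also applies to this run is an assumption), a batch of 3 comes to exactly the 384 MiB that PyTorch reported for this single tensor. The helper name is hypothetical.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Memory of one dense tensor in MiB (float32 by default)."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 2**20


for batch_size in (1, 2, 3):
    shape = (batch_size, 64, 32, 128, 128)  # (B, C, W, H, W) after the first conv layer
    print(f"batch_size={batch_size}: {tensor_mib(shape):.0f} MiB for one activation tensor")
```

Training keeps many such tensors alive for the backward pass, so a GPU with 4 GiB fills up quickly; lowering the batch size reduces this part of the memory use roughly in proportion.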

jimvermunt commented 9 months ago

Could you share your pip3 freeze? #issuecomment-1741225340

Mine is the following:

certifi==2023.7.22
cycler==0.11.0
future==0.18.3
kiwisolver==1.3.1
matplotlib==3.3.1
numpy==1.19.2
pandas==1.1.3
Pillow==8.4.0
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
scipy==1.5.2
six==1.16.0
torch==1.5.1+cu101
torchvision==0.6.1+cu101

With these packages I run into the following runtime error:

No data augmentation
args.data_dir ./data/
Number of sequences to train: 1
Training files length: 111
Window size: 16
Number of epoches: 100
Batch size: 2
Number of iterations in each epoch: 55
Cyclic learning rate
x_ra.shape:  torch.Size([2, 2, 32, 128, 128])
x_rv.shape:  torch.Size([2, 1, 32, 128, 128])
x_va.shape:  torch.Size([2, 1, 32, 128, 128])
Traceback (most recent call last):
  File "train_dop.py", line 269, in <module>
    confmap_preds, confmap_preds2 = rodnet(data.float().cuda(), data_rv.float().cuda(), data_va.float().cuda())
  File "/home/tue/20204239/77ghzradarpipeline/Radar-multiple-perspective-object-detection/env1/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tue/20204239/77ghzradarpipeline/Radar-multiple-perspective-object-detection/model/RODNet_3D.py", line 35, in forward
    x_ra = self.c3d_encode_ra(x_ra)
  File "/home/tue/20204239/77ghzradarpipeline/Radar-multiple-perspective-object-detection/env1/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tue/20204239/77ghzradarpipeline/Radar-multiple-perspective-object-detection/model/CDC.py", line 375, in forward
    x = self.relu(self.bn1a(self.conv1a(x)))  # (B, 2, W, 128, 128) -> (B, 64, W, 128, 128) Note: W~2W in this case
  File "/home/tue/20204239/77ghzradarpipeline/Radar-multiple-perspective-object-detection/env1/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tue/20204239/77ghzradarpipeline/Radar-multiple-perspective-object-detection/env1/lib64/python3.6/site-packages/torch/nn/modules/conv.py", line 485, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

A first search on Stack Overflow suggests that this is due to a lack of VRAM, which seems unlikely because I am running on an 80 GB machine with a batch size of 2. My estimate is that training with a batch size of 2 should take roughly 6 to 8 GB.

The only differences are that I have not installed tensorboardX (I commented it out in the code) and that I use Pillow 8.4.0, because installing Pillow==9.1.0 gives the following error:

Collecting pandas==1.1.3
  Using cached pandas-1.1.3-cp36-cp36m-manylinux1_x86_64.whl (9.5 MB)
ERROR: Could not find a version that satisfies the requirement Pillow==9.1.0 (from versions: 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7.0, 1.7.1, 1.7.2, 1.7.3, 1.7.4, 1.7.5, 1.7.6, 1.7.7, 1.7.8, 2.0.0, 2.1.0, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.0, 2.5.0, 2.5.1, 2.5.2, 2.5.3, 2.6.0, 2.6.1, 2.6.2, 2.7.0, 2.8.0, 2.8.1, 2.8.2, 2.9.0, 3.0.0, 3.1.0rc1, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.4.0, 3.4.1, 3.4.2, 4.0.0, 4.1.0, 4.1.1, 4.2.0, 4.2.1, 4.3.0, 5.0.0, 5.1.0, 5.2.0, 5.3.0, 5.4.0, 5.4.1, 6.0.0, 6.1.0, 6.2.0, 6.2.1, 6.2.2, 7.0.0, 7.1.0, 7.1.1, 7.1.2, 7.2.0, 8.0.0, 8.0.1, 8.1.0, 8.1.1, 8.1.2, 8.2.0, 8.3.0, 8.3.1, 8.3.2, 8.4.0)
ERROR: No matching distribution found for Pillow==9.1.0

Xiangyu-Gao commented 9 months ago

certifi==2020.6.20
charset-normalizer==2.0.12
cycler==0.10.0
dataclasses @ file:///tmp/build/80754af9/dataclasses_1614363715916/work
docopt==0.6.2
idna==3.3
kiwisolver==1.2.0
matplotlib @ file:///tmp/build/80754af9/matplotlib-base_1597876339545/work
mkl-fft==1.3.0
mkl-random==1.1.0
mkl-service==2.3.0
numpy @ file:///tmp/build/80754af9/numpy_and_numpy_base_1603487797006/work
olefile==0.46
pandas @ file:///tmp/build/80754af9/pandas_1602088135163/work
Pillow @ file:///tmp/build/80754af9/pillow_1614711422658/work
pipreqs==0.4.11
protobuf==3.13.0
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2020.1
requests==2.27.1
scipy @ file:///tmp/build/80754af9/scipy_1597686625380/work
sip==4.19.24
six @ file:///home/linux1/recipes/ci/six_1610970791821/work
tensorboardX @ file:///home/conda/feedstock_root/build_artifacts/tensorboardx_1645578792360/work
torch==1.5.1
torchvision==0.6.0a0+35d732a
tornado==6.0.4
typing-extensions @ file:///home/ktietz/src/ci_mi/typing_extensions_1612808209620/work
urllib3==1.26.9
yarg==0.1.9

jimvermunt commented 9 months ago

I created a workaround, since I could not get it working with the recommended packages. It is not the nicest solution, but it works.

In the train_dop.py file I changed the following:

# 1) DataLoader construction: pass an identity collate_fn so samples are not batched automatically
# dataloader = DataLoader(crdata_train, batch_size=batch_size, shuffle=True, num_workers=0)  # old code
dataloader = DataLoader(crdata_train, batch_size=batch_size, shuffle=True, num_workers=0, collate_fn=lambda x: x)

# 2) Inside the training loop: loaded_data is now a list of samples, so build the batch by hand
# data, data_rv, data_va, confmap_gt, obj_info, real_id = loaded_data  # old code
data_list = [torch.from_numpy(loaded_data[batch_nr][0]).unsqueeze(0) for batch_nr in range(len(loaded_data))]
data_rv_list = [torch.from_numpy(loaded_data[batch_nr][1]).unsqueeze(0) for batch_nr in range(len(loaded_data))]
data_va_list = [torch.from_numpy(loaded_data[batch_nr][2]).unsqueeze(0) for batch_nr in range(len(loaded_data))]
confmap_gt_list = [torch.from_numpy(loaded_data[batch_nr][3]).unsqueeze(0) for batch_nr in range(len(loaded_data))]
# obj_info is not used anywhere, so it is left out because the conversion can be tricky
# real_id is only used for error reporting; instead of commenting out its uses, one can create a placeholder array of size batch_size

# Concatenate the per-sample tensors along a new batch dimension
data = torch.cat(data_list, dim=0)
data_rv = torch.cat(data_rv_list, dim=0)
data_va = torch.cat(data_va_list, dim=0)
confmap_gt = torch.cat(confmap_gt_list, dim=0)

E-Olcay commented 4 months ago

I have a similar problem and followed your recommendation. Can you also share the rest of your modifications? obj_inf and real_id are used in many places.
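
One possible way to keep `obj_info` and `real_id` available is to batch only the four array fields and pass the other two through as plain Python lists. The sketch below is not jimvermunt's code and has not been tested against the repository. It is written without framework imports so that the idea can be checked in isolation; `make_collate`, `stack_fn`, and the assumed field order (taken from the unpacking line `data, data_rv, data_va, confmap_gt, obj_info, real_id = loaded_data`) are assumptions.

```python
def make_collate(stack_fn, n_array_fields=4):
    """Build a collate function that stacks the first fields and keeps the rest as lists.

    stack_fn: callable that turns a list of per-sample arrays into one batch
              (hypothetical hook; with PyTorch it could convert and stack tensors).
    """
    def collate(batch):
        fields = list(zip(*batch))  # transpose: samples x fields -> fields x samples
        stacked = [stack_fn(list(f)) for f in fields[:n_array_fields]]
        passthrough = [list(f) for f in fields[n_array_fields:]]
        return (*stacked, *passthrough)
    return collate


# Toy demonstration with nested lists standing in for arrays.
collate = make_collate(stack_fn=list)
sample_1 = ([1], [2], [3], [4], [("car", 0)], 7)
sample_2 = ([5], [6], [7], [8], [("car", 1), ("cyclist", 2)], 8)
data, data_rv, data_va, confmap_gt, obj_info, real_id = collate([sample_1, sample_2])
print(obj_info)  # per-sample lists of different lengths are kept as they are
print(real_id)
```

With PyTorch, `stack_fn` could be something like `lambda xs: torch.stack([torch.from_numpy(x) for x in xs])`, and the result passed as `collate_fn=make_collate(...)` to the `DataLoader`. Code that later reads `obj_info` or `real_id` would then index them per sample, e.g. `obj_info[b]` for batch element `b`.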