LittlePey / SFD

Sparse Fuse Dense: Towards High Quality 3D Detection with Depth Completion (CVPR 2022, Oral)
Apache License 2.0

Issue during training for multiple classes #20

Open NNtamp opened 2 years ago

NNtamp commented 2 years ago

Hi Sir. We faced an issue while trying to train the model on multiple classes. We modified the sfd.yaml file based on the voxel_rcnn config (please find the sfd.yaml file attached in text format). We received the following error message:

```
Traceback (most recent call last):
  File "train.py", line 200, in <module>
    main()
  File "train.py", line 155, in main
    train_model(
  File "/workspace/SFD/tools/train_utils/train_utils.py", line 86, in train_model
    accumulated_iter = train_one_epoch(
  File "/workspace/SFD/tools/train_utils/train_utils.py", line 19, in train_one_epoch
    batch = next(dataloader_iter)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/workspace/SFD/pcdet/datasets/kitti/kitti_dataset_sfd.py", line 517, in collate_batch
    ret[key] = np.stack(val, axis=0)
  File "<__array_function__ internals>", line 5, in stack
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/shape_base.py", line 427, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
```

Do you have any solution, please? We tried with a batch size of 1 in the beginning, but the model couldn't perform batch normalization, so we increased the batch size to 2. Also, we have a single GPU.

sfd_yaml_file.txt
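
For context on the error above: collate_batch stacks the per-sample arrays for each key with np.stack, which requires every sample in the batch to contribute an array of identical shape. With a batch size of 2, keys whose arrays are sample-dependent in size can differ between the two samples, and the stack fails. Below is a minimal, hypothetical sketch of the failure and of a pad-then-stack workaround; the shapes and the padding strategy are illustrative assumptions, not the SFD implementation.

```python
import numpy as np

# Two samples whose per-sample arrays differ in length -- np.stack raises
# "all input arrays must have the same shape", exactly as in the traceback.
sample_a = np.zeros((120, 4), dtype=np.float32)
sample_b = np.zeros((97, 4), dtype=np.float32)
try:
    np.stack([sample_a, sample_b], axis=0)
except ValueError as err:
    print(err)  # all input arrays must have the same shape

def pad_and_stack(arrays, pad_value=0.0):
    """Hypothetical workaround: pad each array along its first dimension to the
    largest size in the batch before stacking (assumes (N_i, C) arrays)."""
    max_len = max(a.shape[0] for a in arrays)
    padded = []
    for a in arrays:
        pad = np.full((max_len - a.shape[0],) + a.shape[1:], pad_value, dtype=a.dtype)
        padded.append(np.concatenate([a, pad], axis=0))
    return np.stack(padded, axis=0)

batched = pad_and_stack([sample_a, sample_b])
print(batched.shape)  # (2, 120, 4)
```

Whether padding is semantically safe depends on how the model consumes that key downstream, so this only illustrates the shape problem; it is not a drop-in fix for SFD.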

LittlePey commented 2 years ago

Hi, the current version of our code seems to only support 1 sample per GPU. Could you show the error information after you train with a batch size of 1?

NNtamp commented 2 years ago

Thank you for your answer. Attached you can find the traceback error after trying to train with a batch size of 1. message (6).txt

NNtamp commented 2 years ago

@LittlePey Does the above traceback help you understand the issue? Is there a solution? Thank you in advance.

NNtamp commented 2 years ago

@LittlePey Any idea?

LittlePey commented 2 years ago

Hi, it is a bug in the SFD code that occurs when there is no pseudo point in any ROI. When it happens, we simply resume from the latest checkpoint and the error disappears.
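
To illustrate the failure mode being described (not the actual SFD code): when an ROI happens to contain zero pseudo points, any operation that assumes at least one point per ROI can produce an empty or inconsistently shaped array and crash. A generic guard of the kind that avoids such empty-array crashes is sketched below; the function and variable names are hypothetical.

```python
import numpy as np

def gather_roi_pseudo_points(pseudo_points, roi_point_indices, num_features=4):
    """Hypothetical helper: collect the pseudo points falling inside one ROI.

    pseudo_points:     (N, num_features) array of pseudo-point features
    roi_point_indices: indices of the pseudo points inside this ROI (may be empty)
    """
    roi_points = pseudo_points[roi_point_indices]
    if roi_points.shape[0] == 0:
        # Guard for the empty-ROI case: fall back to a single zero point so that
        # downstream stacking/pooling sees a consistent, non-empty shape.
        roi_points = np.zeros((1, num_features), dtype=pseudo_points.dtype)
    return roi_points
```

Where such a guard would actually have to live in SFD (the dataset, the ROI feature pooling, or both) is something only the authors can confirm; the alternative suggested here is simply to resume from the most recent checkpoint and let a different batch be drawn.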

NNtamp commented 2 years ago

Hi @LittlePey, and thank you for your answer. To be honest, I didn't fully understand it, so let me restate the issue. We tried to train SFD on multiple classes with a batch size of 1 on a single-GPU machine and received the attached error. (We also tried larger batch sizes, but, as you mentioned, the current version of the code only supports 1 sample per GPU.) How can we resume from the latest checkpoint if training doesn't start at all? Is there a solution? What do you think? Thank you in advance. message.6.txt

LittlePey commented 2 years ago

Hi, we didn't encounter your problem of training not starting at all. Maybe you can skip the forward and backward passes when this situation happens.
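
A minimal sketch of this "skip the bad iteration" suggestion is below, assuming a loop shaped like train_one_epoch in tools/train_utils/train_utils.py; the names are illustrative, not the actual SFD code. Note that in the first traceback the exception is raised while the DataLoader assembles the batch, which is why the next(dataloader_iter) call sits inside the same try block as the forward and backward passes.

```python
def train_one_epoch_skipping_bad_batches(model, optimizer, dataloader_iter, total_it_each_epoch):
    """Hypothetical variant of an OpenPCDet-style training loop that catches the
    exception raised by an unlucky batch (e.g. one where no pseudo point lands in
    any ROI) and moves on instead of aborting the run."""
    model.train()
    for _ in range(total_it_each_epoch):
        try:
            batch = next(dataloader_iter)   # batch assembly itself can be the failing step
            loss = model(batch)             # forward
            optimizer.zero_grad()
            loss.backward()                 # backward
            optimizer.step()
        except (ValueError, TypeError) as err:
            # Skip this iteration and draw the next batch instead of crashing.
            print(f'Skipping batch: {err}')
```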

NNtamp commented 2 years ago

Hi again @LittlePey. Any update on this? Is there a way to configure the training procedure with a batch size of 2?

Dowe-dong commented 2 years ago

This bug appears before the first epoch has finished, so how can I resume from the latest checkpoint?