NNtamp opened this issue 2 years ago
Hi, the current version of our code seems to support only 1 sample per GPU. Could you show the error message you get after training with a batch size of 1?
Thank you for your answer. Attached you can find the traceback error after trying to train with a batch size of 1. message (6).txt
@LittlePey Does the above traceback error help you understand the issue? Is there a solution? Thank you in advance.
@LittlePey Any idea?
Hi, it is a bug in the SFD code that occurs when there are no pseudo points in any ROI. Sometimes we just resume from the latest checkpoint and the error disappears.
Hi @LittlePey, and thank you for your answer. To be honest, I didn't understand it, so let me restate the issue. We tried to train SFD on multiple classes with a batch size of 1 on a single-GPU machine and received the attached error. (We also tried larger batch sizes, but as you mentioned, the current version of the code seems to support only 1 sample per GPU.) How can we resume from the latest checkpoint if training doesn't start at all? Is there a solution? What do you think? Thank you in advance. message.6.txt
Hi, we didn't encounter the problem of training not starting at all. Maybe you can skip the forward and backward pass when this situation happens.
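A minimal sketch of that skip-the-bad-batch suggestion, assuming the loop body in train_utils.py can be factored into a single step function. The function and parameter names here are illustrative, not the actual SFD code, and which exception types to catch depends on where the failure surfaces:

```python
def train_steps_skipping_bad_batches(step_fn, batches):
    """Run step_fn on each batch, skipping batches that raise.

    step_fn is assumed to do the forward pass, backward pass, and
    optimizer step for one batch. A malformed batch (e.g. one where
    no ROI contains pseudo points) raises, and we simply move on.
    Returns the number of successful steps.
    """
    ok = 0
    for batch in batches:
        try:
            step_fn(batch)  # forward + backward + optimizer step
            ok += 1
        except (TypeError, ValueError):
            # skip forward/backward for this batch and keep training
            continue
    return ok
```

Whether skipping is acceptable depends on how often the bad batches occur; if most batches fail, the underlying data/config problem still needs fixing.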
Hi again @LittlePey. Any update on this? Is there a way to configure the training procedure with a batch size of 2?
This bug appears before the first epoch has finished, so how can I resume from the latest checkpoint?
Hi. We faced an issue while trying to train the model on multiple classes. We modified the sfd.yaml file based on the voxel_rcnn config (please find the sfd.yaml file attached in text format). We received the following error message:
Traceback (most recent call last):
  File "train.py", line 200, in <module>
    main()
  File "train.py", line 155, in main
    train_model(
  File "/workspace/SFD/tools/train_utils/train_utils.py", line 86, in train_model
    accumulated_iter = train_one_epoch(
  File "/workspace/SFD/tools/train_utils/train_utils.py", line 19, in train_one_epoch
    batch = next(dataloader_iter)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/workspace/SFD/pcdet/datasets/kitti/kitti_dataset_sfd.py", line 517, in collate_batch
    ret[key] = np.stack(val, axis=0)
  File "", line 5, in stack
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/shape_base.py", line 427, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
Do you have any solution, please? We tried with a batch size of 1 in the beginning, but the model couldn't perform batch normalization, so we increased the batch size to 2. Also, we have a single GPU.
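The traceback shows np.stack failing in collate_batch because the per-sample arrays for some key have different shapes, which np.stack does not allow. One hypothetical workaround (not from the SFD authors) is to zero-pad the arrays to a common first dimension before stacking; whether padding is semantically valid depends on what that key actually holds:

```python
import numpy as np

def pad_and_stack(arrays):
    """Stack arrays whose first dimensions differ by zero-padding them.

    Illustrative sketch of a collate-time workaround: np.stack requires
    identical shapes, so we pad each array along axis 0 up to the longest
    one, then stack into a single batch array.
    """
    max_len = max(a.shape[0] for a in arrays)
    padded = []
    for a in arrays:
        pad = np.zeros((max_len - a.shape[0],) + a.shape[1:], dtype=a.dtype)
        padded.append(np.concatenate([a, pad], axis=0))
    return np.stack(padded, axis=0)
```

If the key holds per-ROI pseudo points, padding with zeros may introduce fake points, so a mask or a per-key padding rule would be needed in the real collate function.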
sfd_yaml_file.txt