autonomousvision / carla_garage

[ICCV'23] Hidden Biases of End-to-End Driving Models
MIT License
203 stars 16 forks

RuntimeError: stack expects each tensor to be equal size, but got [256, 256] at entry 0 and [1, 11] at entry 11 #21

Closed YounghwaJung closed 7 months ago

YounghwaJung commented 8 months ago

Hi, I am trying to reproduce the results by training the model. However, while training, an error occurred, and I am uncertain about the cause. Could you please suggest any possible solutions?

Traceback (most recent call last):
  File "train.py", line 1019, in <module>
    main()
  File "/data/anaconda3/envs/garage/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "train.py", line 624, in main
    trainer.train()
  File "train.py", line 883, in train
    for i, data in enumerate(tqdm(self.dataloader_train, disable=self.rank != 0)):
  File "/data/anaconda3/envs/garage/lib/python3.7/site-packages/tqdm/std.py", line 1183, in __iter__
    for obj in iterable:
  File "/data/anaconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/data/anaconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 721, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/data/anaconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/data/anaconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 160, in default_collate
    return elem_type({key: default_collate([d[key] for d in batch]) for key in elem})
  File "/data/anaconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 160, in <dictcomp>
    return elem_type({key: default_collate([d[key] for d in batch]) for key in elem})
  File "/data/anaconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 149, in default_collate
    return default_collate([torch.as_tensor(b) for b in batch])
  File "/data/anaconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 141, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [256, 256] at entry 0 and [1, 11] at entry 11

Kait0 commented 8 months ago

Hard to tell from just that error what is going wrong; there is some shape mismatch when the PyTorch dataloader batches individual samples together. What command did you use to start training? Were any changes made to the code?
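
For illustration, a minimal sketch (not project code) of how this error arises: default_collate stacks each dictionary key across the batch, so a single sample with a wrongly shaped tensor is enough to break torch.stack. The key name and shapes below are only stand-ins taken from the error message, not from the repository's data loader.

import torch
# Same function the traceback goes through before calling torch.stack.
from torch.utils.data._utils.collate import default_collate

good = {'bev_semantic': torch.zeros(256, 256)}  # expected shape
bad = {'bev_semantic': torch.zeros(1, 11)}      # e.g. a corrupted/partially read file

batch = [good] * 11 + [bad]  # the broken sample sits at entry 11
default_collate(batch)       # RuntimeError: stack expects each tensor to be equal size ...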

YounghwaJung commented 8 months ago

I used team_code/shell_train.sh and made the following modifications because I encountered the same error as https://github.com/autonomousvision/carla_garage/issues/12.

torchrun --nnodes=1 --nproc_per_node 8 --max_restarts=1 --rdzv_id=42353467 --rdzv_backend=c10d train.py --id train_id_000 --batch_size 8 --setting 02_05_withheld --root_dir /dataset/carla_garage --logdir logs --use_controller_input_prediction 1 --use_wp_gru 0 --use_discrete_command 1 --use_tp 1 --continue_epoch 1 --cpu_cores 0 --num_repetitions 3

The training runs smoothly for thousands of steps, but errors occur randomly. Is it possible that the datasets got corrupted during the download process?

Noce99 commented 8 months ago

I also encountered the same error as #12. I modified shell_train.sh as suggested, and now I am also getting stack expects each tensor to be equal size, but got [256, 256] at entry 0 and [1, 11] at entry 6 randomly during training. My shell_train.sh is the following:

torchrun --nnodes=1 --nproc_per_node=1 --max_restarts=1 --rdzv_id=42353467 --rdzv_backend=c10d train.py --id train_id_000 --batch_size 12 --setting 02_05_withheld --root_dir /home/enrico/Projects/Carla/carla_garage/data --logdir /home/enrico/Projects/Carla/carla_garage/logs --use_controller_input_prediction 1 --use_wp_gru 0 --use_discrete_command 1 --use_tp 1 --continue_epoch 1 --cpu_cores 0 --num_repetitions 3

Kait0 commented 8 months ago

In principle it's possible that something got corrupted during download or unzipping. Some of the data loading libraries fail silently and return None when they can't load data.

You can check that by adding asserts for None or similar here.
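
For illustration, a hedged sketch of such a check; the function name, the expected (256, 256) shape and the use of cv2 are assumptions for this example, not the actual data.py code.

import cv2

def load_bev_semantic(path):
    # cv2.imread fails silently: it returns None instead of raising, which is
    # how a corrupted file can slip through until batching blows up.
    img = cv2.imread(path, cv2.IMREAD_UNCHANGED)
    assert img is not None, f'Could not load {path}'
    assert img.shape[:2] == (256, 256), f'Unexpected shape {img.shape} for {path}'
    return img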

Since the two of you also seem to have problems with the PyTorch multiprocessing: which multiprocessing start method is selected on your system? (It gets printed at the beginning as "Start method of multiprocessing:".)

If it is possible for you to debug and see which data it is that crashes the batching, that would also be helpful.

@Noce99 Unrelated to your problem, but you are using a very small batch size (total of 12). The total batch size is num GPUs x batch size, so if you want to train with 1 GPU you probably need to increase the batch size or, if that is not possible, reduce the learning rate proportionally (to compensate for the noisier gradient and the larger number of total gradient steps).
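
For illustration, the proportional scaling works out as in the sketch below; the reference values (8 GPUs at batch size 12 and a base learning rate of 3e-4) are placeholders for this example, not the repository defaults.

reference_total_batch = 8 * 12   # num GPUs * per-GPU batch size of a reference run (assumed)
reference_lr = 3e-4              # assumed base learning rate

my_total_batch = 1 * 12          # single-GPU run from the command above
scaled_lr = reference_lr * my_total_batch / reference_total_batch
print(scaled_lr)                 # 3.75e-05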

Noce99 commented 8 months ago
  1. Start method of multiprocessing: fork
  2. I launched a training with some debugging and asserts to discover whether the problem is in the data. I will let you know!
  3. @Kait0 Thank you very much for the suggestions. I know that one GPU is not enough, but I don't have anything more for now ;-(. I just wanted to try the training process, knowing that it was not enough. I didn't think about decreasing the learning rate, but I will try something like --lr 7.5e-5 for sure, thanks!

Kait0 commented 8 months ago

fork is the Python default on Linux. It works for me, but you can also try the newer "spawn" or "forkserver" methods, since you seem to have problems with multiprocessing (also, setting --cpu_cores 0 might slow down training since data loading isn't parallelized). You can set the method here.
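
For reference, a minimal sketch of switching the start method via torch.multiprocessing; where exactly this call sits in train.py (the "here" link above) is not visible in this thread, so treat the placement as an assumption.

import torch.multiprocessing as mp

if __name__ == '__main__':
    # 'fork' is the Linux default; 'spawn' or 'forkserver' start workers in
    # fresh interpreter processes, which avoids some fork-related issues at
    # the cost of slower worker start-up.
    mp.set_start_method('spawn', force=True)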

Noce99 commented 8 months ago

I have solved the problem by checking the shape of the data before using it. It turns out that just one element was corrupted. Below is how I modified data.py.

...
class CARLA_Data(Dataset):
    ...
    def __init__(self, ...):
        ...
        self.last_data = None
        ...

    def __getitem__(self, index):
        ...
        # Expected tensor shapes for every key of a sample; if any of them does
        # not match, the sample is skipped and the previously loaded (valid)
        # sample is returned instead.
        wanted_sizes = {'semantic': (256, 1024), 'bev_semantic': (256, 256), 'depth': (256, 1024),
                        'rgb': (3, 256, 1024), 'center_heatmap': (4, 64, 64), 'wh': (2, 64, 64),
                        'yaw_class': (64, 64), 'yaw_res': (1, 64, 64), 'offset': (2, 64, 64),
                        'velocity': (1, 64, 64), 'brake_target': (64, 64), 'pixel_weight': (2, 64, 64),
                        'lidar': (1, 256, 256), 'bounding_boxes': (30, 8), 'command': (6,),
                        'next_command': (6,), 'route': (20, 2), 'target_point': (2,), 'aim_wp': (2,)}
        for wanted_key in wanted_sizes:
            if tuple(data[wanted_key].shape) != wanted_sizes[wanted_key]:
                print(f"Bad {wanted_key}")
                return self.last_data
        self.last_data = data
        return data
...

It turns out that the mismatched shape came from a single bev_semantic.
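
For anyone who prefers to locate the corrupted file offline instead of skipping it at load time, a hedged sketch (not part of the repository); the root directory and the bev_semantics*/*.png layout are inferred from the commands above and the path reported later in this thread.

import cv2
from pathlib import Path

root = Path('/home/enrico/Projects/Carla/carla_garage/data')  # adjust to your --root_dir
for png in sorted(root.rglob('bev_semantics*/*.png')):
    img = cv2.imread(str(png), cv2.IMREAD_UNCHANGED)
    if img is None or img.shape[:2] != (256, 256):
        print('Corrupt:', png, None if img is None else img.shape)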

Kait0 commented 8 months ago

Great. Can you share which file was corrupt, in case it's the same for multiple people?

buaazeus commented 6 months ago

s1_dataset_2023_05_10/Routes_Town04_Scenario1_Repetition2/Town04_Scenario1_route49_05_11_16_31_34/bev_semantics_augmented/0039.png