NVlabs / ScePT

Code for the CVPR 2022 paper "ScePT: Scene-consistent, Policy-based Trajectory Predictions for Planning" by Yuxiao Chen, Boris Ivanovic, and Marco Pavone

RuntimeError: operation does not have an identity in training. #3

Closed ZeweiZhou closed 1 year ago

ZeweiZhou commented 2 years ago

Hi! When I train the model with the nuScenes dataset, I run into the following issue. Do some settings lead to an empty tensor? I printed the tensor and got: tensor([16, 16], device='cuda:0', dtype=torch.int32) and tensor([], device='cuda:0', dtype=torch.int32)

(ScePT) root@/ScePT-main/ScePT# python -m torch.distributed.launch --nproc_per_node=1 train.py --train_data_dict nuScenes_mini_train.pkl --eval_data_dict nuScenes_mini_val.pkl --offline_scene_graph yes --preprocess_workers 1 --log_dir ../experiments/nuScenes/models  --train_epochs 1 --augment --conf ../config/clique_nusc_config.json --indexing_workers=1 --batch_size=1 --vis_every=1 --map_encoding --incl_robot_node --eval_every=1
[20660]: world_size = 1, rank = 0, backend=nccl, port = 29500 
-----------------------
| TRAINING PARAMETERS |
-----------------------
| Batch Size: 1
| Eval Batch Size: 256
| Device: cuda:0
| Learning Rate: 0.0015
| Learning Rate Step Every: None
| Offline Scene Graph Calculation: yes
| MHL: 1
| PH: 8
-----------------------
Processing Scenes (1 CPUs): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [03:17<00:00, 24.73s/it]
Rank 0: Loaded training data from ../experiments/processed/nuScenes_mini_train.pkl
Processing Scenes (1 CPUs): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:22<00:00, 11.26s/it]
424
Rank 0: Created Training Model.
  0%|                                              | 0.00/1.80k [00:00<?, ?it/s]
  0%|                                              | 0.00/1.80k [00:01<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 587, in <module>
    spmd_main(args.local_rank)
  File "train.py", line 583, in spmd_main
    train(local_rank, args)
  File "train.py", line 361, in train
    train_loss = ScePT_model(batch)
  File "/root/miniconda3/envs/ScePT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/ScePT/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/envs/ScePT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/ScePT-main/ScePT/model/ScePT.py", line 70, in forward
    return self.train_loss(batch)
  File "/root/autodl-tmp/ScePT-main/ScePT/model/ScePT.py", line 88, in train_loss
    loss = self.model(
  File "/root/miniconda3/envs/ScePT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/ScePT-main/ScePT/model/mgcvae_clique.py", line 1507, in forward
    return self.train_loss(**kwargs)
  File "/root/autodl-tmp/ScePT-main/ScePT/model/mgcvae_clique.py", line 1860, in train_loss
    matching_loss = self.calc_traj_matching(
  File "/root/autodl-tmp/ScePT-main/ScePT/model/mgcvae_clique.py", line 1527, in calc_traj_matching
    N_z = int(torch.max(z_num[nt]))
RuntimeError: operation does not have an identity.
Killing subprocess 20660
Traceback (most recent call last):
  File "/root/miniconda3/envs/ScePT/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/ScePT/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/ScePT/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/root/miniconda3/envs/ScePT/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/root/miniconda3/envs/ScePT/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/ScePT/bin/python', '-u', 'train.py', '--local_rank=0', '--train_data_dict', 'nuScenes_mini_train.pkl', '--eval_data_dict', 'nuScenes_mini_val.pkl', '--offline_scene_graph', 'yes', '--preprocess_workers', '1', '--log_dir', '../experiments/nuScenes/models', '--train_epochs', '1', '--augment', '--conf', '../config/clique_nusc_config.json', '--indexing_workers=1', '--batch_size=1', '--vis_every=1', '--map_encoding', '--incl_robot_node', '--eval_every=1']' returned non-zero exit status 1.
chenyx09 commented 2 years ago

I haven't encountered this bug. It seems that z_num is empty for a certain node type; perhaps a node type is missing either in the config or in the training data?
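For reference, the failing line (`N_z = int(torch.max(z_num[nt]))`) can be reproduced in isolation: a full reduction with `torch.max` over an empty tensor has no identity element, so it raises exactly this error. A minimal sketch (the wording of the error message may differ between PyTorch versions):

```python
import torch

# Values mirroring what the traceback shows for the two node types
z_num = {
    "VEHICLE": torch.tensor([16, 16], dtype=torch.int32),
    "PEDESTRIAN": torch.tensor([], dtype=torch.int32),  # no pedestrians in the batch
}

print(int(torch.max(z_num["VEHICLE"])))   # works: 16
print(int(torch.max(z_num["PEDESTRIAN"])))  # RuntimeError: operation does not have an identity.
```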

ZeweiZhou commented 2 years ago

I appreciate you getting back to me so promptly. Thank you so much!

Unfortunately, the issue still exists. When I print z_num, I get: VEHICLE: tensor([ 4, 20, 20, 20], device='cuda:0', dtype=torch.int32), PEDESTRIAN: tensor([], device='cuda:0', dtype=torch.int32). Could you tell me whether any special setting is needed for the pedestrian data in nuScenes?

I also found that the data processing output is:

Processing mini_train Scenes (2 CPUs): 100%|█████████████| 8/8 [02:36<00:00, 19.61s/it]
Processed 8 scenes
Saved Environment!
Total Nodes: 0
Curvature > 0.1 Nodes: 0
Curvature > 0.2 Nodes: 0
Preprocessing mini_val Samples: 100%|████████████████| 61/61 [00:00<00:00, 9899.12it/s]
Processing mini_val Scenes (2 CPUs): 100%|███████████████| 2/2 [00:32<00:00, 16.31s/it]
Processed 2 scenes
Saved Environment!
Total Nodes: 0
Curvature > 0.1 Nodes: 0
Curvature > 0.2 Nodes: 0

Is it normal for the total node count to be 0?

chenyx09 commented 2 years ago

Oh, I think this is because of your batch size. The code assumes that every batch contains at least one pedestrian and one vehicle, but there is no pedestrian in your batch. I just pushed a fix; I hope it solves the problem. In any case, using a larger batch size would practically avoid the issue.
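A minimal sketch of the kind of guard involved (hypothetical, not necessarily identical to the pushed fix): skip node types whose z_num is empty before taking the max.

```python
import torch

def max_z_per_node_type(z_num):
    """Return the max latent count per node type, skipping types absent from the batch."""
    N_z = {}
    for nt, counts in z_num.items():
        if counts.numel() == 0:
            continue  # e.g. no PEDESTRIAN nodes when batch_size=1
        N_z[nt] = int(torch.max(counts))
    return N_z
```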