Tsinghua-MARS-Lab / DenseTNT

MIT License

Multi-GPU training hangs at this point #36

Open XiaomuWang opened 1 year ago

XiaomuWang commented 1 year ago

root@18dc3f8e2e1d:/workspace/wangs/DenseTNT# python src/run.py --argoverse --future_frame_num 30 --do_train --data_dir /workspace/datasets/Argoverse/train/data/ --output_dir models.densetnt.1 --hidden_size 128 --train_batch_size 64 --use_map --core_num 16 --use_centerline --distributed_training 8 --other_params semantic_lane direction l1_loss goals_2D enhance_global_graph subdivide goal_scoring laneGCN point_sub_graph lane_scoring complete_traj complete_traj-3

{'add_prefix': None, 'agent_type': None, 'argoverse': True, 'attention_decay': False, 'autoregression': None, 'core_num': 16, 'cuda_visible_device_num': None, 'data_dir': '/workspace/datasets/Argoverse/train/data/', 'data_dir_for_val': 'val/data/', 'debug': False, 'distributed_training': 8, 'do_eval': False, 'do_test': False, 'do_train': True, 'eval_batch_size': 64, 'eval_params': [], 'future_frame_num': 30, 'future_test_frame_num': 16, 'global_graph_depth': 1, 'gpu_split': 0, 'hidden_dropout_prob': 0.1, 'hidden_size': 128, 'initializer_range': 0.02, 'inter_agent_types': None, 'learning_rate': 0.001, 'log_dir': 'models.densetnt.1', 'lstm': False, 'master_port': '12355', 'max_distance': 50.0, 'method_span': [0, 1], 'mode_num': 6, 'model_recover_path': None, 'model_save_dir': 'models.densetnt.1/model_save', 'multi': None, 'nms_threshold': None, 'no_agents': False, 'no_cuda': False, 'no_sub_graph': False, 'not_use_api': False, 'num_train_epochs': 16.0, 'nuscenes': False, 'old_version': False, 'other_params': {'semantic_lane': True, 'direction': True, 'l1_loss': True, 'goals_2D': True, 'enhance_global_graph': True, 'subdivide': True, 'goal_scoring': True, 'laneGCN': True, 'point_sub_graph': True, 'lane_scoring': True, 'complete_traj': True, 'complete_traj-3': True}, 'output_dir': 'models.densetnt.1', 'placeholder': 0.0, 'reuse_temp_file': False, 'seed': 42, 'single_agent': True, 'stage_one_K': None, 'sub_graph_batch_size': 8000, 'sub_graph_depth': 3, 'temp_file_dir': 'models.densetnt.1/temp_file', 'train_batch_size': 64, 'train_extra': False, 'train_params': [], 'use_centerline': True, 'use_map': True, 'visualize': False, 'waymo': False, 'weight_decay': 0.01}

10/21/2022 01:57:04 - INFO - main - args output_dir models.densetnt.1 other_params ['semantic_lane', 'direction', 'l1_loss', 'goals_2D', 'enhance_global_graph', 'subdivide', 'goal_scoring', 'laneGCN', 'point_sub_graph', 'lane_scoring', 'complete_traj', 'complete_traj-3']
10/21/2022 01:57:11 - INFO - main - device: cuda
Loading dataset ['/workspace/datasets/Argoverse/train/data/']
/opt/conda/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.4)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
10/21/2022 01:57:12 - INFO - argoverse.data_loading.vector_map_loader - Loaded root: ArgoverseVectorMap
Running DDP on rank 3.
Running DDP on rank 5.
Running DDP on rank 1.
Running DDP on rank 7.
Running DDP on rank 0.
Running DDP on rank 6.
Running DDP on rank 4.
Running DDP on rank 2.
10/21/2022 01:57:13 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 6
10/21/2022 01:57:13 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 4
10/21/2022 01:57:13 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 2
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 3
10/21/2022 01:57:14 - INFO - argoverse.data_loading.vector_map_loader - Loaded root: ArgoverseVectorMap
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 5
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 1
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 0
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 7
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
['/workspace/datasets/Argoverse/train/data/129892.csv', '/workspace/datasets/Argoverse/train/data/179439.csv', '/workspace/datasets/Argoverse/train/data/153379.csv', '/workspace/datasets/Argoverse/train/data/11971.csv', '/workspace/datasets/Argoverse/train/data/181683.csv']
['/workspace/datasets/Argoverse/train/data/209097.csv', '/workspace/datasets/Argoverse/train/data/102649.csv', '/workspace/datasets/Argoverse/train/data/186077.csv', '/workspace/datasets/Argoverse/train/data/74459.csv', '/workspace/datasets/Argoverse/train/data/89887.csv']
100%|██████████| 205942/205942 [06:12<00:00, 552.14it/s]
100%|██████████| 205942/205942 [00:07<00:00, 27049.96it/s]
valid data size is 205942

XiaomuWang commented 1 year ago

The program freezes here for a long time.
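One way to find out where each process is stuck is to have Python print its stack traces periodically. The sketch below uses only the standard-library `faulthandler` module; the helper names `enable_hang_dump` and `disable_hang_dump` and the 600-second default are illustrative assumptions, not part of DenseTNT.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s=600, repeat=True, file=sys.stderr):
    """Dump the stack of every thread after timeout_s seconds.

    With repeat=True the dump is written again every timeout_s seconds
    until disable_hang_dump() is called. `file` must expose fileno().
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=repeat, file=file)


def disable_hang_dump():
    """Cancel the pending dump, for example once training is progressing."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `enable_hang_dump()` at the start of each worker process should show, after the timeout, which call every rank is blocked in (data loading, a collective operation, and so on).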

GentleSmile commented 1 year ago

Have you tried using gloo as the backend? The hang may be caused by an NCCL error.