Running pre-training script with point contrast

martiiv commented 11 months ago

Hello!

I am trying to run the pre-training script listed in the codebase documentation. I am getting the following error message when trying to run the script:

Script:

sh scripts/PRETRAIN/dist_train_pointcontrast.sh 2 \
    --cfg_file ./cfgs/once_models/unsupervised_model/pointcontrast_pvrcnn_res_plus_backbone.yaml \
    --batch_size 2 \
    --epochs 30

Error:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 506209) of binary: /cluster/home/martiiv/deeplearningproject/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group

I am using:

Two NVIDIA A100 GPUs
CUDA 11.1
GCC 10.2
Python 3.8.6
PyTorch version: 1.9.0+cu111

Has anyone else encountered this error?

Full message:

`+ NGPUS=2

PY_ARGS='--cfg_file ./cfgs/once_models/unsupervised_model/pointcontrast_pvrcnn_res_plus_backbone.yaml --batch_size 2 --epochs 15'
true
PORT=38966
++ nc -z 127.0.0.1 38966
++ echo 1
status=1
'[' 1 '!=' 0 ']'
break
echo 38966 38966
python -m torch.distributed.launch --nproc_per_node=2 --master_port=38966 train_pointcontrast.py --launcher pytorch --cfg_file ./cfgs/once_models/unsupervised_model/pointcontrast_pvrcnn_res_plus_backbone.yaml --batch_size 2 --epochs 15
/cluster/home/martiiv/deeplearningproject/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases. Please read local_rank from os.environ('LOCAL_RANK') instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint : train_pointcontrast.py
  min_nodes : 1
  max_nodes : 1
  nproc_per_node : 2
  run_id : none
  rdzv_backend : static
  rdzv_endpoint : 127.0.0.1:38966
  rdzv_configs : {'rank': 0, 'timeout': 900}
  max_restarts : 3
  monitor_interval : 5
  log_dir : None
  metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_eeetvw03/none_ji9a7o3n
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/cluster/home/martiiv/deeplearningproject/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=38966
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_eeetvw03/none_ji9a7o3n/attempt_0/0/error.json INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_eeetvw03/none_ji9a7o3n/attempt_0/1/error.json program started program started 2023-11-13 12:58:15,729 train_pointcontrast.py main 91 INFO **Start logging** 2023-11-13 12:58:15,730 train_pointcontrast.py main 93 INFO CUDA_VISIBLE_DEVICES=0,1 2023-11-13 12:58:15,730 train_pointcontrast.py main 96 INFO total_batch_size: 2 2023-11-13 12:58:15,730 train_pointcontrast.py main 98 INFO cfg_file ./cfgs/once_models/unsupervised_model/pointcontrast_pvrcnn_res_plus_backbone.yaml 2023-11-13 12:58:15,730 train_pointcontrast.py main 98 INFO batch_size 1 2023-11-13 12:58:15,730 train_pointcontrast.py main 98 INFO epochs 15 2023-11-13 12:58:15,730 train_pointcontrast.py main 98 INFO workers 8 2023-11-13 12:58:15,730 train_pointcontrast.py main 98 INFO extra_tag default 2023-11-13 12:58:15,730 train_pointcontrast.py main 98 INFO ckpt None 2023-11-13 12:58:15,730 train_pointcontrast.py main 98 INFO pretrained_model None 2023-11-13 12:58:15,731 train_pointcontrast.py main 98 INFO launcher pytorch 2023-11-13 12:58:15,731 train_pointcontrast.py main 98 INFO tcp_port 18888 2023-11-13 12:58:15,731 train_pointcontrast.py main 98 INFO sync_bn False 2023-11-13 12:58:15,731 train_pointcontrast.py main 98 INFO fix_random_seed False 2023-11-13 12:58:15,731 train_pointcontrast.py main 98 INFO ckpt_save_interval 1 2023-11-13 12:58:15,731 train_pointcontrast.py main 98 INFO local_rank 0 2023-11-13 12:58:15,731 train_pointcontrast.py main 98 INFO max_ckpt_save_num 30 2023-11-13 12:58:15,731 train_pointcontrast.py main 98 INFO merge_all_iters_to_one_epoch False 2023-11-13 12:58:15,731 train_pointcontrast.py main 98 INFO set_cfgs None 2023-11-13 12:58:15,731 train_pointcontrast.py main 98 
INFO max_waiting_mins 0 2023-11-13 12:58:15,731 train_pointcontrast.py main 98 INFO start_epoch 0 2023-11-13 12:58:15,731 train_pointcontrast.py main 98 INFO num_epochs_to_eval 0 2023-11-13 12:58:15,731 train_pointcontrast.py main 98 INFO save_to_file False 2023-11-13 12:58:15,731 config.py log_config_to_file 13 INFO cfg.ROOT_DIR: /cluster/home/martiiv/DeepLearningProject/3DTrans 2023-11-13 12:58:15,731 config.py log_config_to_file 13 INFO cfg.LOCAL_RANK: 0 2023-11-13 12:58:15,731 config.py log_config_to_file 13 INFO cfg.CLASS_NAMES: ['Vehicle', 'Pedestrian', 'Cyclist'] 2023-11-13 12:58:15,731 config.py log_config_to_file 13 INFO cfg.USE_PRETRAIN_MODEL: False 2023-11-13 12:58:15,731 config.py log_config_to_file 10 INFO
cfg.DATA_CONFIG = edict() 2023-11-13 12:58:15,731 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.DATASET: ONCEDataset 2023-11-13 12:58:15,731 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.DATA_PATH: ../data/once 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.LABELED_RATIO: 0 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.POINT_CLOUD_RANGE: [-75.2, -75.2, -5.0, 75.2, 75.2, 3.0] 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.VOXEL_SIZE: [0.1, 0.1, 0.2] 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.UNLABELED_DATA_FOR: ['teacher', 'student'] 2023-11-13 12:58:15,732 config.py log_config_to_file 10 INFO
cfg.DATA_CONFIG.INFO_PATH = edict() 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.INFO_PATH.train: ['once_infos_train_vehicle.pkl'] 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.INFO_PATH.val: ['once_infos_val_vehicle.pkl'] 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.INFO_PATH.test: ['once_infos_test.pkl'] 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.INFO_PATH.raw_small: ['once_infos_raw_small.pkl'] 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.INFO_PATH.raw_medium: ['once_infos_raw_medium.pkl'] 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.INFO_PATH.raw_large: ['once_infos_raw_large.pkl'] 2023-11-13 12:58:15,732 config.py log_config_to_file 10 INFO
cfg.DATA_CONFIG.DATA_SPLIT = edict() 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.DATA_SPLIT.train: train 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.DATA_SPLIT.test: val 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.DATA_SPLIT.raw: raw_small 2023-11-13 12:58:15,732 config.py log_config_to_file 10 INFO
cfg.DATA_CONFIG.POINT_FEATURE_ENCODING = edict() 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.POINT_FEATURE_ENCODING.encoding_type: absolute_coordinates_encoding 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.POINT_FEATURE_ENCODING.used_feature_list: ['x', 'y', 'z', 'intensity'] 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.POINT_FEATURE_ENCODING.src_feature_list: ['x', 'y', 'z', 'intensity'] 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.DATA_PROCESSOR: [{'NAME': 'mask_points_and_boxes_outside_range', 'REMOVE_OUTSIDE_BOXES': True}, {'NAME': 'shuffle_points', 'SHUFFLE_ENABLED': {'train': True, 'test': False}}, {'NAME': 'transform_points_to_voxels', 'VOXEL_SIZE': [0.1, 0.1, 0.2], 'MAX_POINTS_PER_VOXEL': 5, 'MAX_NUMBER_OF_VOXELS': {'train': 60000, 'test': 60000}}] 2023-11-13 12:58:15,732 config.py log_config_to_file 10 INFO
cfg.DATA_CONFIG.DATA_AUGMENTOR = edict() 2023-11-13 12:58:15,732 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.DATA_AUGMENTOR.DISABLE_AUG_LIST: ['placeholder'] 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.DATA_AUGMENTOR.AUG_CONFIG_LIST: [{'NAME': 'gt_sampling', 'USE_ROAD_PLANE': False, 'DB_INFO_PATH': ['once_dbinfos_train_vehicle.pkl'], 'PREPARE': {'filter_by_min_points': ['Car:5', 'Bus:5', 'Truck:5', 'Pedestrian:5', 'Cyclist:5']}, 'SAMPLE_GROUPS': ['Car:1', 'Bus:4', 'Truck:3', 'Pedestrian:2', 'Cyclist:2'], 'NUM_POINT_FEATURES': 4, 'REMOVE_EXTRA_WIDTH': [0.0, 0.0, 0.0], 'LIMIT_WHOLE_SCENE': True}, {'NAME': 'random_world_flip', 'ALONG_AXIS_LIST': ['x', 'y']}, {'NAME': 'random_world_rotation', 'WORLD_ROT_ANGLE': [-0.78539816, 0.78539816]}, {'NAME': 'random_world_scaling', 'WORLD_SCALE_RANGE': [0.95, 1.05]}] 2023-11-13 12:58:15,733 config.py log_config_to_file 10 INFO
cfg.DATA_CONFIG.TEACHER_AUGMENTOR = edict() 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.TEACHER_AUGMENTOR.DISABLE_AUG_LIST: ['random_world_scaling'] 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.TEACHER_AUGMENTOR.AUG_CONFIG_LIST: [{'NAME': 'random_world_scaling', 'WORLD_SCALE_RANGE': [0.95, 1.05]}] 2023-11-13 12:58:15,733 config.py log_config_to_file 10 INFO
cfg.DATA_CONFIG.STUDENT_AUGMENTOR = edict() 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.STUDENT_AUGMENTOR.DISABLE_AUG_LIST: ['placeholder'] 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.STUDENT_AUGMENTOR.AUG_CONFIG_LIST: [{'NAME': 'random_world_flip', 'ALONG_AXIS_LIST': ['x', 'y']}, {'NAME': 'random_world_rotation', 'WORLD_ROT_ANGLE': [-0.78539816, 0.78539816]}, {'NAME': 'random_world_scaling', 'WORLD_SCALE_RANGE': [0.95, 1.05]}] 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG._BASECONFIG: cfgs/dataset_configs/once/PRETRAIN/unsupervised_once_dataset.yaml 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.DATA_CONFIG.USE_PAIR_PROCESSOR: True 2023-11-13 12:58:15,733 config.py log_config_to_file 10 INFO
cfg.OPTIMIZATION = edict() 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.NUM_EPOCHS: 15 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.OPTIMIZER: adam_onecycle 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LR: 0.001 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.WEIGHT_DECAY: 0.01 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.MOMENTUM: 0.9 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.MOMS: [0.95, 0.85] 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.PCT_START: 0.4 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.DIV_FACTOR: 10 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.DECAY_STEP_LIST: [35, 45] 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LR_DECAY: 0.1 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LR_CLIP: 1e-07 2023-11-13 12:58:15,733 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LR_WARMUP: False 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.WARMUP_EPOCH: -1 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.GRAD_NORM_CLIP: 10 2023-11-13 12:58:15,734 config.py log_config_to_file 10 INFO
cfg.OPTIMIZATION.LOSS_CFG = edict() 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LOSS_CFG.POS_THRESH: 0.1 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LOSS_CFG.NEG_THRESH: 1.4 2023-11-13 12:58:15,734 config.py log_config_to_file 10 INFO
cfg.OPTIMIZATION.LOSS_CFG.SA_LAYER = edict() 2023-11-13 12:58:15,734 config.py log_config_to_file 10 INFO
cfg.OPTIMIZATION.LOSS_CFG.SA_LAYER.x_conv3 = edict() 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LOSS_CFG.SA_LAYER.x_conv3.DOWNSAMPLE_FACTOR: 4 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LOSS_CFG.SA_LAYER.x_conv3.POOL_RADIUS: [1.2] 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LOSS_CFG.SA_LAYER.x_conv3.NSAMPLE: [16] 2023-11-13 12:58:15,734 config.py log_config_to_file 10 INFO
cfg.OPTIMIZATION.LOSS_CFG.SA_LAYER.x_conv4 = edict() 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LOSS_CFG.SA_LAYER.x_conv4.DOWNSAMPLE_FACTOR: 8 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LOSS_CFG.SA_LAYER.x_conv4.POOL_RADIUS: [2.4] 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LOSS_CFG.SA_LAYER.x_conv4.NSAMPLE: [16] 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LOSS_CFG.FEATURES_SOURCE: ['bev'] 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LOSS_CFG.POINT_SOURCE: raw_points 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LOSS_CFG.NUM_KEYPOINTS: 2048 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.LOSS_CFG.NUM_NEGATIVE_KEYPOINTS: 1024 2023-11-13 12:58:15,734 config.py log_config_to_file 10 INFO
cfg.OPTIMIZATION.TEST = edict() 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.OPTIMIZATION.TEST.BATCH_SIZE_PER_GPU: 4 2023-11-13 12:58:15,734 config.py log_config_to_file 10 INFO
cfg.MODEL = edict() 2023-11-13 12:58:15,734 config.py log_config_to_file 13 INFO cfg.MODEL.NAME: PVRCNN_PLUS_BACKBONE 2023-11-13 12:58:15,734 config.py log_config_to_file 10 INFO
cfg.MODEL.VFE = edict() 2023-11-13 12:58:15,735 config.py log_config_to_file 13 INFO cfg.MODEL.VFE.NAME: MeanVFE 2023-11-13 12:58:15,735 config.py log_config_to_file 10 INFO
cfg.MODEL.BACKBONE_3D = edict() 2023-11-13 12:58:15,735 config.py log_config_to_file 13 INFO cfg.MODEL.BACKBONE_3D.NAME: VoxelResBackBone8x 2023-11-13 12:58:15,735 config.py log_config_to_file 10 INFO
cfg.MODEL.MAP_TO_BEV = edict()
2023-11-13 12:58:15,735 config.py log_config_to_file 13 INFO cfg.MODEL.MAP_TO_BEV.NAME: HeightCompression
2023-11-13 12:58:15,735 config.py log_config_to_file 13 INFO cfg.MODEL.MAP_TO_BEV.NUM_BEV_FEATURES: 256
2023-11-13 12:58:15,735 config.py log_config_to_file 13 INFO cfg.TAG: pointcontrast_pvrcnn_res_plus_backbone
2023-11-13 12:58:15,735 config.py log_config_to_file 13 INFO cfg.EXP_GROUP_PATH: cfgs/once_models/unsupervised_model
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 506209) of binary: /cluster/home/martiiv/deeplearningproject/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=38966
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_eeetvw03/none_ji9a7o3n/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_eeetvw03/none_ji9a7o3n/attempt_1/1/error.json
program started
program started `

martiiv commented 11 months ago

Update: The code crashes in the following block:

    # build unsupervised dataloader
    datasets, dataloaders, samplers = build_unsupervised_dataloader(
        dataset_cfg=cfg.DATA_CONFIG,
        class_names=cfg.CLASS_NAMES,
        batch_size=args.batch_size,
        root_path=cfg.DATA_CONFIG.DATA_PATH,
        dist=dist_train, workers=args.workers,
        logger=logger,
        merge_all_iters_to_one_epoch=args.merge_all_iters_to_one_epoch
    )

This block is in tools/train_pointcontrast.py.

martiiv commented 11 months ago

Update: The code seems to crash in once_semi_dataset.py, on lines 64 and 65. The script sometimes manages to load the once_raw_data_small.pkl file and sometimes doesn't.

Do you have any idea why this might be happening @BOBrown?
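One way to check whether loading the info file is what pushes a process over its memory limit is to record the peak resident memory right after the load. The following is only a minimal sketch, assuming a Unix system (the `resource` module is not available on Windows). The helper name `load_with_peak_memory` and the temporary stand-in file are hypothetical and not part of 3DTrans; point it at the real .pkl file to get a meaningful number.

```python
import os
import pickle
import resource
import tempfile


def load_with_peak_memory(path):
    """Hypothetical helper: load a pickle and report this process's peak RSS."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    # ru_maxrss is the peak resident set size of the process so far.
    # Units differ by platform: kilobytes on Linux, bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return data, peak


# Stand-in file so the sketch runs anywhere; replace with the real info file.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
    pickle.dump([{"frame_id": str(i)} for i in range(1000)], tmp)
    demo_path = tmp.name

infos, peak_rss = load_with_peak_memory(demo_path)
os.remove(demo_path)
print(f"loaded {len(infos)} entries, peak RSS so far: {peak_rss}")
```

Comparing this number against the memory granted to the job may help explain why the load succeeds on some runs and not on others.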

martiiv commented 11 months ago

Update: Fixed the problem. It was due to a lack of GPU resources. Thanks for the help :)
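For anyone who hits the same message: Python reports a child process that was ended by signal N with the exit code -N, so the `exitcode: -9` in the error above corresponds to signal 9 (SIGKILL on Linux). The log does not say what sent the signal. Below is a small sketch that pulls the fields out of the launcher's error line and names the signal. The helper `describe_exit` and its regular expression are illustrative assumptions based on the one line quoted in this thread, not part of PyTorch.

```python
import re
import signal

# Error line copied from the launcher output in this thread.
LINE = (
    "ERROR:torch.distributed.elastic.multiprocessing.api:failed "
    "(exitcode: -9) local_rank: 1 (pid: 506209) of binary: "
    "/cluster/home/martiiv/deeplearningproject/bin/python"
)


def describe_exit(line):
    """Hypothetical helper: extract exit code, rank, and pid from an error line."""
    match = re.search(
        r"exitcode: (-?\d+)\) local_rank: (\d+) \(pid: (\d+)\)", line
    )
    if match is None:
        return None
    code, rank, pid = (int(group) for group in match.groups())
    # A negative exit code -N means the process was ended by signal N.
    sig = signal.Signals(-code).name if code < 0 else None
    return {"exitcode": code, "local_rank": rank, "pid": pid, "signal": sig}


info = describe_exit(LINE)
print(info)
```

Since SIGKILL cannot be caught by the worker, no Python traceback is written, which matches the log stopping abruptly after the config dump.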

PJLab-ADG / 3DTrans

Running pre-training script with point contrast #23