Closed Eunsunggcu closed 1 year ago
I wanted to run the depth training, so I tried it by entering this bash command in Colab, using gmflow_scale1_train.sh as an example:
```bash
!CHECKPOINT_DIR=checkpoints_flow/chairs-gmflow-scale1 && \
mkdir -p ${CHECKPOINT_DIR} && \
python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=9989 main_depth.py \
  --launcher pytorch \
  --checkpoint_dir ${CHECKPOINT_DIR} \
  --resume pretrained/gmflow-scale1-things-e9887eda.pth \
  --no_resume_optimizer \
  --dataset scannet \
  --val_dataset scannet \
  --image_size 480 640 \
  --batch_size 8 \
  --lr 4e-4 \
  --summary_freq 100 \
  --val_freq 5000 \
  --save_ckpt_freq 5000 \
  --num_steps 100000 \
  2>&1 | tee -a
```
Result:
```
main_depth.py: error: unrecognized arguments: --local-rank=3
usage: main_depth.py [-h] [--checkpoint_dir CHECKPOINT_DIR] [--dataset DATASET]
                     [--val_dataset VAL_DATASET [VAL_DATASET ...]]
                     [--image_size IMAGE_SIZE [IMAGE_SIZE ...]]
                     [--padding_factor PADDING_FACTOR] [--eval]
                     [--demon_split DEMON_SPLIT] [--eval_min_depth EVAL_MIN_DEPTH]
                     [--eval_max_depth EVAL_MAX_DEPTH] [--save_vis_depth]
                     [--count_time] [--lr LR] [--batch_size BATCH_SIZE]
                     [--weight_decay WEIGHT_DECAY] [--workers WORKERS]
                     [--seed SEED] [--summary_freq SUMMARY_FREQ]
                     [--save_ckpt_freq SAVE_CKPT_FREQ]
                     [--save_latest_ckpt_freq SAVE_LATEST_CKPT_FREQ]
                     [--val_freq VAL_FREQ] [--num_steps NUM_STEPS]
                     [--resume RESUME] [--strict_resume] [--no_resume_optimizer]
                     [--task TASK] [--num_scales NUM_SCALES]
                     [--feature_channels FEATURE_CHANNELS]
                     [--upsample_factor UPSAMPLE_FACTOR] [--num_head NUM_HEAD]
                     [--ffn_dim_expansion FFN_DIM_EXPANSION]
                     [--num_transformer_layers NUM_TRANSFORMER_LAYERS]
                     [--reg_refine] [--attn_type ATTN_TYPE]
                     [--attn_splits_list ATTN_SPLITS_LIST [ATTN_SPLITS_LIST ...]]
                     [--min_depth MIN_DEPTH] [--max_depth MAX_DEPTH]
                     [--num_depth_candidates NUM_DEPTH_CANDIDATES]
                     [--prop_radius_list PROP_RADIUS_LIST [PROP_RADIUS_LIST ...]]
                     [--num_reg_refine NUM_REG_REFINE]
                     [--depth_loss_weight DEPTH_LOSS_WEIGHT]
                     [--depth_grad_loss_weight DEPTH_GRAD_LOSS_WEIGHT]
                     [--inference_dir INFERENCE_DIR]
                     [--inference_size INFERENCE_SIZE [INFERENCE_SIZE ...]]
                     [--output_path OUTPUT_PATH] [--depth_from_argmax]
                     [--pred_bidir_depth] [--distributed]
                     [--local_rank LOCAL_RANK] [--launcher LAUNCHER]
                     [--gpu_ids GPU_IDS [GPU_IDS ...]] [--debug]
main_depth.py: error: unrecognized arguments: --local-rank=1
main_depth.py: error: unrecognized arguments: --local-rank=0
main_depth.py: error: unrecognized arguments: --local-rank=2
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 8007) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main_depth.py FAILED
```
But this error came out. How can we solve this? Please be my hero :)
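For what it's worth, this error pattern usually points to a launcher/argparse spelling mismatch: newer PyTorch versions of `torch.distributed.launch` pass `--local-rank` (with a hyphen) to each worker process, while the usage message above shows the script only registers `--local_rank` (with an underscore). A minimal sketch of an argument definition that tolerates both spellings (this is an illustrative workaround, not the repo's actual code):

```python
import argparse

parser = argparse.ArgumentParser()
# Register both spellings for the same destination:
# PyTorch >= 2.0's torch.distributed.launch passes --local-rank=N,
# while older versions pass --local_rank=N. argparse derives the
# attribute name (local_rank) from the first long option string.
parser.add_argument('--local_rank', '--local-rank', type=int, default=0)

args = parser.parse_args(['--local-rank=3'])
print(args.local_rank)  # -> 3
```

Alternatively, launching with `torchrun` and reading the `LOCAL_RANK` environment variable inside the script avoids the command-line argument entirely; pinning an older torch version that still passes `--local_rank` may also work.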
Can you check whether there is a typo here:
```bash
!CHECKPOINT_DIR=checkpoints_flow/chairs-gmflow-scale1 && \
mkdir -p ${CHECKPOINT_DIR} && \
python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=9989 main_depth.py \
  --launcher pytorch \
  --checkpoint_dir ${CHECKPOINT_DIR} \
  --resume pretrained/gmflow-scale1-things-e9887eda.pth \
  --no_resume_optimizer \
  --dataset scannet \
  --val_dataset scannet \
  --image_size 480 640 \
  --batch_size 8 \
  --lr 4e-4 \
  --summary_freq 100 \
  --val_freq 5000 \
  --save_ckpt_freq 5000 \
  --num_steps 100000 \
  2>&1 | tee -a
```