autonomousvision / unimatch

[TPAMI'23] Unifying Flow, Stereo and Depth Estimation
https://haofeixu.github.io/unimatch/
MIT License

Error about local-rank #21

Closed Eunsunggcu closed 1 year ago

Eunsunggcu commented 1 year ago

I wanted to run training for depth, so I tried the following command in Colab, using gmflow_scale1_train.sh as an example.

!CHECKPOINT_DIR=checkpoints_flow/chairs-gmflow-scale1 && \
mkdir -p ${CHECKPOINT_DIR} && \
python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=9989 main_depth.py \
--launcher pytorch \
--checkpoint_dir ${CHECKPOINT_DIR} \
--resume pretrained/gmflow-scale1-things-e9887eda.pth \
--no_resume_optimizer \
--dataset scannet \
--val_dataset scannet \
--image_size 480 640 \
--batch_size 8 \
--lr 4e-4 \
--summary_freq 100 \
--val_freq 5000 \
--save_ckpt_freq 5000 \
--num_steps 100000 \
2>&1 | tee -a

Result:

main_depth.py: error: unrecognized arguments: --local-rank=3
usage: main_depth.py [-h] [--checkpoint_dir CHECKPOINT_DIR]
                     [--dataset DATASET]
                     [--val_dataset VAL_DATASET [VAL_DATASET ...]]
                     [--image_size IMAGE_SIZE [IMAGE_SIZE ...]]
                     [--padding_factor PADDING_FACTOR] [--eval]
                     [--demon_split DEMON_SPLIT]
                     [--eval_min_depth EVAL_MIN_DEPTH]
                     [--eval_max_depth EVAL_MAX_DEPTH] [--save_vis_depth]
                     [--count_time] [--lr LR] [--batch_size BATCH_SIZE]
                     [--weight_decay WEIGHT_DECAY] [--workers WORKERS]
                     [--seed SEED] [--summary_freq SUMMARY_FREQ]
                     [--save_ckpt_freq SAVE_CKPT_FREQ]
                     [--save_latest_ckpt_freq SAVE_LATEST_CKPT_FREQ]
                     [--val_freq VAL_FREQ] [--num_steps NUM_STEPS]
                     [--resume RESUME] [--strict_resume]
                     [--no_resume_optimizer] [--task TASK]
                     [--num_scales NUM_SCALES]
                     [--feature_channels FEATURE_CHANNELS]
                     [--upsample_factor UPSAMPLE_FACTOR] [--num_head NUM_HEAD]
                     [--ffn_dim_expansion FFN_DIM_EXPANSION]
                     [--num_transformer_layers NUM_TRANSFORMER_LAYERS]
                     [--reg_refine] [--attn_type ATTN_TYPE]
                     [--attn_splits_list ATTN_SPLITS_LIST [ATTN_SPLITS_LIST ...]]
                     [--min_depth MIN_DEPTH] [--max_depth MAX_DEPTH]
                     [--num_depth_candidates NUM_DEPTH_CANDIDATES]
                     [--prop_radius_list PROP_RADIUS_LIST [PROP_RADIUS_LIST ...]]
                     [--num_reg_refine NUM_REG_REFINE]
                     [--depth_loss_weight DEPTH_LOSS_WEIGHT]
                     [--depth_grad_loss_weight DEPTH_GRAD_LOSS_WEIGHT]
                     [--inference_dir INFERENCE_DIR]
                     [--inference_size INFERENCE_SIZE [INFERENCE_SIZE ...]]
                     [--output_path OUTPUT_PATH] [--depth_from_argmax]
                     [--pred_bidir_depth] [--distributed]
                     [--local_rank LOCAL_RANK] [--launcher LAUNCHER]
                     [--gpu_ids GPU_IDS [GPU_IDS ...]] [--debug]
main_depth.py: error: unrecognized arguments: --local-rank=1
main_depth.py: error: unrecognized arguments: --local-rank=0
main_depth.py: error: unrecognized arguments: --local-rank=2
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 8007) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main_depth.py FAILED

But I got this error. How can I solve it? Please be my hero :)

haofeixu commented 1 year ago

Can you check whether there is a typo here: [image]
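
For reference, the log above shows the launcher passing `--local-rank` (with a hyphen), while the usage message lists `--local_rank` (with an underscore). From PyTorch 2.0 onwards, `torch.distributed.launch` passes the hyphenated form to the script. Below is a minimal sketch of one possible way to accept both spellings with `argparse`. It is not the repository's actual code: `build_parser` is a hypothetical stand-in for the parser in `main_depth.py`, and the fallback to the `LOCAL_RANK` environment variable is an assumption about how the script is launched.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the parser in main_depth.py;
    # only the rank option is shown.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--local_rank", "--local-rank",  # accept both spellings
        dest="local_rank",
        type=int,
        # Assumption: fall back to the LOCAL_RANK environment variable
        # if the launcher sets it, otherwise use 0.
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    return parser


# Simulate what the launcher appended in the log above.
args = build_parser().parse_args(["--local-rank=3"])
print(args.local_rank)  # prints 3
```

With this, the same script should parse under launchers that pass either form of the flag.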