How do I specify multiple GPUs for distributed training?

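Hello, and thanks for your interest! Please refer to the training command given in our README. Multi-GPU training only runs correctly when it is launched with that command.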

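If you want to select specific GPUs, set up the training command as follows (the format may not be exactly right, since I have not tested it, and the exact syntax may need small adjustments; for details, see this link: [知乎] "Specifying GPUs in PyTorch"):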

CUDA_VISIBLE_DEVICES=3,4,5 \
torchrun \
  --standalone \
  --nnodes=1 \
  --nproc_per_node=3 \
main_train.py \
  --world_size 1 \
  --batch_size 1 \
  --data_path "<Your custom dataset path>/CASIA2.0" \
  --epochs 200 \
  --lr 1e-4 \
  --min_lr 5e-7 \
  --weight_decay 0.05 \
  --edge_lambda 20 \
  --predict_head_norm "BN" \
  --vit_pretrain_path "<Your path to pretrained weights >/mae_pretrain_vit_base.pth" \
  --test_data_path "<Your custom dataset path>/CASIA1.0" \
  --warmup_epochs 4 \
  --output_dir ./output_dir/ \
  --log_dir ./output_dir/  \
  --accum_iter 32 \
  --seed 42 \
  --test_period 4 \
  --num_workers 4 \
  2> train_error.log 1>train_log.log
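
As a rough illustration of what `CUDA_VISIBLE_DEVICES=3,4,5` does: CUDA renumbers the visible devices from 0 inside the process, so `--nproc_per_node` should equal the number of listed GPUs. The helper below is a hypothetical sketch (the name `logical_to_physical` is not part of IML-ViT or PyTorch) and assumes the variable holds plain integer indices rather than GPU UUIDs.

```python
def logical_to_physical(cuda_visible_devices: str) -> dict:
    """Map the logical index seen inside the process to the physical GPU index.

    Assumes a comma-separated list of integer indices such as "3,4,5".
    """
    physical = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    return {logical: phys for logical, phys in enumerate(physical)}


mapping = logical_to_physical("3,4,5")
print(mapping)  # {0: 3, 1: 4, 2: 5}

# One worker process per visible GPU, matching --nproc_per_node=3 above.
nproc_per_node = 3
assert nproc_per_node == len(mapping)
```

Inside the training script, `cuda:0` would then refer to physical GPU 3, `cuda:1` to GPU 4, and `cuda:2` to GPU 5.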

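The message `Not using distributed mode` comes from the location linked below. Please confirm that you launched training in the way shown above, and in particular that the multi-GPU parameters were actually set on the command line. The `if` conditions just above that line show what needs to be satisfied. https://github.com/SunnyHaze/IML-ViT/blob/3ffd03db8b95824ce0b67c55ee1628ec106a6666/utils/misc.py#L236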
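
For orientation, here is a minimal sketch of the kind of check that typically produces this message. It assumes the script decides on distributed mode by looking for the `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` environment variables, which `torchrun` sets for every worker it spawns. The function name `detect_distributed` is hypothetical, and the linked `utils/misc.py` remains the authoritative version.

```python
from typing import Mapping, Optional


def detect_distributed(env: Mapping[str, str]) -> Optional[dict]:
    """Return rank information if the launcher variables are present, else None."""
    if "RANK" in env and "WORLD_SIZE" in env:
        return {
            "rank": int(env["RANK"]),
            "world_size": int(env["WORLD_SIZE"]),
            # Assumption: fall back to 0 if LOCAL_RANK is missing.
            "local_rank": int(env.get("LOCAL_RANK", 0)),
        }
    return None


# Launched with plain `python main_train.py`: no launcher variables are set.
plain_env = {}
# Launched with torchrun: each worker gets its own RANK / LOCAL_RANK.
torchrun_env = {"RANK": "2", "WORLD_SIZE": "3", "LOCAL_RANK": "2"}

for env in (plain_env, torchrun_env):
    info = detect_distributed(env)
    if info is None:
        print("Not using distributed mode")
    else:
        print(info)
```

Under this assumption, running the script directly with `python` instead of `torchrun` leaves the variables unset, which is the usual reason the message appears.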

Best regards. If you have more detailed information, we can discuss it further in this issue.

SunnyHaze / IML-ViT

How do I specify multiple GPUs for distributed training? #11