facebookresearch / SlowFast

PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
Apache License 2.0

distributed training error #234

Open · gooners1886 opened this issue 4 years ago

gooners1886 commented 4 years ago

Hi, I am trying distributed training with 2 machines, each with 4 GPUs. On the master machine I run:

python -u tools/run_net.py \
  --cfg configs/Kinetics/SLOWFAST_8x8_R50.yaml \
  --shard_id 0 \
  --num_shards 2 \
  --init_method tcp://127.0.0.1:40000 \
  DATA.PATH_TO_DATA_DIR data/k400 \
  DATA.PATH_PREFIX /root/data/datasets/k400_non_local_wangxiaolong/compress \
  NUM_GPUS 4 \
  TRAIN.BATCH_SIZE 32 \
  TRAIN.EVAL_PERIOD 1

On the second machine I run:

python -u tools/run_net.py \
  --cfg configs/Kinetics/SLOWFAST_8x8_R50.yaml \
  --shard_id 1 \
  --num_shards 2 \
  --init_method tcp://10.110.17.113:50933 \
  DATA.PATH_TO_DATA_DIR data/k400 \
  DATA.PATH_PREFIX /root/data/datasets/k400_non_local_wangxiaolong/compress \
  NUM_GPUS 4 \
  TRAIN.BATCH_SIZE 32 \
  TRAIN.EVAL_PERIOD 1
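
For context, a minimal sketch (not the actual PySlowFast code; the helper name and arguments are placeholders) of how flags like `--num_shards`, `--shard_id`, `NUM_GPUS`, and `--init_method` are typically combined into a `torch.distributed` setup; the rendezvous blocks until all `world_size` ranks have connected to the same `init_method` address, which therefore has to be a master address reachable from every machine:

```python
import torch
import torch.distributed as dist

def init_distributed(local_rank, num_gpus, shard_id, num_shards, init_method):
    """Hypothetical helper: derive global rank/world size from per-machine values."""
    world_size = num_gpus * num_shards          # e.g. 4 GPUs x 2 shards = 8
    global_rank = shard_id * num_gpus + local_rank
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        init_method=init_method,                # same tcp://<master_ip>:<port> on every shard
        world_size=world_size,
        rank=global_rank,
    )
```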

The log on the master is:

[INFO: kinetics.py: 78]: Constructing Kinetics train...
[INFO: kinetics.py: 113]: Constructing kinetics dataloader (size: 234584) from data/k400/train.csv
[INFO: kinetics.py: 78]: Constructing Kinetics val...
[INFO: kinetics.py: 113]: Constructing kinetics dataloader (size: 19760) from data/k400/val.csv
[INFO: train_net.py: 307]: Start epoch: 1

Then it hangs.

The code and the data on both machines are exactly the same, and the world size is 8. I find that training on both machines hangs at the same line of code:
optimizer.step()

Does anyone know why this happens? What should I do?
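
Since the hang happens at the first synchronized step, one way to check whether the 8 ranks can actually rendezvous and all-reduce outside of PySlowFast is a small smoke test. This is only a sketch; the script name, the master address/port, and the rank bookkeeping are placeholders to adapt per machine:

```python
# smoke_test.py -- hypothetical standalone check, not part of PySlowFast.
# Run one copy per GPU on each machine:  python smoke_test.py --rank <global_rank>
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--rank", type=int, required=True)       # 0..7 for 2 machines x 4 GPUs
parser.add_argument("--world-size", type=int, default=8)
parser.add_argument("--init-method", default="tcp://10.110.17.113:50933")  # same URL on both machines
args = parser.parse_args()

torch.cuda.set_device(args.rank % 4)                          # 4 GPUs per machine
dist.init_process_group("nccl", init_method=args.init_method,
                        world_size=args.world_size, rank=args.rank)

# Each rank contributes its rank id; the sum should be 0+1+...+7 = 28 on every rank.
t = torch.tensor([float(args.rank)], device="cuda")
dist.all_reduce(t)
print(f"rank {args.rank}: all_reduce sum = {t.item()}")
dist.destroy_process_group()
```

If this hangs in the same way, the problem is in the rendezvous/NCCL connectivity between the two machines rather than in the training code.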

bqhuyy commented 3 years ago

I met the same problem :( Does anyone have a solution?

FightAllDays commented 2 years ago

Hi! I also met the same problem. Have you solved it?