Hi, I'm trying distributed training with 2 machines; there are 4 GPUs in each machine.
On the master machine, I run:
python -u tools/run_net.py \
--cfg configs/Kinetics/SLOWFAST_8x8_R50.yaml \
--shard_id 0 \
--num_shards 2 \
--init_method tcp://127.0.0.1:40000 \
DATA.PATH_TO_DATA_DIR data/k400 \
DATA.PATH_PREFIX /root/data/datasets/k400_non_local_wangxiaolong/compress \
NUM_GPUS 4 \
TRAIN.BATCH_SIZE 32 \
TRAIN.EVAL_PERIOD 1
On the second machine, I run:
python -u tools/run_net.py \
--cfg configs/Kinetics/SLOWFAST_8x8_R50.yaml \
--shard_id 1 \
--num_shards 2 \
--init_method tcp://10.110.17.113:50933 \
DATA.PATH_TO_DATA_DIR data/k400 \
DATA.PATH_PREFIX /root/data/datasets/k400_non_local_wangxiaolong/compress \
NUM_GPUS 4 \
TRAIN.BATCH_SIZE 32 \
TRAIN.EVAL_PERIOD 1
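As an aside, a minimal cross-machine rendezvous test with plain torch.distributed looks like the following. This is a sketch, not part of the SlowFast codebase; the master IP:port is a placeholder, and RANK/WORLD_SIZE are assumed to be set in the environment (RANK=0 on the master, RANK=1 on the second machine, WORLD_SIZE=2 on both):

import os
import torch
import torch.distributed as dist

# Sketch: RANK and WORLD_SIZE come from the environment; the address
# below is a placeholder for the master machine's IP and port.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
dist.init_process_group(
    backend="nccl",
    init_method="tcp://10.110.17.113:40000",  # placeholder master IP:port
    rank=rank,
    world_size=world_size,
)
# One all_reduce: if this prints on both nodes, the rendezvous and the
# cross-machine NCCL transport both work.
t = torch.ones(1).cuda()
dist.all_reduce(t)
print(f"rank {rank}: all_reduce ok, value = {t.item()}")
dist.destroy_process_group()

If this test also hangs, the problem is in the network/NCCL setup between the machines rather than in the training code.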
The log on the master is:
[INFO: kinetics.py: 78]: Constructing Kinetics train...
[INFO: kinetics.py: 113]: Constructing kinetics dataloader (size: 234584) from data/k400/train.csv
[INFO: kinetics.py: 78]: Constructing Kinetics val...
[INFO: kinetics.py: 113]: Constructing kinetics dataloader (size: 19760) from data/k400/val.csv
[INFO: train_net.py: 307]: Start epoch: 1
Then it hangs.
The code and the data on both machines are exactly the same, and the world size is 8.
I find that training on both machines hangs at the same line of code:
optimizer.step()
Does anyone know why this happens? What should I do?
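One thing I plan to try next is rerunning both commands with NCCL's debug logging enabled, to see where the collective stalls; NCCL_DEBUG is a standard NCCL environment variable, not something SlowFast-specific:

NCCL_DEBUG=INFO python -u tools/run_net.py \
--cfg configs/Kinetics/SLOWFAST_8x8_R50.yaml \
(rest of the arguments as above)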