Open Git-oNmE opened 1 year ago
Sure, setting CUDA_VISIBLE_DEVICES and nproc_per_node is OK~ (^^)
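For reference, a minimal sketch of that launch pattern (the script name and port below are placeholders, not the repo's actual entry point). If a multi-GPU run later hangs, `NCCL_DEBUG=INFO` is a standard NCCL environment variable that prints collective-op logs and helps locate where the stall happens:

```shell
# Enable verbose NCCL logging so a hang in a collective op becomes visible.
export NCCL_DEBUG=INFO
echo "NCCL_DEBUG=$NCCL_DEBUG"

# Restrict the run to two specific GPUs and spawn one process per GPU.
# (Commented out here because it needs GPUs; script/port are placeholders.)
# CUDA_VISIBLE_DEVICES=5,6 python -m torch.distributed.launch \
#     --master_port 10024 --nproc_per_node=2 tools/train.py
```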
Thanks for your reply :)
Hi, I set CUDA_VISIBLE_DEVICES and nproc_per_node, but it doesn't work. Training just stops at this point, with no error message and no training output either. Could you help me figure out what is wrong here?
My train command is:
CUDA_VISIBLE_DEVICES=5,6 python -m torch.distributed.launch --master_port 10024 --nproc_per_node=2 tools/relation_train_net.py --config-file "configs/SHA_GCL_e2e_relation_X_101_32_8_FPN_1x.yaml" GLOBAL_SETTING.DATASET_CHOICE 'VG' GLOBAL_SETTING.RELATION_PREDICTOR 'TransLike_GCL' GLOBAL_SETTING.BASIC_ENCODER 'Hybrid-Attention' GLOBAL_SETTING.GCL_SETTING.GROUP_SPLIT_MODE 'divide4' GLOBAL_SETTING.GCL_SETTING.KNOWLEDGE_TRANSFER_MODE 'KL_logit_TopDown' MODEL.ROI_RELATION_HEAD.USE_GT_BOX True MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL True SOLVER.IMS_PER_BATCH 8 TEST.IMS_PER_BATCH 8 DTYPE "float16" SOLVER.MAX_ITER 60000 SOLVER.VAL_PERIOD 2000 SOLVER.CHECKPOINT_PERIOD 4000 GLOVE_DIR /media/data3/hlf_data/SHAGCL/SHA-GCL-for-SGG/datasets/vg/glove OUTPUT_DIR /media/data3/hlf_data/SHAGCL/SHA-GCL-for-SGG/output/PredCls_train
Thx :)
May I ask if you have solved this problem?
I'm running into the same problem. My run also stopped without any error message. My PyTorch version is 1.8.2 LTS. A single-GPU run succeeds, however.
It seems like a problem with the losses.backward() call, since the hang happens during it. I still haven't figured out why.
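One common cause of a deadlock inside `losses.backward()` under DistributedDataParallel is that some parameters receive no gradient on some ranks (e.g. a branch of the model is skipped), so the gradient all-reduce never completes. Below is a minimal CPU-only sketch (gloo backend, single process, toy linear model — all assumptions for illustration, not this repo's code) of the `find_unused_parameters=True` setting that addresses that case:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process group purely for illustration; a real run would use
# torch.distributed.launch / torchrun to spawn one process per GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
# find_unused_parameters=True lets DDP mark parameters that got no gradient
# on this rank as "ready" instead of waiting on them forever -- a frequent
# cause of backward() deadlocks in multi-GPU training.
ddp_model = DDP(model, find_unused_parameters=True)

loss = ddp_model(torch.randn(8, 4)).sum()
loss.backward()  # in a hung multi-GPU run, this is where the ranks block
grad_ok = model.weight.grad is not None

dist.destroy_process_group()
```

If enabling this makes the hang go away, it points at conditionally-used model parts; otherwise, mismatched numbers of forward/backward calls across ranks (e.g. data loaders yielding different iteration counts) are worth checking.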
With one GPU, my training process is too slow. I'd appreciate a reply :)