dongxingning / SHA-GCL-for-SGG

Code for paper "Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation"

Can this work be done with multiple GPUs? #8

Open Git-oNmE opened 1 year ago

Git-oNmE commented 1 year ago

With one GPU, my training process is too slow. I'd appreciate a reply :)

dongxingning commented 1 year ago

Sure, setting CUDA_VISIBLE_DEVICES and nproc_per_node is enough~ (^^)
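For example, a two-GPU launch would look something like the sketch below. The GPU indices, port, and batch-size overrides are illustrative, not from this thread; only the script and config file come from this repo, and --nproc_per_node must match the number of GPUs listed in CUDA_VISIBLE_DEVICES:

```bash
# Minimal two-GPU launch sketch (GPU indices and port are illustrative).
# CUDA_VISIBLE_DEVICES picks the physical GPUs; --nproc_per_node spawns
# one worker process per visible GPU.
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
    --master_port 10025 --nproc_per_node=2 \
    tools/relation_train_net.py \
    --config-file "configs/SHA_GCL_e2e_relation_X_101_32_8_FPN_1x.yaml" \
    SOLVER.IMS_PER_BATCH 8 TEST.IMS_PER_BATCH 8
```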

Git-oNmE commented 1 year ago

Thanks for your reply :)

Git-oNmE commented 1 year ago

Hi, I set CUDA_VISIBLE_DEVICES and nproc_per_node, but it doesn't work. The run just stops at this point, with no error message and no training output either. Could you help me figure out what is wrong here?

My training command is: `CUDA_VISIBLE_DEVICES=5,6 python -m torch.distributed.launch --master_port 10024 --nproc_per_node=2 tools/relation_train_net.py --config-file "configs/SHA_GCL_e2e_relation_X_101_32_8_FPN_1x.yaml" GLOBAL_SETTING.DATASET_CHOICE 'VG' GLOBAL_SETTING.RELATION_PREDICTOR 'TransLike_GCL' GLOBAL_SETTING.BASIC_ENCODER 'Hybrid-Attention' GLOBAL_SETTING.GCL_SETTING.GROUP_SPLIT_MODE 'divide4' GLOBAL_SETTING.GCL_SETTING.KNOWLEDGE_TRANSFER_MODE 'KL_logit_TopDown' MODEL.ROI_RELATION_HEAD.USE_GT_BOX True MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL True SOLVER.IMS_PER_BATCH 8 TEST.IMS_PER_BATCH 8 DTYPE "float16" SOLVER.MAX_ITER 60000 SOLVER.VAL_PERIOD 2000 SOLVER.CHECKPOINT_PERIOD 4000 GLOVE_DIR /media/data3/hlf_data/SHAGCL/SHA-GCL-for-SGG/datasets/vg/glove OUTPUT_DIR /media/data3/hlf_data/SHAGCL/SHA-GCL-for-SGG/output/PredCls_train`. Thx :)

Zhuzi24 commented 1 year ago

> Hi, I set CUDA_VISIBLE_DEVICES and nproc_per_node, but it doesn't work. [...]

May I ask if you have solved this problem?

zhanwenchen commented 1 year ago

I'm running into the same problem. My run also stopped without any error message. My PyTorch version is 1.8.2 LTS. The single-GPU run is successful, however.

zhanwenchen commented 1 year ago

It seems to be a problem with the `losses.backward()` call, since the hang happens during it. I still haven't figured out why.
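One standard first diagnostic for a silent NCCL hang (a sketch, not a confirmed fix for this repo) is to rerun with NCCL's verbose logging enabled, and optionally with peer-to-peer transfers disabled, since broken P2P links are a common cause of multi-GPU runs stalling in backward() with no error:

```bash
# Hypothetical debugging rerun, not a confirmed fix.
# NCCL_DEBUG=INFO prints per-rank collective-communication logs, so you can
# see which rank and which collective stalls; NCCL_P2P_DISABLE=1 forces NCCL
# to avoid peer-to-peer GPU copies, which hang on some machine/driver setups.
NCCL_DEBUG=INFO NCCL_P2P_DISABLE=1 \
CUDA_VISIBLE_DEVICES=5,6 python -m torch.distributed.launch \
    --master_port 10024 --nproc_per_node=2 \
    tools/relation_train_net.py \
    --config-file "configs/SHA_GCL_e2e_relation_X_101_32_8_FPN_1x.yaml"
```

If the logs show one rank waiting in an all-reduce while the other has moved on, the usual suspects are mismatched iteration counts across ranks or parameters that receive gradients on only some ranks.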