Open zhangyuan1994511 opened 3 years ago
Or could you provide the training script for SGDet e2e training?
This error is usually caused by a tensor on one device being used in a computation with a tensor on another device, but I have never run into it myself.
Your script looks correct. Could it be a problem with APEX? This code uses APEX for distributed training.
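One way to narrow this down, offered as a sketch rather than anything taken from this repository: CUDA kernels run asynchronously, so the Python stack trace for a device-side assert often points at a later, unrelated line. Setting the `CUDA_LAUNCH_BLOCKING` environment variable to `1` makes kernel launches synchronous, so the trace lands on the call that actually failed. The helper name `enable_synchronous_cuda_errors` is hypothetical and only for illustration.

```python
import os


def enable_synchronous_cuda_errors() -> None:
    """Ask CUDA to run kernel launches synchronously (hypothetical helper).

    This only takes effect if it runs before the process initializes CUDA,
    so call it at the very top of the training script.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_synchronous_cuda_errors()
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Prefixing the launch command with `CUDA_LAUNCH_BLOCKING=1` in the shell has the same effect and is inherited by the worker processes. Training is slower in this mode, so it is only meant for debugging.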
But when I change 'MODEL.ROI_RELATION_HEAD.CAUSAL.CONTEXT_LAYER vctree' to 'MODEL.ROI_RELATION_HEAD.CAUSAL.CONTEXT_LAYER motifs', training runs fine...
So it may not be a problem with APEX. Could it be a problem with the training labels?
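A quick way to test the label hypothesis, as a minimal sketch under stated assumptions: device-side asserts are commonly triggered when a class index passed to a loss or an embedding lookup falls outside the valid range. The function name `find_out_of_range_labels` and the value `num_classes=51` below are illustrative assumptions, not taken from this codebase. The check works on plain Python integers, so a label tensor would first be converted with `.tolist()`.

```python
from typing import List, Sequence, Tuple


def find_out_of_range_labels(
    labels: Sequence[int], num_classes: int
) -> List[Tuple[int, int]]:
    """Return (position, label) pairs whose label is outside [0, num_classes)."""
    return [
        (position, int(label))
        for position, label in enumerate(labels)
        if not 0 <= int(label) < num_classes
    ]


# Illustrative labels only; num_classes=51 is an assumed value.
example_labels = [0, 3, 50, 51, -1]
bad = find_out_of_range_labels(example_labels, num_classes=51)
print(bad)  # [(3, 51), (4, -1)]
```

If the loss is configured with an ignore index (for example a negative sentinel value), those entries would need to be filtered out before running this check.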
I see. It could be my fault, because I updated part of the VCTree code after I published the paper. I will check it once I have free GPUs (I'm currently working on another project).
OK, thank you for your reply! If you fix the bug, please let me know~
@KaihuaTang Have you figured out this error? I ran into the same problem.
When I ran the training code I hit the same error, 'RuntimeError: CUDA error: device-side assert triggered'. In my case it seems to be a PyTorch bug: my environment was torch 1.8.1+cuda11.1, and after I updated to torch 1.8.2+cuda11.1 the error disappeared.
@KaihuaTang Hi, author! Thanks for your great work! I want to train SGDet e2e with this script: 'CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --master_port 10026 --nproc_per_node=2 tools/relation_train_net.py --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml" MODEL.ROI_RELATION_HEAD.USE_GT_BOX False MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL False MODEL.ROI_RELATION_HEAD.PREDICTOR CausalAnalysisPredictor MODEL.ROI_RELATION_HEAD.CAUSAL.EFFECT_TYPE none MODEL.ROI_RELATION_HEAD.CAUSAL.FUSION_TYPE sum MODEL.ROI_RELATION_HEAD.CAUSAL.CONTEXT_LAYER vctree SOLVER.IMS_PER_BATCH 12 TEST.IMS_PER_BATCH 2 DTYPE "float16" SOLVER.MAX_ITER 50000 SOLVER.VAL_PERIOD 2000 SOLVER.CHECKPOINT_PERIOD 2000 GLOVE_DIR ./glove MODEL.PRETRAINED_DETECTOR_CKPT ./pretrained_faster_rcnn/model_final.pth OUTPUT_DIR ./checkpoints/causal-motifs-sgcls-exmp'. But I get 'RuntimeError: CUDA error: device-side assert triggered'. Why? Looking forward to your reply!