Train SGDet Error - Githubissues

KaihuaTang / Scene-Graph-Benchmark.pytorch

A new codebase for popular Scene Graph Generation methods (2020). Visualization & Scene Graph Extraction on custom images/datasets are provided. It's also a PyTorch implementation of the paper “Unbiased Scene Graph Generation from Biased Training” (CVPR 2020).

MIT License

1.03k stars, 228 forks

Train SGDet Error #81

Open zhangyuan1994511 opened 3 years ago

zhangyuan1994511 commented 3 years ago

@KaihuaTang Hi, author! Thanks for your great work! I want to train SGDet end to end (e2e) with this script: 'CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --master_port 10026 --nproc_per_node=2 tools/relation_train_net.py --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml" MODEL.ROI_RELATION_HEAD.USE_GT_BOX False MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL False MODEL.ROI_RELATION_HEAD.PREDICTOR CausalAnalysisPredictor MODEL.ROI_RELATION_HEAD.CAUSAL.EFFECT_TYPE none MODEL.ROI_RELATION_HEAD.CAUSAL.FUSION_TYPE sum MODEL.ROI_RELATION_HEAD.CAUSAL.CONTEXT_LAYER vctree SOLVER.IMS_PER_BATCH 12 TEST.IMS_PER_BATCH 2 DTYPE "float16" SOLVER.MAX_ITER 50000 SOLVER.VAL_PERIOD 2000 SOLVER.CHECKPOINT_PERIOD 2000 GLOVE_DIR ./glove MODEL.PRETRAINED_DETECTOR_CKPT ./pretrained_faster_rcnn/model_final.pth OUTPUT_DIR ./checkpoints/causal-motifs-sgcls-exmp'. But I get 'RuntimeError: CUDA error: device-side assert triggered'. Why? Looking forward to your reply!

zhangyuan1994511 commented 3 years ago

Or could you provide the training script for SGDet e2e?

KaihuaTang commented 3 years ago

This error is usually caused by a tensor on device A being used in a computation with a tensor on device B, but I have never hit this error myself.

Your script looks correct. Could it be a problem with APEX? This code uses APEX for distributed training.
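
A general way to narrow down a device-side assert, offered here as a suggestion rather than something tried in this thread, is to rerun with `CUDA_LAUNCH_BLOCKING=1`. CUDA reports kernel errors asynchronously, so without this variable the Python traceback may point at a later, unrelated call. The helper name `blocking_env` below is hypothetical; only the environment variable itself is standard CUDA behaviour.

```python
import os


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # With CUDA_LAUNCH_BLOCKING=1 each kernel launch is synchronous, so the
    # traceback points at the operation that actually failed.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Hypothetical usage, where `cmd` is the training command as an argument list:
#   import subprocess
#   subprocess.run(cmd, env=blocking_env(), check=True)
```

Running on a single GPU while debugging usually makes the resulting traceback easier to read.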

zhangyuan1994511 commented 3 years ago

But when I change 'MODEL.ROI_RELATION_HEAD.CAUSAL.CONTEXT_LAYER vctree' to 'MODEL.ROI_RELATION_HEAD.CAUSAL.CONTEXT_LAYER motifs', training runs fine...

zhangyuan1994511 commented 3 years ago

So it may not be a problem with APEX. Could it be a problem with the training labels?
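
Whether the labels are at fault can be checked directly. A frequent trigger for device-side asserts in general is an index outside the valid range (for example a class label that is too large for the loss or embedding it feeds), though nothing in this thread confirms that is the cause here. The sketch below uses plain Python lists and a hypothetical helper name; with tensors, one would pass something like `labels.tolist()`.

```python
def out_of_range_labels(labels, num_classes):
    """Return (position, value) pairs for labels outside [0, num_classes)."""
    return [(i, v) for i, v in enumerate(labels) if not 0 <= v < num_classes]


# Illustrative values only: 5 classes means valid indices are 0..4.
bad = out_of_range_labels([0, 3, 4, 5, -1], num_classes=5)
print(bad)  # [(3, 5), (4, -1)]
```

An empty result for every batch would suggest the labels are not the issue and the problem lies elsewhere, such as in the VCTree-specific code path.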

KaihuaTang commented 3 years ago

So it may not be a problem with APEX. Could it be a problem with the training labels?

I see. It could be my fault, because I updated part of the VCTree code after I published the paper. I will check it once I have free GPUs (I'm currently working on another project).

zhangyuan1994511 commented 3 years ago

OK, thank you for your reply! If you fix the bug, please let me know~

wtt0213 commented 3 years ago

@KaihuaTang Have you figured out this error? I ran into the same problem.

YangBowenn commented 2 years ago

When I ran the training code I hit the same error, 'RuntimeError: CUDA error: device-side assert triggered'. It is a PyTorch bug. The environment I used was torch 1.8.1 + CUDA 11.1; after I updated to torch 1.8.2 + CUDA 11.1, the bug disappeared.
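
To check whether an installation predates the version reported to work above, the version string can be compared numerically. This is a minimal sketch with hypothetical helper names; it assumes strings shaped like `1.8.1+cu111`, which is the usual form of `torch.__version__`, and in practice that attribute would be passed in.

```python
import re


def parse_version(version):
    """Turn a string such as '1.8.1+cu111' into a comparable tuple (1, 8, 1)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_older_than(version, minimum):
    """Return True if `version` is strictly older than `minimum`."""
    return parse_version(version) < parse_version(minimum)


print(is_older_than("1.8.1+cu111", "1.8.2"))  # True
print(is_older_than("1.8.2+cu111", "1.8.2"))  # False
```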