Open DH-HAN opened 2 years ago
Did you solve this? I am facing the same issue using only one GPU (RTX 3080)
Same here: IndexError: tuple index out of range
The command I ran is:
export CUDA_VISIBLE_DEVICES=1
python tools/relation_train_net.py --config-file "configs/e2e_relation_VGG16_1x.yaml" MODEL.ROI_RELATION_HEAD.USE_GT_BOX True MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL False MODEL.ROI_RELATION_HEAD.PREDICTOR CausalAnalysisPredictor MODEL.ROI_RELATION_HEAD.CAUSAL.EFFECT_TYPE none MODEL.ROI_RELATION_HEAD.CAUSAL.FUSION_TYPE sum MODEL.ROI_RELATION_HEAD.CAUSAL.CONTEXT_LAYER vctree SOLVER.IMS_PER_BATCH 4 TEST.IMS_PER_BATCH 2 DTYPE "float16" SOLVER.MAX_ITER 50000 SOLVER.VAL_PERIOD 3000 SOLVER.CHECKPOINT_PERIOD 3000 GLOVE_DIR Scene-Graph-Benchmark.pytorch/glove MODEL.PRETRAINED_DETECTOR_CKPT Scene-Graph-Benchmark.pytorch/checkpoints/faster_rcnn2/model_0024000.pth OUTPUT_DIR Scene-Graph-Benchmark.pytorch/checkpoints/causal_motif_sgcls
Can anyone take a look at what's going on?
I solved this issue by removing the parameter DTYPE "float16" from the training command. It seems that something unexpected happens in APEX?
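For anyone who builds the training command in a script, the workaround above amounts to dropping one KEY VALUE pair from the override list that follows `--config-file`. Here is a minimal sketch, assuming the overrides are passed as alternating key/value tokens as in the command posted earlier in this thread; `drop_override` is a hypothetical helper written for illustration, not part of the repository:

```python
def drop_override(opts, key):
    """Return a copy of a flat [KEY, VALUE, KEY, VALUE, ...] list without `key`.

    Hypothetical helper for illustration only; it is not part of
    Scene-Graph-Benchmark.pytorch.
    """
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    kept = []
    # Walk the list two tokens at a time so a value is never mistaken for a key.
    for k, v in zip(opts[0::2], opts[1::2]):
        if k != key:
            kept.extend([k, v])
    return kept


# A shortened version of the override list from the command in this thread.
opts = ["SOLVER.IMS_PER_BATCH", "4", "TEST.IMS_PER_BATCH", "2", "DTYPE", "float16"]
print(" ".join(drop_override(opts, "DTYPE")))
# prints: SOLVER.IMS_PER_BATCH 4 TEST.IMS_PER_BATCH 2
```

With DTYPE left at its default, the run presumably no longer takes the apex half-precision cast path where the traceback ends, at the cost of whatever memory savings float16 was giving.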
❓ Questions and Help
I am trying to train a model following your training example 2 (SGCls, Causal, TDE, SUM Fusion, MOTIFS Model).
Because I'm using one GPU, I set up my own environment like this:
CUDA_VISIBLE_DEVICES=0 nproc_per_node=1 TEST.IMS_PER_BATCH 1
But while running relation_train_net.py, this error came out:
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/amp/utils.py", line 97, in cached_cast
if cached_x.grad_fn.next_functions[1][0].variable is not x:
IndexError: tuple index out of range
I think it may be a problem with apex, but have you ever seen a problem like this? I've added my whole error message below.
2021-11-02 23:12:28,519 maskrcnn_benchmark INFO: Start training
Traceback (most recent call last):
File "tools/relation_train_net.py", line 379, in <module>
main()
File "tools/relation_train_net.py", line 372, in main
model = train(cfg, args.local_rank, args.distributed, logger)
File "tools/relation_train_net.py", line 147, in train
loss_dict = model(images, targets)
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 52, in forward
x, result, detector_losses = self.roi_heads(features, proposals, targets, logger)
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/roi_heads.py", line 69, in forward
x, detections, loss_relation = self.relation(features, detections, targets, logger)
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/relation_head.py", line 80, in forward
refine_logits, relation_logits, add_losses = self.predictor(proposals, rel_pair_idxs, rel_labels, rel_binarys, roi_features, union_features, logger)
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/roi_relation_predictors.py", line 574, in forward
post_ctx_rep, pair_pred, pair_bbox, pair_obj_probs, binary_preds, obj_dist_prob, edge_rep, obj_dist_list = self.pair_feature_generate(roi_features, proposals, rel_pair_idxs, num_objs, obj_boxs, logger)
File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/roi_relation_predictors.py", line 521, in pair_feature_generate
obj_dists, obj_preds, edge_ctx, binary_preds = self.context_layer(roi_features, proposals, rel_pair_idxs, logger, ctx_average=ctx_average)
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/model_motifs.py", line 382, in forward
obj_dists, obj_preds, obj_ctx, perm, inv_perm, ls_transposed = self.obj_ctx(obj_pre_rep, proposals, obj_labels, boxes_per_cls, ctx_average=ctx_average)
File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/model_motifs.py", line 324, in obj_ctx
boxes_for_nms=boxes_per_cls[perm] if boxes_per_cls is not None else None,
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/model_motifs.py", line 163, in forward
previous_memory, dropout_mask=dropout_mask)
File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/model_motifs.py", line 94, in lstm_equations
projected_input = self.input_linearity(timestep_input)
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/amp/wrap.py", line 21, in wrapper
args[i] = utils.cached_cast(cast_fn, args[i], handle.cache)
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/amp/utils.py", line 97, in cached_cast
if cached_x.grad_fn.next_functions[1][0].variable is not x:
IndexError: tuple index out of range
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12140) of binary: /home/han/anaconda3/envs/scene_graph_benchmark/bin/python
Traceback (most recent call last):
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
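The last frame of the worker traceback above fails on `cached_x.grad_fn.next_functions[1][0]`, which raises this error when `next_functions` has fewer than two entries. The sketch below reproduces that failure mode with plain tuples standing in for the autograd data. `second_input_node` is a hypothetical helper, the string is a placeholder for a real graph node, and the guess that the tuple has only one entry in this environment is an assumption that nothing in this thread confirms:

```python
def second_input_node(next_functions):
    """Mimic the lookup `next_functions[1][0]` from the failing line, but
    return None instead of raising when there is no second entry.

    Hypothetical helper for illustration; it is not apex code.
    """
    if len(next_functions) < 2:
        return None
    return next_functions[1][0]


# Stand-ins for a tuple of (node, index) pairs. The string is a placeholder,
# not a real autograd object.
two_inputs = ((None, 0), ("accumulate_grad_node", 0))
one_input = (("accumulate_grad_node", 0),)

print(second_input_node(two_inputs))  # accumulate_grad_node
print(second_input_node(one_input))   # None

# The unguarded lookup on the shorter tuple fails the same way the traceback does.
try:
    one_input[1][0]
except IndexError as exc:
    print(exc)  # tuple index out of range
```

If that assumption holds, it would explain why the error only appears with DTYPE "float16": the cast wrapper that performs this lookup is the one reached through `apex/amp/wrap.py` in the traceback.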