KaihuaTang / Scene-Graph-Benchmark.pytorch

A new codebase for popular Scene Graph Generation methods (2020). Visualization & Scene Graph Extraction on custom images/datasets are provided. It's also a PyTorch implementation of paper “Unbiased Scene Graph Generation from Biased Training CVPR 2020”
MIT License
1.03k stars 228 forks

Question about apex issues #146

Open DH-HAN opened 2 years ago

DH-HAN commented 2 years ago

❓ Questions and Help

I am trying to train a model by following your training example 2 (SGCls, Causal, TDE, SUM Fusion, MOTIFS Model).

Because I am using a single GPU, I changed the settings as follows:

CUDA_VISIBLE_DEVICES=0 nproc_per_node=1 TEST.IMS_PER_BATCH 1

But while running relation_train_net.py, the following error occurred:

  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/amp/utils.py", line 97, in cached_cast
    if cached_x.grad_fn.next_functions[1][0].variable is not x:
IndexError: tuple index out of range

I think this may be a problem with apex. Have you seen an error like this before? My full error message is below.

2021-11-02 23:12:28,519 maskrcnn_benchmark INFO: Start training
Traceback (most recent call last):
  File "tools/relation_train_net.py", line 379, in <module>
    main()
  File "tools/relation_train_net.py", line 372, in main
    model = train(cfg, args.local_rank, args.distributed, logger)
  File "tools/relation_train_net.py", line 147, in train
    loss_dict = model(images, targets)
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 52, in forward
    x, result, detector_losses = self.roi_heads(features, proposals, targets, logger)
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/roi_heads.py", line 69, in forward
    x, detections, loss_relation = self.relation(features, detections, targets, logger)
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/relation_head.py", line 80, in forward
    refine_logits, relation_logits, add_losses = self.predictor(proposals, rel_pair_idxs, rel_labels, rel_binarys, roi_features, union_features, logger)
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/roi_relation_predictors.py", line 574, in forward
    post_ctx_rep, pair_pred, pair_bbox, pair_obj_probs, binary_preds, obj_dist_prob, edge_rep, obj_dist_list = self.pair_feature_generate(roi_features, proposals, rel_pair_idxs, num_objs, obj_boxs, logger)
  File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/roi_relation_predictors.py", line 521, in pair_feature_generate
    obj_dists, obj_preds, edge_ctx, binary_preds = self.context_layer(roi_features, proposals, rel_pair_idxs, logger, ctx_average=ctx_average)
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/model_motifs.py", line 382, in forward
    obj_dists, obj_preds, obj_ctx, perm, inv_perm, ls_transposed = self.obj_ctx(obj_pre_rep, proposals, obj_labels, boxes_per_cls, ctx_average=ctx_average)
  File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/model_motifs.py", line 324, in obj_ctx
    boxes_for_nms=boxes_per_cls[perm] if boxes_per_cls is not None else None,
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/model_motifs.py", line 163, in forward
    previous_memory, dropout_mask=dropout_mask)
  File "/home/han/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/model_motifs.py", line 94, in lstm_equations
    projected_input = self.input_linearity(timestep_input)
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/amp/wrap.py", line 21, in wrapper
    args[i] = utils.cached_cast(cast_fn, args[i], handle.cache)
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/amp/utils.py", line 97, in cached_cast
    if cached_x.grad_fn.next_functions[1][0].variable is not x:
IndexError: tuple index out of range
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12140) of binary: /home/han/anaconda3/envs/scene_graph_benchmark/bin/python
Traceback (most recent call last):
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/han/anaconda3/envs/scene_graph_benchmark/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
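The last frame before the IndexError shows what fails: `cached_cast` reads `cached_x.grad_fn.next_functions[1]`, so the error means that tuple has fewer than two entries in this environment. The following is a minimal sketch of a length check that makes this visible. It uses stand-in objects rather than real autograd nodes; `describe_next_functions` and the `SimpleNamespace` stand-ins are hypothetical names for illustration and are not part of apex or PyTorch.

```python
from types import SimpleNamespace


def describe_next_functions(grad_fn, index=1):
    """Report whether grad_fn.next_functions has an entry at `index`.

    `grad_fn` only needs a `next_functions` tuple attribute, which is the
    attribute indexed by the failing line in the traceback.
    """
    entries = grad_fn.next_functions
    if index >= len(entries):
        return f"next_functions has {len(entries)} entries; index {index} is out of range"
    return f"next_functions[{index}] is present"


# Stand-ins (assumptions for illustration only): one node whose tuple has a
# single entry and one node whose tuple has two entries.
one_entry = SimpleNamespace(next_functions=(("AccumulateGrad", 0),))
two_entries = SimpleNamespace(next_functions=((None, 0), ("AccumulateGrad", 0)))

print(describe_next_functions(one_entry))
print(describe_next_functions(two_entries))
```

The same check could be applied to the `grad_fn` of `cached_x` in a debugger at the failing frame to see how many entries the tuple actually has.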

Maelic commented 2 years ago

Did you solve this? I am facing the same issue using only one GPU (RTX 3080)

DrugD commented 2 years ago

Same here: IndexError: tuple index out of range

This is the command I ran:

export CUDA_VISIBLE_DEVICES=1
python tools/relation_train_net.py --config-file "configs/e2e_relation_VGG16_1x.yaml" MODEL.ROI_RELATION_HEAD.USE_GT_BOX True MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL False MODEL.ROI_RELATION_HEAD.PREDICTOR CausalAnalysisPredictor MODEL.ROI_RELATION_HEAD.CAUSAL.EFFECT_TYPE none MODEL.ROI_RELATION_HEAD.CAUSAL.FUSION_TYPE sum MODEL.ROI_RELATION_HEAD.CAUSAL.CONTEXT_LAYER vctree SOLVER.IMS_PER_BATCH 4 TEST.IMS_PER_BATCH 2 DTYPE "float16" SOLVER.MAX_ITER 50000 SOLVER.VAL_PERIOD 3000 SOLVER.CHECKPOINT_PERIOD 3000 GLOVE_DIR Scene-Graph-Benchmark.pytorch/glove MODEL.PRETRAINED_DETECTOR_CKPT Scene-Graph-Benchmark.pytorch/checkpoints/faster_rcnn2/model_0024000.pth OUTPUT_DIR Scene-Graph-Benchmark.pytorch/checkpoints/causal_motif_sgcls

Can anyone take a look at what is going on?

qncsn2016 commented 1 year ago

https://github.com/NVIDIA/apex/issues/694#issuecomment-918833904

I solved it by following the link above.

A91A981E commented 1 year ago

I solved this issue by removing the parameter DTYPE "float16" from the training command. It seems that something unexpected happens in APEX?
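A minimal sketch of that workaround, assuming the overrides are passed as alternating KEY VALUE tokens as in the commands quoted in this thread. `drop_override` is a hypothetical helper written for illustration and is not part of the repository; which precision the training then falls back to is not stated in the thread.

```python
def drop_override(opts, key):
    """Return a copy of a KEY VALUE override list with `key` and its value removed."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    kept = []
    # Walk the list pairwise: even positions are keys, odd positions are values.
    for k, v in zip(opts[0::2], opts[1::2]):
        if k != key:
            kept.extend([k, v])
    return kept


# A shortened override list taken from the command quoted earlier in the thread.
opts = [
    "SOLVER.IMS_PER_BATCH", "4",
    "TEST.IMS_PER_BATCH", "2",
    "DTYPE", "float16",
    "SOLVER.MAX_ITER", "50000",
]

print(" ".join(drop_override(opts, "DTYPE")))
```

The remaining tokens can then be appended to the `python tools/relation_train_net.py --config-file ...` command unchanged.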