Maelic / SGG-Benchmark

A New Benchmark for Scene Graph Generation, targeting real-world applications

Issue about the Examples of the Training Command #14

Open | jiuxuanth opened this issue 1 week ago

jiuxuanth commented 1 week ago

When I used the command

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --master_port 10025 --nproc_per_node=2 tools/relation_train_net.py --task predcls --save-best --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml" MODEL.ROI_RELATION_HEAD.PREDICTOR MotifPredictor SOLVER.IMS_PER_BATCH 12 TEST.IMS_PER_BATCH 2 DTYPE "float16" SOLVER.MAX_EPOCH 20 MODEL.PRETRAINED_DETECTOR_CKPT ./checkpoints/pretrained_faster_rcnn/model_final.pth OUTPUT_DIR ./checkpoints/motif-precls-exmp

the terminal showed:

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port 10025 --nproc_per_node=1 tools/relation_train_net.py --task predcls --save-best --config-file "configs/VG150/e2e_relation_X_101_32_8_FPN_1x.yaml" MODEL.ROI_RELATION_HEAD.PREDICTOR MotifPredictor SOLVER.IMS_PER_BATCH 12 TEST.IMS_PER_BATCH 2 DTYPE "float16" SOLVER.MAX_EPOCH 20 MODEL.PRETRAINED_DETECTOR_CKPT ./checkpoints/pretrained_faster_rcnn/model_final.pth OUTPUT_DIR ./checkpoints/motif-precls-exmp
/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
usage: relation_train_net.py [-h] [--config-file FILE] [--dataset FILE] [--local_rank LOCAL_RANK] [--skip-test] [--use-wandb] [--verbose] [--task {predcls,sgcls,sgdet}]
                             [--name NAME] [--save-best]
                             ...
relation_train_net.py: error: unrecognized arguments: --local-rank=0
[2024-06-19 14:31:17,948] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 867919) of binary: /home/jiuth/anaconda3/envs/scene_graph_benchmark/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/relation_train_net.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-19_14:31:17
  host      : DESKTOP-8F5VA63.
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 867919)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
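(Side note for anyone hitting the same `unrecognized arguments: --local-rank=0` error: the deprecation warning above suggests reading the rank from the environment rather than from a command-line flag. A minimal sketch of that change, assuming the training script currently declares a `--local_rank` argparse argument; the exact argument setup in relation_train_net.py may differ:)

```python
# Sketch only: fall back to the LOCAL_RANK environment variable (set by
# torchrun) instead of relying on the --local-rank flag injected by the
# deprecated torch.distributed.launch. Names here mirror a typical argparse
# setup and are not copied from relation_train_net.py.
import os
import argparse

parser = argparse.ArgumentParser(description="Relation detection training")
parser.add_argument(
    "--local_rank",
    type=int,
    default=int(os.environ.get("LOCAL_RANK", 0)),
)
args = parser.parse_args()
```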

So I used the command

CUDA_VISIBLE_DEVICES=0 torchrun --master_port 10025 --nproc_per_node=1 tools/relation_train_net.py --task predcls --save-best --config-file "configs/VG150/e2e_relation_X_101_32_8_FPN_1x.yaml" MODEL.ROI_RELATION_HEAD.PREDICTOR MotifPredictor SOLVER.IMS_PER_BATCH 12 TEST.IMS_PER_BATCH 2 DTYPE "float16" SOLVER.MAX_EPOCH 20 MODEL.PRETRAINED_DETECTOR_CKPT ./checkpoints/pretrained_faster_rcnn/model_final.pth OUTPUT_DIR ./checkpoints/motif-precls-exmp

instead, but it showed:

CUDA_VISIBLE_DEVICES=0 torchrun --master_port 10025 --nproc_per_node=1 tools/relation_train_net.py --task predcls --save-best --config-file "configs/VG150/e2e_relation_X_101_32_8_FPN_1x.yaml" MODEL.ROI_RELATION_HEAD.PREDICTOR MotifPredictor SOLVER.IMS_PER_BATCH 12 TEST.IMS_PER_BATCH 2 DTYPE "float16" SOLVER.MAX_EPOCH 20 MODEL.PRETRAINED_DETECTOR_CKPT ./checkpoints/pretrained_faster_rcnn/model_final.pth OUTPUT_DIR ./checkpoints/motif-precls-exmp
2024-06-19 14:33:32.746 | INFO     | sgg_benchmark.utils.logger:setup_logger:30 - Using loguru logger with level: INFO
2024-06-19 14:33:32.748 | INFO     | __main__:main:423 - Using 1 GPUs
2024-06-19 14:33:32.748 | INFO     | sgg_benchmark.utils.logger:logger_step:15 - #################### Step 1: Collecting environment info... ####################
2024-06-19 14:33:35.173 | INFO     | __main__:main:436 - Saving config into: ./checkpoints/motif-precls-exmp/config.yml
2024-06-19 14:33:35.178 | INFO     | sgg_benchmark.utils.logger:logger_step:15 - #################### Step 2: Building model... ####################
2024-06-19 14:33:37.077 | INFO     | sgg_benchmark.data.build:get_dataset_statistics:30 - ----------------------------------------------------------------------------------------------------
2024-06-19 14:33:37.078 | INFO     | sgg_benchmark.data.build:get_dataset_statistics:31 - get dataset statistics...
2024-06-19 14:33:37.078 | INFO     | sgg_benchmark.data.build:get_dataset_statistics:46 - Unable to load data statistics from: ./checkpoints/motif-precls-exmp/VG150_train_statistics.cache
Traceback (most recent call last):
  File "/home/jiuth/SGG-Benchmark/tools/relation_train_net.py", line 469, in <module>
    main()
  File "/home/jiuth/SGG-Benchmark/tools/relation_train_net.py", line 448, in main
    model, best_checkpoint = train(
                             ^^^^^^
  File "/home/jiuth/SGG-Benchmark/tools/relation_train_net.py", line 91, in train
    model = build_detection_model(cfg) 
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/modeling/detector/detectors.py", line 11, in build_detection_model
    return meta_arch(cfg)
           ^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/modeling/detector/generalized_rcnn.py", line 31, in __init__
    self.roi_heads = build_roi_heads(cfg, self.backbone.out_channels)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/modeling/roi_heads/roi_heads.py", line 69, in build_roi_heads
    roi_heads.append(("relation", build_roi_relation_head(cfg, in_channels)))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/modeling/roi_heads/relation_head/relation_head.py", line 117, in build_roi_relation_head
    return ROIRelationHead(cfg, in_channels)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/modeling/roi_heads/relation_head/relation_head.py", line 35, in __init__
    statistics = get_dataset_statistics(cfg)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/data/build.py", line 53, in get_dataset_statistics
    dataset = factory(**args)
              ^^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/data/datasets/visual_genome.py", line 74, in __init__
    self.filenames = [self.filenames[i] for i in np.where(self.split_mask)[0]]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/data/datasets/visual_genome.py", line 74, in <listcomp>
    self.filenames = [self.filenames[i] for i in np.where(self.split_mask)[0]]
                      ~~~~~~~~~~~~~~^^^
IndexError: list index out of range
[2024-06-19 14:33:40,174] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 868457) of binary: /home/jiuth/anaconda3/envs/scene_graph_benchmark/bin/python
Traceback (most recent call last):
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.1', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/relation_train_net.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-19_14:33:40
  host      : DESKTOP-8F5VA63.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 868457)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Another question: which version of CLIP does the code use, the OpenAI CLIP or the clip package from pip?

Maelic commented 1 week ago

Hi, please have a look at this issue; this seems to be related to the download of the dataset. You need to download and unzip the images into one folder and then modify the paths_catalog.py file accordingly.
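For reference, a rough sketch of what a VG150 entry in paths_catalog.py usually looks like; the exact key names and file names depend on the repo version and your local layout, so adapt them rather than copying this verbatim:

```python
# Illustrative excerpt of a maskrcnn-benchmark-style paths_catalog.py.
# Point DATA_DIR and the paths below to wherever the VG150 images and
# annotation files were actually unzipped.
class DatasetCatalog:
    DATA_DIR = "/home/user/datasets"          # hypothetical root folder
    DATASETS = {
        "VG150": {
            "img_dir": "vg/VG_100K",                      # all images in one folder
            "roidb_file": "vg/VG-SGG-with-attri.h5",      # boxes and relations
            "dict_file": "vg/VG-SGG-dicts-with-attri.json",
            "image_file": "vg/image_data.json",           # per-image metadata
        },
    }
```

The IndexError in visual_genome.py typically means the image list and the split mask disagree in size, which happens when the image folder or metadata file the catalog points to is missing or incomplete.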

Regarding CLIP, I am using the ultralytics version to be compatible with YOLO-World, so the install command is: pip install git+https://github.com/ultralytics/CLIP.git
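A quick way to sanity-check the install (assuming the ultralytics fork keeps the same interface as the original OpenAI CLIP package it is derived from):

```python
# Verify that the installed clip package is importable and usable; the
# ultralytics fork exposes the OpenAI-style API (clip.load, clip.tokenize).
import clip
import torch

model, preprocess = clip.load("ViT-B/32", device="cpu")
tokens = clip.tokenize(["a dog on a skateboard"])
with torch.no_grad():
    text_features = model.encode_text(tokens)
print(text_features.shape)  # e.g. torch.Size([1, 512])
```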