Maelic / SGG-Benchmark

A New Benchmark for Scene Graph Generation, targeting real-world applications

Issue about the Examples of the Training Command #14

Open | jiuxuanth opened this issue 1 week ago

jiuxuanth commented 1 week ago

When I used the command

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --master_port 10025 --nproc_per_node=2 tools/relation_train_net.py --task predcls --save-best --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml" MODEL.ROI_RELATION_HEAD.PREDICTOR MotifPredictor SOLVER.IMS_PER_BATCH 12 TEST.IMS_PER_BATCH 2 DTYPE "float16" SOLVER.MAX_EPOCH 20 MODEL.PRETRAINED_DETECTOR_CKPT ./checkpoints/pretrained_faster_rcnn/model_final.pth OUTPUT_DIR ./checkpoints/motif-precls-exmp

the terminal showed:

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port 10025 --nproc_per_node=1 tools/relation_train_net.py --task predcls --save-best --config-file "configs/VG150/e2e_relation_X_101_32_8_FPN_1x.yaml" MODEL.ROI_RELATION_HEAD.PREDICTOR MotifPredictor SOLVER.IMS_PER_BATCH 12 TEST.IMS_PER_BATCH 2 DTYPE "float16" SOLVER.MAX_EPOCH 20 MODEL.PRETRAINED_DETECTOR_CKPT ./checkpoints/pretrained_faster_rcnn/model_final.pth OUTPUT_DIR ./checkpoints/motif-precls-exmp
/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
usage: relation_train_net.py [-h] [--config-file FILE] [--dataset FILE] [--local_rank LOCAL_RANK] [--skip-test] [--use-wandb] [--verbose] [--task {predcls,sgcls,sgdet}]
                             [--name NAME] [--save-best]
                             ...
relation_train_net.py: error: unrecognized arguments: --local-rank=0
[2024-06-19 14:31:17,948] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 867919) of binary: /home/jiuth/anaconda3/envs/scene_graph_benchmark/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/relation_train_net.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-19_14:31:17
  host      : DESKTOP-8F5VA63.
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 867919)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
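(Side note for anyone hitting the same `unrecognized arguments: --local-rank=0` error: the deprecation warning above suggests reading the rank from the environment rather than from a command-line flag. A minimal sketch of that change, assuming the training script currently declares a `--local_rank` argparse argument; the exact argument setup in relation_train_net.py may differ:)

```python
# Sketch only: fall back to the LOCAL_RANK environment variable (set by
# torchrun) instead of relying on the --local-rank flag injected by the
# deprecated torch.distributed.launch. Names here mirror a typical argparse
# setup and are not copied from relation_train_net.py.
import os
import argparse

parser = argparse.ArgumentParser(description="Relation detection training")
parser.add_argument(
    "--local_rank",
    type=int,
    default=int(os.environ.get("LOCAL_RANK", 0)),
)
args = parser.parse_args()
```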

So I used the command

CUDA_VISIBLE_DEVICES=0 torchrun --master_port 10025 --nproc_per_node=1 tools/relation_train_net.py --task predcls --save-best --config-file "configs/VG150/e2e_relation_X_101_32_8_FPN_1x.yaml" MODEL.ROI_RELATION_HEAD.PREDICTOR MotifPredictor SOLVER.IMS_PER_BATCH 12 TEST.IMS_PER_BATCH 2 DTYPE "float16" SOLVER.MAX_EPOCH 20 MODEL.PRETRAINED_DETECTOR_CKPT ./checkpoints/pretrained_faster_rcnn/model_final.pth OUTPUT_DIR ./checkpoints/motif-precls-exmp

instead, but it showed:

CUDA_VISIBLE_DEVICES=0 torchrun --master_port 10025 --nproc_per_node=1 tools/relation_train_net.py --task predcls --save-best --config-file "configs/VG150/e2e_relation_X_101_32_8_FPN_1x.yaml" MODEL.ROI_RELATION_HEAD.PREDICTOR MotifPredictor SOLVER.IMS_PER_BATCH 12 TEST.IMS_PER_BATCH 2 DTYPE "float16" SOLVER.MAX_EPOCH 20 MODEL.PRETRAINED_DETECTOR_CKPT ./checkpoints/pretrained_faster_rcnn/model_final.pth OUTPUT_DIR ./checkpoints/motif-precls-exmp
2024-06-19 14:33:32.746 | INFO     | sgg_benchmark.utils.logger:setup_logger:30 - Using loguru logger with level: INFO
2024-06-19 14:33:32.748 | INFO     | __main__:main:423 - Using 1 GPUs
2024-06-19 14:33:32.748 | INFO     | sgg_benchmark.utils.logger:logger_step:15 - #################### Step 1: Collecting environment info... ####################
2024-06-19 14:33:35.173 | INFO     | __main__:main:436 - Saving config into: ./checkpoints/motif-precls-exmp/config.yml
2024-06-19 14:33:35.178 | INFO     | sgg_benchmark.utils.logger:logger_step:15 - #################### Step 2: Building model... ####################
2024-06-19 14:33:37.077 | INFO     | sgg_benchmark.data.build:get_dataset_statistics:30 - ----------------------------------------------------------------------------------------------------
2024-06-19 14:33:37.078 | INFO     | sgg_benchmark.data.build:get_dataset_statistics:31 - get dataset statistics...
2024-06-19 14:33:37.078 | INFO     | sgg_benchmark.data.build:get_dataset_statistics:46 - Unable to load data statistics from: ./checkpoints/motif-precls-exmp/VG150_train_statistics.cache
Traceback (most recent call last):
  File "/home/jiuth/SGG-Benchmark/tools/relation_train_net.py", line 469, in <module>
    main()
  File "/home/jiuth/SGG-Benchmark/tools/relation_train_net.py", line 448, in main
    model, best_checkpoint = train(
                             ^^^^^^
  File "/home/jiuth/SGG-Benchmark/tools/relation_train_net.py", line 91, in train
    model = build_detection_model(cfg) 
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/modeling/detector/detectors.py", line 11, in build_detection_model
    return meta_arch(cfg)
           ^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/modeling/detector/generalized_rcnn.py", line 31, in __init__
    self.roi_heads = build_roi_heads(cfg, self.backbone.out_channels)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/modeling/roi_heads/roi_heads.py", line 69, in build_roi_heads
    roi_heads.append(("relation", build_roi_relation_head(cfg, in_channels)))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/modeling/roi_heads/relation_head/relation_head.py", line 117, in build_roi_relation_head
    return ROIRelationHead(cfg, in_channels)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/modeling/roi_heads/relation_head/relation_head.py", line 35, in __init__
    statistics = get_dataset_statistics(cfg)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/data/build.py", line 53, in get_dataset_statistics
    dataset = factory(**args)
              ^^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/data/datasets/visual_genome.py", line 74, in __init__
    self.filenames = [self.filenames[i] for i in np.where(self.split_mask)[0]]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/SGG-Benchmark/sgg_benchmark/data/datasets/visual_genome.py", line 74, in <listcomp>
    self.filenames = [self.filenames[i] for i in np.where(self.split_mask)[0]]
                      ~~~~~~~~~~~~~~^^^
IndexError: list index out of range
[2024-06-19 14:33:40,174] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 868457) of binary: /home/jiuth/anaconda3/envs/scene_graph_benchmark/bin/python
Traceback (most recent call last):
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.1', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiuth/anaconda3/envs/scene_graph_benchmark/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/relation_train_net.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-19_14:33:40
  host      : DESKTOP-8F5VA63.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 868457)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Another question: which version of CLIP does the code use, the OpenAI CLIP or the clip package from pip?

Maelic commented 1 week ago

Hi, please have a look at this issue; this seems to be related to the download of the dataset. You need to download and unzip the images into one folder and then modify the paths_catalog.py file accordingly.
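For reference, a rough sketch of what a VG150 entry in paths_catalog.py usually looks like; the exact key names and file names depend on the repo version and your local layout, so adapt them rather than copying this verbatim:

```python
# Illustrative excerpt of a maskrcnn-benchmark-style paths_catalog.py.
# Point DATA_DIR and the paths below to wherever the VG150 images and
# annotation files were actually unzipped.
class DatasetCatalog:
    DATA_DIR = "/home/user/datasets"          # hypothetical root folder
    DATASETS = {
        "VG150": {
            "img_dir": "vg/VG_100K",                      # all images in one folder
            "roidb_file": "vg/VG-SGG-with-attri.h5",      # boxes and relations
            "dict_file": "vg/VG-SGG-dicts-with-attri.json",
            "image_file": "vg/image_data.json",           # per-image metadata
        },
    }
```

The IndexError in visual_genome.py typically means the image list and the split mask disagree in size, which happens when the image folder or metadata file the catalog points to is missing or incomplete.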

Regarding CLIP, I am using the ultralytics version to be compatible with YOLO-World, so the install command is: pip install git+https://github.com/ultralytics/CLIP.git
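A quick way to sanity-check the install (assuming the ultralytics fork keeps the same interface as the original OpenAI CLIP package it is derived from):

```python
# Verify that the installed clip package is importable and usable; the
# ultralytics fork exposes the OpenAI-style API (clip.load, clip.tokenize).
import clip
import torch

model, preprocess = clip.load("ViT-B/32", device="cpu")
tokens = clip.tokenize(["a dog on a skateboard"])
with torch.no_grad():
    text_features = model.encode_text(tokens)
print(text_features.shape)  # e.g. torch.Size([1, 512])
```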