AssertionError: Invalid device id

WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

GNU General Public License v3.0

# Download yolov7-pose code
!git clone https://github.com/WongKinYiu/yolov7.git -b pose
%cd yolov7
%pip install -r requirements.txt  # install

import sys
import torch
print(f"Python version: {sys.version}, {sys.version_info} ")
print(f"Pytorch version: {torch.__version__} ")

!python -m torch.distributed.launch --nproc_per_node 8 --master_port 9527 train.py --data data/coco_kpts.yaml --cfg cfg/yolov7-w6-pose.yaml --weights /content/drive/MyDrive/cv_tennis/YOLOV7-pose/weights/yolov7-w6-person.pt --batch-size 128 --img 960 --kpt-label --sync-bn --device 0,1,2,3,4,5,6,7 --name yolov7-w6-pose --hyp data/hyp.pose.yaml

/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
github: fatal: ambiguous argument 'pose..origin/master': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Command 'git rev-list pose..origin/master --count' returned non-zero exit status 128.
Traceback (most recent call last):
  File "train.py", line 541, in <module>
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/content/yolov7/utils/torch_utils.py", line 80, in select_device
    p = torch.cuda.get_device_properties(i)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 374, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id
(The same traceback is printed once per worker process; the copies were interleaved in the original output.)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26995 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26996 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26997 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 27001 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 26998) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-02-05_10:13:47
  host      : 71408a468f93
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 26999)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-02-05_10:13:47
  host      : 71408a468f93
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 27000)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-02-05_10:13:47
  host      : 71408a468f93
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 27002)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-05_10:13:47
  host      : 71408a468f93
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 26998)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
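
The traceback above ends in torch.cuda.get_device_properties(i), which raises "Invalid device id" when the index is not below the number of visible CUDA devices. A minimal sketch of a pre-flight check follows. The helper name invalid_device_ids is hypothetical (it is not part of yolov7); it takes the --device string from the command and the count that torch.cuda.device_count() reports:

```python
from typing import List


def invalid_device_ids(device_arg: str, available: int) -> List[int]:
    """Return the ids in a comma-separated --device string that are outside 0..available-1.

    Hypothetical helper for illustration; not part of yolov7 or PyTorch.
    """
    ids = [int(part) for part in device_arg.split(",") if part.strip()]
    return [i for i in ids if i < 0 or i >= available]


if __name__ == "__main__":
    try:
        import torch  # only needed to query the real device count
        available = torch.cuda.device_count()
    except ImportError:
        available = 0  # assumption: treat a missing torch install as "no GPUs"
    bad = invalid_device_ids("0,1,2,3,4,5,6,7", available)
    print(f"visible CUDA devices: {available}, invalid ids: {bad}")
```

With a single visible GPU, invalid_device_ids("0,1,2,3,4,5,6,7", 1) returns [1, 2, 3, 4, 5, 6, 7], so any non-empty result means the --device list asks for more GPUs than the machine exposes.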

Hi!

I tried to solve the problem with the first error:

   warnings.warn(
github: fatal: ambiguous argument 'pose..origin/master': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Command 'git rev-list pose..origin/master --count' returned non-zero exit status 128.

I changed the command that clones the Git repository from:

!git clone https://github.com/WongKinYiu/yolov7.git -b pose

into:

!git clone https://github.com/WongKinYiu/yolov7.git
!git checkout -b pose yolov7/pose

After that I get a different error: train.py: error: unrecognized arguments: --kpt-label.

How can this be fixed?

Error message:

/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
usage: train.py
       [-h]
       [--weights WEIGHTS]
       [--cfg CFG]
       [--data DATA]
       [--hyp HYP]
       [--epochs EPOCHS]
       [--batch-size BATCH_SIZE]
       [--img-size IMG_SIZE [IMG_SIZE ...]]
       [--rect]
       [--resume [RESUME]]
       [--nosave]
       [--notest]
       [--noautoanchor]
       [--evolve]
       [--bucket BUCKET]
       [--cache-images]
       [--image-weights]
       [--device DEVICE]
       [--multi-scale]
       [--single-cls]
       [--adam]
       [--sync-bn]
       [--local_rank LOCAL_RANK]
       [--workers WORKERS]
       [--project PROJECT]
       [--entity ENTITY]
       [--name NAME]
       [--exist-ok]
       [--quad]
       [--linear-lr]
       [--label-smoothing LABEL_SMOOTHING]
       [--upload_dataset]
       [--bbox_interval BBOX_INTERVAL]
       [--save_period SAVE_PERIOD]
       [--artifact_alias ARTIFACT_ALIAS]
       [--freeze FREEZE [FREEZE ...]]
       [--v5-metric]
train.py: error: unrecognized arguments: --kpt-label
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 9461) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 9462)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 2 (local_rank: 2)
  exitcode  : 2 (pid: 9463)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 3 (local_rank: 3)
  exitcode  : 2 (pid: 9464)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 4 (local_rank: 4)
  exitcode  : 2 (pid: 9465)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 5 (local_rank: 5)
  exitcode  : 2 (pid: 9466)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 6 (local_rank: 6)
  exitcode  : 2 (pid: 9467)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 7 (local_rank: 7)
  exitcode  : 2 (pid: 9468)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 9461)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
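
The usage text above does not list --kpt-label, which suggests that the train.py being run is not the pose-branch version; this is an inference from the output and is not confirmed in the thread. One way to check is to scan the script for the flag before launching. The sketch below assumes a hypothetical helper named script_defines_flag, and it only detects flags passed as the first string literal of an add_argument call:

```python
import re
from pathlib import Path


def script_defines_flag(source: str, flag: str) -> bool:
    """Return True if `source` has an add_argument call whose first argument is `flag`.

    Hypothetical helper for illustration; it does a plain text search and does
    not import or execute the script.
    """
    pattern = r"add_argument\(\s*['\"]" + re.escape(flag) + r"['\"]"
    return re.search(pattern, source) is not None


if __name__ == "__main__":
    path = Path("train.py")  # assumption: run from inside the cloned repository
    if path.exists():
        print("--kpt-label defined:", script_defines_flag(path.read_text(), "--kpt-label"))
    else:
        print("train.py not found in the current directory")
```

If this prints False for the checked-out train.py, the working tree is probably not on the branch whose train.py accepts --kpt-label.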

/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
github: fatal: ambiguous argument 'pose..origin/master': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Command 'git rev-list pose..origin/master --count' returned non-zero exit status 128.
Traceback (most recent call last):
  File "train.py", line 543, in <module>
    assert torch.cuda.device_count() > opt.local_rank
AssertionError
(The same traceback is printed by each failing worker process, ranks 1 to 7; the copies were interleaved in the original output.)
YOLOv5 🚀 cad7aca torch 1.13.1+cu116 CUDA:0 (Tesla T4, 15109.875MB)

Added key: store_based_barrier_key:1 to store for rank: 0
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2575 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2576) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-02-08_09:15:15
  host      : 889fd56b03f0
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2577)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-02-08_09:15:15
  host      : 889fd56b03f0
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2578)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-02-08_09:15:15
  host      : 889fd56b03f0
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 2579)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-02-08_09:15:15
  host      : 889fd56b03f0
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 2580)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-02-08_09:15:15
  host      : 889fd56b03f0
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 2581)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-02-08_09:15:15
  host      : 889fd56b03f0
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 2582)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-08_09:15:15
  host      : 889fd56b03f0
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2576)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
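
In the log above, train.py asserts torch.cuda.device_count() > opt.local_rank, and the only device reported is CUDA:0 (Tesla T4), so every rank from 1 upward fails. A minimal sketch of picking launch settings that satisfy that assertion follows. The helper name plan_launch is hypothetical, and the sketch does not cover whether other options such as --batch-size need to change for fewer GPUs:

```python
from typing import Tuple


def plan_launch(available_gpus: int, requested_procs: int) -> Tuple[int, str]:
    """Cap the worker count at the number of visible GPUs and build a matching --device string.

    Hypothetical helper for illustration; returns (nproc_per_node, device_arg).
    """
    if available_gpus < 1:
        raise ValueError("no CUDA device visible")
    nproc = min(requested_procs, available_gpus)
    device_arg = ",".join(str(i) for i in range(nproc))
    return nproc, device_arg


if __name__ == "__main__":
    # Assumption: one visible GPU, as in the log, with 8 processes requested.
    nproc, device_arg = plan_launch(available_gpus=1, requested_procs=8)
    print(f"--nproc_per_node {nproc} --device {device_arg}")
```

For one visible GPU and eight requested processes this yields a single process on device 0, so every local_rank stays below the device count.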

AssertionError: Invalid device id #1463