WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
GNU General Public License v3.0

AssertionError: Invalid device id #1463

Open akatendra opened 1 year ago

akatendra commented 1 year ago

Hi!

I am getting this error: AssertionError: Invalid device id.

Please, help!

I am trying to use: https://github.com/WongKinYiu/yolov7/tree/pose

I created a Colab notebook (with GPU) with this code:

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

!# Download yolov7-pose code
!git clone https://github.com/WongKinYiu/yolov7.git -b pose
%cd yolov7
%pip install -r requirements.txt # install

import sys
import torch
print(f"Python version: {sys.version}, {sys.version_info} ")
print(f"Pytorch version: {torch.__version__} ")

import os
os.environ.setdefault('OMP_NUM_THREADS', '8')

Start training:

!python -m torch.distributed.launch --nproc_per_node 8 --master_port 9527 train.py --data data/coco_kpts.yaml --cfg cfg/yolov7-w6-pose.yaml --weights /content/drive/MyDrive/cv_tennis/YOLOV7-pose/weights/yolov7-w6-person.pt --batch-size 128 --img 960 --kpt-label --sync-bn --device 0,1,2,3,4,5,6,7 --name yolov7-w6-pose --hyp data/hyp.pose.yaml

And I get an error:

/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
github: fatal: ambiguous argument 'pose..origin/master': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Command 'git rev-list pose..origin/master --count' returned non-zero exit status 128.
Traceback (most recent call last):
  File "train.py", line 541, in <module>
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/content/yolov7/utils/torch_utils.py", line 80, in select_device
    p = torch.cuda.get_device_properties(i)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 374, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26995 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26996 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26997 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 27001 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 26998) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-02-05_10:13:47
  host      : 71408a468f93
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 26999)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-02-05_10:13:47
  host      : 71408a468f93
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 27000)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-02-05_10:13:47
  host      : 71408a468f93
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 27002)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-05_10:13:47
  host      : 71408a468f93
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 26998)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

What am I doing wrong?

Thanks!

akatendra commented 1 year ago

Hi!

I tried to solve the problem with the first error:

   warnings.warn(
github: fatal: ambiguous argument 'pose..origin/master': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Command 'git rev-list pose..origin/master --count' returned non-zero exit status 128.

I changed the command that clones the Git repository from:

!git clone https://github.com/WongKinYiu/yolov7.git -b pose

to:

!git clone https://github.com/WongKinYiu/yolov7.git
!git checkout -b pose yolov7/pose

Now I get a different error: train.py: error: unrecognized arguments: --kpt-label.

How can this be fixed?

Error message:

/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
usage: train.py
       [-h]
       [--weights WEIGHTS]
       [--cfg CFG]
       [--data DATA]
       [--hyp HYP]
       [--epochs EPOCHS]
       [--batch-size BATCH_SIZE]
       [--img-size IMG_SIZE [IMG_SIZE ...]]
       [--rect]
       [--resume [RESUME]]
       [--nosave]
       [--notest]
       [--noautoanchor]
       [--evolve]
       [--bucket BUCKET]
       [--cache-images]
       [--image-weights]
       [--device DEVICE]
       [--multi-scale]
       [--single-cls]
       [--adam]
       [--sync-bn]
       [--local_rank LOCAL_RANK]
       [--workers WORKERS]
       [--project PROJECT]
       [--entity ENTITY]
       [--name NAME]
       [--exist-ok]
       [--quad]
       [--linear-lr]
       [--label-smoothing LABEL_SMOOTHING]
       [--upload_dataset]
       [--bbox_interval BBOX_INTERVAL]
       [--save_period SAVE_PERIOD]
       [--artifact_alias ARTIFACT_ALIAS]
       [--freeze FREEZE [FREEZE ...]]
       [--v5-metric]
train.py: error: unrecognized arguments: --kpt-label
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 9461) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 9462)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 2 (local_rank: 2)
  exitcode  : 2 (pid: 9463)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 3 (local_rank: 3)
  exitcode  : 2 (pid: 9464)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 4 (local_rank: 4)
  exitcode  : 2 (pid: 9465)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 5 (local_rank: 5)
  exitcode  : 2 (pid: 9466)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 6 (local_rank: 6)
  exitcode  : 2 (pid: 9467)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 7 (local_rank: 7)
  exitcode  : 2 (pid: 9468)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-05_18:44:57
  host      : 77f91b524d5d
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 9461)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
akatendra commented 1 year ago

I tried changing --device 0,1,2,3,4,5,6,7 to --device 0 and setting OMP_NUM_THREADS to '1':

os.environ.setdefault('OMP_NUM_THREADS', '1')

I get this error:

/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
github: fatal: ambiguous argument 'pose..origin/master': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Command 'git rev-list pose..origin/master --count' returned non-zero exit status 128.
Traceback (most recent call last):
  File "train.py", line 543, in <module>
    assert torch.cuda.device_count() > opt.local_rank
AssertionError
YOLOv5 🚀 cad7aca torch 1.13.1+cu116 CUDA:0 (Tesla T4, 15109.875MB)

Added key: store_based_barrier_key:1 to store for rank: 0
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2575 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2576) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-02-08_09:15:15
  host      : 889fd56b03f0
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2577)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-02-08_09:15:15
  host      : 889fd56b03f0
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2578)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-02-08_09:15:15
  host      : 889fd56b03f0
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 2579)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-02-08_09:15:15
  host      : 889fd56b03f0
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 2580)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-02-08_09:15:15
  host      : 889fd56b03f0
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 2581)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-02-08_09:15:15
  host      : 889fd56b03f0
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 2582)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-08_09:15:15
  host      : 889fd56b03f0
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2576)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
b-niu commented 11 months ago

Hi @akatendra, it's a typo in the tutorial. If you check the YOLOv5 tutorial, the correct command should be:

python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1

not torch.distributed.launch.

Hope this helps.
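
A note for anyone landing here with the same assertion: `Invalid device id` and `assert torch.cuda.device_count() > opt.local_rank` both suggest that more worker processes were launched than there are visible GPUs (the last log above reports only `CUDA:0 (Tesla T4, ...)`). Below is a minimal sketch, not part of the repo, that sizes `--nproc_per_node` and `--device` to the detected GPU count. The helper names `build_launch_command` and `detect_gpu_count` are hypothetical, and the extra `train.py` arguments are placeholders.

```python
import sys


def build_launch_command(gpu_count, train_args):
    """Build a torch.distributed.run command sized to the visible GPUs.

    gpu_count: number of CUDA devices, e.g. from torch.cuda.device_count().
    train_args: list of extra arguments passed through to train.py.
    (Hypothetical helper, for illustration only.)
    """
    if gpu_count < 1:
        raise ValueError("no CUDA device visible; this launch needs at least one GPU")
    # One process per GPU, and one device id per process: 0,1,...,gpu_count-1
    device_ids = ",".join(str(i) for i in range(gpu_count))
    return [
        sys.executable, "-m", "torch.distributed.run",
        "--nproc_per_node", str(gpu_count),
        "train.py",
        "--device", device_ids,
        *train_args,
    ]


def detect_gpu_count():
    """Return the number of CUDA devices, or 0 if torch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


if __name__ == "__main__":
    n = detect_gpu_count()
    print(f"visible CUDA devices: {n}")
    if n:
        # Placeholder training arguments; substitute your own.
        print(" ".join(build_launch_command(n, ["--batch-size", "16"])))
```

On a runtime with a single GPU this produces `--nproc_per_node 1 --device 0`; in that case it may be simpler to skip the distributed launcher and run `train.py` directly with `--device 0`.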