Thank you for this excellent work; I am learning a lot from it. However, I am stuck: when I try to train this on Colab Pro, I get the error below.
```
/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  FutureWarning,
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Traceback (most recent call last):
  File "train.py", line 284, in <module>
Traceback (most recent call last):
  File "train.py", line 284, in <module>
Traceback (most recent call last):
  File "train.py", line 284, in <module>
    main(args)
  File "train.py", line 129, in main
Traceback (most recent call last):
  File "train.py", line 284, in <module>
    main(args)
  File "train.py", line 129, in main
    utils.init_distributed_mode(args)
  File "/content/gdrive/MyDrive/Colab_Notebooks/TransVG/utils/misc.py", line 453, in init_distributed_mode
    utils.init_distributed_mode(args)
  File "/content/gdrive/MyDrive/Colab_Notebooks/TransVG/utils/misc.py", line 453, in init_distributed_mode
    main(args)
  File "train.py", line 129, in main
    main(args)
  File "train.py", line 129, in main
    torch.cuda.set_device(args.gpu)
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    utils.init_distributed_mode(args)
  File "/content/gdrive/MyDrive/Colab_Notebooks/TransVG/utils/misc.py", line 453, in init_distributed_mode
    utils.init_distributed_mode(args)
  File "/content/gdrive/MyDrive/Colab_Notebooks/TransVG/utils/misc.py", line 453, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    torch.cuda.set_device(args.gpu)
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    torch.cuda.set_device(args.gpu)
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
  File "train.py", line 284, in <module>
    main(args)
  File "train.py", line 129, in main
    utils.init_distributed_mode(args)
  File "/content/gdrive/MyDrive/Colab_Notebooks/TransVG/utils/misc.py", line 453, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
| distributed init (rank 0): env://
Traceback (most recent call last):
  File "train.py", line 284, in <module>
    main(args)
  File "train.py", line 129, in main
    utils.init_distributed_mode(args)
  File "/content/gdrive/MyDrive/Colab_Notebooks/TransVG/utils/misc.py", line 453, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
  File "train.py", line 284, in <module>
    main(args)
  File "train.py", line 129, in main
    utils.init_distributed_mode(args)
  File "/content/gdrive/MyDrive/Colab_Notebooks/TransVG/utils/misc.py", line 453, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3165 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 3166) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
[1]:
  time       : 2022-03-28_19:11:13
  host       : 5c6c6efee186
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 3167)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time       : 2022-03-28_19:11:13
  host       : 5c6c6efee186
  rank       : 3 (local_rank: 3)
  exitcode   : 1 (pid: 3168)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time       : 2022-03-28_19:11:13
  host       : 5c6c6efee186
  rank       : 4 (local_rank: 4)
  exitcode   : 1 (pid: 3169)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time       : 2022-03-28_19:11:13
  host       : 5c6c6efee186
  rank       : 5 (local_rank: 5)
  exitcode   : 1 (pid: 3170)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time       : 2022-03-28_19:11:13
  host       : 5c6c6efee186
  rank       : 6 (local_rank: 6)
  exitcode   : 1 (pid: 3171)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time       : 2022-03-28_19:11:13
  host       : 5c6c6efee186
  rank       : 7 (local_rank: 7)
  exitcode   : 1 (pid: 3172)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time       : 2022-03-28_19:11:13
  host       : 5c6c6efee186
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 3166)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
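Reading the tracebacks, every failing rank dies inside torch.cuda.set_device(args.gpu) with "invalid device ordinal", which as far as I understand means the process asked for a GPU index that does not exist on this machine. The summary shows local ranks 1 through 7 crashing while rank 0 got through distributed init, so my guess is that the launcher spawns eight workers but the Colab Pro session only exposes one GPU. A quick, TransVG-independent check of that assumption:

```python
import torch

# How many CUDA devices does this runtime actually expose?
# Colab (including Pro) sessions normally report a single GPU, and any
# set_device(i) with i >= device_count() raises "invalid device ordinal".
print(torch.cuda.is_available())   # expect True on a GPU runtime
print(torch.cuda.device_count())   # expect 1 on Colab
```

If device_count() really is 1, I assume --nproc_per_node in the launch command has to be 1 as well, since each spawned rank calls set_device with its own local rank.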
How do I sort this out?
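Separately, the FutureWarning at the top of the log says torch.distributed.launch is deprecated in favour of torchrun, which passes the local rank via os.environ['LOCAL_RANK'] instead of a --local_rank argument. This is my reading of that migration as a minimal sketch; it is not TransVG's actual init_distributed_mode from utils/misc.py, just the pattern the warning describes:

```python
import os

import torch
import torch.distributed as dist


def init_distributed_mode():
    """Sketch of torchrun-style init per the FutureWarning (not TransVG's code)."""
    if "LOCAL_RANK" not in os.environ:
        print("Not launched with torchrun; running single-process.")
        return
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each worker
    torch.cuda.set_device(local_rank)           # fails if local_rank >= device_count()
    dist.init_process_group(backend="nccl", init_method="env://")
```

Would launching with something like `torchrun --nproc_per_node=1 train.py ...` (keeping the rest of the training arguments unchanged) be the right way to run this on Colab?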