ModelTC / United-Perception

United Perception
Apache License 2.0

Bug in multi-GPU training: single-GPU training works with ng=1, but multi-GPU training hits the following problem #60

Open Leeon-K opened 1 year ago

Leeon-K commented 1 year ago

2023-01-30 06:57:42,675-rk0-launch.py#86:Rank 0 initialization finished.
2023-01-30 06:57:42,678-rk0-launch.py#86:Rank 1 initialization finished.
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/changkang.li/EOD/up/__main__.py", line 28, in <module>
main()
File "/home/changkang.li/EOD/up/__main__.py", line 22, in main
args.run(args)
File "/home/changkang.li/EOD/up/commands/train.py", line 163, in _main
launch(main, args.num_gpus_per_machine, args.num_machines, args=args, start_method=args.fork_method)
File "/home/changkang.li/EOD/up/utils/env/launch.py", line 52, in launch
mp.start_processes(
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/EOD/up/utils/env/launch.py", line 113, in _distributed_worker
dist_helper.barrier()
File "/home//EOD/up/utils/env/dist_helper.py", line 139, in barrier
dist_barrier(*args, **kwargs)
File "/home/EOD/up/utils/env/dist_helper.py", line 124, in dist_barrier
dist.barrier(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
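
The error message above suggests passing CUDA_LAUNCH_BLOCKING=1 to get a more reliable stack trace. A minimal sketch of one way to do that from Python follows; the helper name `debug_env` is hypothetical and not part of this repo, and the training command itself is deliberately left out because it depends on your setup.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set.

    `debug_env` is a hypothetical helper for illustration only.
    """
    env = dict(os.environ if base is None else base)
    # Suggested by the error message: makes kernel launches synchronous so
    # the reported stack trace is closer to the call that actually failed.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


if __name__ == "__main__":
    env = debug_env()
    print("CUDA_LAUNCH_BLOCKING =", env["CUDA_LAUNCH_BLOCKING"])
    # Pass `env=env` to subprocess.run([...]) together with your usual
    # training command so that every worker process inherits the setting.
```

The variable has to be in the environment before the worker processes start, which is why the sketch builds the environment for the launcher rather than setting it inside a worker.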

Nina-yang commented 1 year ago

I'm hitting the same issue and would appreciate any suggestions.

yqyao commented 1 year ago

Could you provide your environment details, such as PyTorch version, CUDA version, etc.? @Nina-yang @Lick0920
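
For anyone gathering the details requested above, here is a small sketch that prints the Python, torch and torchvision versions plus the CUDA/ROCm build information. The function name `collect_env` is hypothetical; the torch attributes used (`torch.version.cuda`, `torch.version.hip`, `torch.cuda.is_available()`, `torch.cuda.device_count()`) are read defensively so the script still runs when torch is not installed.

```python
import importlib
import platform


def collect_env(packages=("torch", "torchvision")):
    """Collect version information that is useful in a bug report."""
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            info[name] = "not installed"
            continue
        info[name] = getattr(module, "__version__", "unknown")

    try:
        import torch
    except ImportError:
        return info

    # CUDA builds report a version string here; ROCm builds report None.
    info["torch.version.cuda"] = torch.version.cuda
    # Present on ROCm builds; getattr guards against builds that lack it.
    info["torch.version.hip"] = getattr(torch.version, "hip", None)
    info["cuda_available"] = torch.cuda.is_available()
    info["device_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

PyTorch also ships its own report script, `python -m torch.utils.collect_env`, which produces a more complete listing.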

Nina-yang commented 1 year ago

torch version: 1.8.1+cu111, CUDA version: 11.1. I have a machine with 8 Tesla V100-SXM2 32G GPUs, but because of this error I can only use 1 GPU for now, and training takes too long.

yqyao commented 1 year ago

You could try Python 3.6.9 (the version we use). Meanwhile, we will try to reproduce your issue on our machines. @Nina-yang

Nina-yang commented 1 year ago

Thanks. For reference, my Python version is 3.7.3.

Leeon-K commented 1 year ago

My environment is as follows: Python 3.8.10, torch 1.10.1+rocm4.1, torchvision 0.11.2+rocm4.1.
I tried Python 3.6.9, but then a new issue appeared: `from __future__ import annotations ^ SyntaxError: future feature annotations is not defined`. I searched for a solution to this problem, and the advice is to upgrade Python to >3.7....

yqyao commented 1 year ago

Where is that code in our repo? I think you need to reinstall PyTorch. @Lick0920

Leeon-K commented 1 year ago

Thank you, I will try to reinstall PyTorch.

Traceback (most recent call last):
File "/home/anaconda3/envs/up_py3.6/lib/python3.6/runpy.py", line 183, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/home/anaconda3/envs/up_py3.6/lib/python3.6/runpy.py", line 142, in _get_module_details
return _get_module_details(pkg_main_name, error)
File "/home/anaconda3/envs/up_py3.6/lib/python3.6/runpy.py", line 109, in _get_module_details
__import__(pkg_name)
File "/home/EOD/up/__init__.py", line 21, in <module>
from .commands import *
File "/home/EOD/up/commands/__init__.py", line 5, in <module>
from .flops import Flops # noqa
File "/home/EOD/up/commands/flops.py", line 5, in <module>
from prettytable import PrettyTable
File "/home/.local/lib/python3.6/site-packages/prettytable-3.6.0-py3.6.egg/prettytable/__init__.py", line 1
from __future__ import annotations
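
The traceback above ends inside the installed prettytable 3.6.0 package rather than in the repo's own code, and the failing line is the annotations future import, which Python only accepts from version 3.7 on (PEP 563). One plausible reading, not confirmed in this thread, is that the installed prettytable release is too new for a Python 3.6 environment. The sketch below only checks whether a given interpreter version accepts that import; the helper name `supports_future_annotations` is hypothetical.

```python
import sys

# The annotations future import (PEP 563) is accepted from Python 3.7 onward.
MIN_VERSION = (3, 7)


def supports_future_annotations(version=None):
    """Return True if the given (major, minor, ...) version accepts the import.

    Defaults to the running interpreter when no version is passed.
    """
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MIN_VERSION


if __name__ == "__main__":
    for candidate in [(3, 6, 9), (3, 7, 3), (3, 8, 10)]:
        label = ".".join(str(part) for part in candidate)
        print(label, supports_future_annotations(candidate))
```

With the versions mentioned in this thread, 3.6.9 fails the check while 3.7.3 and 3.8.10 pass it, which matches the SyntaxError appearing only in the Python 3.6 environment.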

hxy0307 commented 1 year ago

Hello, I am also using an AMD GPU and found that numba cannot run on it: running the code fails with a CUDA_ERROR_NOT_INITIALIZED error. Did you run into the same error?