Closed mate-huaboy closed 1 year ago
Hello, and thank you for your inspiring work. Part of my current work builds on yours, but I have run into a problem I cannot solve on my own. I followed your instructions to download the required files and set up the environment. Unlike your system, my machine has a single GPU, so I changed this line to `os.environ['CUDA_VISIBLE_DEVICES'] = '0'`. At runtime, however, this line in the function `per_processor`
estimator = torch.nn.parallel.DistributedDataParallel(estimator, device_ids=[gpu], output_device=gpu, find_unused_parameters=True)
raises `RuntimeError: CUDA error: no kernel image is available for execution on the device`. I tried to fix it without success. I am fairly sure my environment is not the cause, because within the code I can create a tensor and move it to GPU 0. I am not very familiar with distributed training, so my question is: can the code be adapted to a single-machine, single-GPU system with a simple change? I would be very grateful for a reply. Thank you!

The full program output is below:
`init gps:0
Traceback (most recent call last):
  File "/home/lwh/anaconda3/envs/mypose/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/lwh/anaconda3/envs/mypose/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/lwh/.vscode-server/extensions/ms-python.python-2022.4.0/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/lwh/.vscode-server/extensions/ms-python.python-2022.4.0/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/home/lwh/.vscode-server/extensions/ms-python.python-2022.4.0/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/home/lwh/anaconda3/envs/mypose/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/lwh/anaconda3/envs/mypose/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/lwh/anaconda3/envs/mypose/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "train.py", line 476, in <module>
    main()
  File "train.py", line 117, in main
    mp.spawn(per_processor, nprocs=opt.gpu_number, args=(opt,))
  File "/home/lwh/anaconda3/envs/mypose/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/lwh/anaconda3/envs/mypose/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/lwh/anaconda3/envs/mypose/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/lwh/anaconda3/envs/mypose/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/lwh/data/ES6D_forMe/train.py", line 214, in per_processor
    estimator = torch.nn.parallel.DistributedDataParallel(estimator, device_ids=[gpu], output_device=gpu, find_unused_parameters=True)
  File "/home/lwh/anaconda3/envs/mypose/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__
    self.broadcast_bucket_size)
  File "/home/lwh/anaconda3/envs/mypose/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: CUDA error: no kernel image is available for execution on the device`
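On the single-GPU question, one possible approach is to skip the `DistributedDataParallel` wrapper when only one process is running. The following is a minimal sketch, not the repository's code: the helper name `maybe_wrap_ddp` and its `world_size` parameter are hypothetical, and it assumes the process group has already been initialized whenever more than one process is used. Skipping the wrapper only changes where the first CUDA kernel runs; it does not change whether the installed PyTorch build contains kernels for the GPU.

```python
def maybe_wrap_ddp(model, gpu, world_size):
    """Wrap `model` in DistributedDataParallel only when several processes train together.

    `maybe_wrap_ddp` and `world_size` are hypothetical names used for illustration.
    """
    if world_size <= 1:
        # Single process, single GPU: no gradient synchronization is needed,
        # so the model is returned unchanged.
        return model

    # Imported lazily so the single-GPU path never touches the distributed package.
    from torch.nn.parallel import DistributedDataParallel

    # Same arguments as the call shown in the traceback above.
    return DistributedDataParallel(
        model,
        device_ids=[gpu],
        output_device=gpu,
        find_unused_parameters=True,
    )
```

In a training function like `per_processor`, such a helper would be called with the number of spawned processes (for example the value passed as `nprocs` to `mp.spawn`).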
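I found that the problem was caused by a mismatch between my PyTorch version and my CUDA version. My GPU is a 3080 with CUDA 11.4. Reinstalling PyTorch with the following command fixed it:
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
I hope this helps anyone who runs into the same problem.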
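A "no kernel image is available" error generally means the installed PyTorch binary was not compiled for the GPU's compute capability (8.6 for an RTX 3080). The sketch below shows one way to check this before training. The helper names `parse_arch`, `build_supports_gpu`, and `check_local_gpu` are hypothetical, and it assumes a PyTorch version that provides `torch.cuda.get_arch_list()`.

```python
def parse_arch(arch):
    """Split a tag such as 'sm_86' or 'compute_70' into (kind, (major, minor))."""
    kind, _, number = arch.partition("_")
    # Keep digits only, dropping suffixes such as the 'a' in 'sm_90a'.
    digits = "".join(ch for ch in number if ch.isdigit())
    return kind, (int(digits[:-1]), int(digits[-1]))


def build_supports_gpu(capability, arch_list):
    """Return True if a build compiled for `arch_list` can run on a GPU
    with the given (major, minor) compute capability."""
    capability = tuple(capability)
    for arch in arch_list:
        kind, (major, minor) = parse_arch(arch)
        # Compiled code (sm_XY) runs only on GPUs with the same major version
        # and an equal or newer minor version.
        if kind == "sm" and major == capability[0] and minor <= capability[1]:
            return True
        # PTX (compute_XY) can be JIT-compiled for any equal or newer GPU.
        if kind == "compute" and (major, minor) <= capability:
            return True
    return False


def check_local_gpu():
    """Print what the local PyTorch build and GPU 0 report (needs PyTorch)."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
        return
    if not torch.cuda.is_available():
        print("No CUDA device is visible")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
    print("GPU capability:", capability, "compiled for:", arch_list)
    print("supported:", build_supports_gpu(capability, arch_list))


if __name__ == "__main__":
    check_local_gpu()
```

For example, with an illustrative architecture list that stops at `sm_70` (not taken from this issue), `build_supports_gpu((8, 6), ["sm_37", "sm_50", "sm_60", "sm_70"])` returns `False`, while a list that includes `sm_86` returns `True`. This may also explain why moving a tensor to the GPU succeeded: a host-to-device copy does not need a compiled kernel, so it cannot reveal the mismatch.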