Hi, I'd like to ask about a problem I ran into during training. I ran:
cd Chinese-CLIP/
bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}
and got the following error:
root@clip-test-d9cd48656-q2zbl:~/workspace/clip/Chinese-CLIP# bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ../clip_set/
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Traceback (most recent call last):
  File "cn_clip/training/main.py", line 300, in <module>
    main()
  File "cn_clip/training/main.py", line 54, in main
    torch.cuda.set_device(args.local_device_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Traceback (most recent call last):
  File "cn_clip/training/main.py", line 300, in <module>
    main()
  File "cn_clip/training/main.py", line 54, in main
    torch.cuda.set_device(args.local_device_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Traceback (most recent call last):
  File "cn_clip/training/main.py", line 300, in <module>
    main()
  File "cn_clip/training/main.py", line 54, in main
    torch.cuda.set_device(args.local_device_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Traceback (most recent call last):
  File "cn_clip/training/main.py", line 300, in <module>
    main()
  File "cn_clip/training/main.py", line 54, in main
    torch.cuda.set_device(args.local_device_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Traceback (most recent call last):
  File "cn_clip/training/main.py", line 300, in <module>
    main()
  File "cn_clip/training/main.py", line 54, in main
    torch.cuda.set_device(args.local_device_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Traceback (most recent call last):
  File "cn_clip/training/main.py", line 300, in <module>
    main()
  File "cn_clip/training/main.py", line 54, in main
    torch.cuda.set_device(args.local_device_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Traceback (most recent call last):
  File "cn_clip/training/main.py", line 300, in <module>
    main()
  File "cn_clip/training/main.py", line 54, in main
    torch.cuda.set_device(args.local_device_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 722 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 723) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
cn_clip/training/main.py FAILED
Failures:
[1]:
  time      : 2023-02-21_09:58:00
  host      : clip-test-d9cd48656-q2zbl
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 724)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-02-21_09:58:00
  host      : clip-test-d9cd48656-q2zbl
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 725)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-02-21_09:58:00
  host      : clip-test-d9cd48656-q2zbl
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 726)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-02-21_09:58:00
  host      : clip-test-d9cd48656-q2zbl
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 727)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-02-21_09:58:00
  host      : clip-test-d9cd48656-q2zbl
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 728)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-02-21_09:58:00
  host      : clip-test-d9cd48656-q2zbl
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 729)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2023-02-21_09:58:00
  host      : clip-test-d9cd48656-q2zbl
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 723)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Can you tell what is causing this?
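One thing that might be worth checking: in the log, local ranks 1 through 7 all fail in `torch.cuda.set_device`, while local rank 0 (pid 722) is only terminated afterwards. That pattern would be consistent with the launcher starting 8 processes while fewer GPUs (possibly just one) are visible inside the container, but this is a guess from the log, not a confirmed diagnosis. Below is a minimal sketch for comparing the two numbers. The helper names `failing_local_ranks` and `visible_gpu_count` are hypothetical, and the process count of 8 is an assumption inferred from the local ranks 0 to 7 in the log.

```python
def failing_local_ranks(nproc_per_node, visible_gpus):
    """Return the local ranks whose set_device call would be out of range.

    Valid CUDA device ordinals are 0 .. visible_gpus - 1, so any local
    rank >= visible_gpus cannot be mapped to a device.
    """
    return [rank for rank in range(nproc_per_node) if rank >= visible_gpus]


def visible_gpu_count():
    """Number of CUDA devices this process can see, or None if torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.device_count()


if __name__ == "__main__":
    # Assumption: 8 processes per node, inferred from local ranks 0-7 in the log.
    nproc_per_node = 8
    gpus = visible_gpu_count()
    print("visible GPUs:", gpus)
    if gpus is not None:
        print("local ranks that would fail:", failing_local_ranks(nproc_per_node, gpus))
```

If the reported GPU count is smaller than the number of processes the script launches, lowering the per-node process count in the run script to match (or exposing more GPUs to the container) would be the first thing to try.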