OFA-Sys / Chinese-CLIP

Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.

Problem when packaging the model #60

Open JeffMony opened 1 year ago

JeffMony commented 1 year ago

Hello, a quick question. When I start training with:

```bash
cd Chinese-CLIP/
bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}
```

the following error appears:

```text
root@clip-test-d9cd48656-q2zbl:~/workspace/clip/Chinese-CLIP# bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ../clip_set/
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded,
please further tune the variable for optimal performance in your application as needed.

Traceback (most recent call last):
  File "cn_clip/training/main.py", line 300, in <module>
    main()
  File "cn_clip/training/main.py", line 54, in main
    torch.cuda.set_device(args.local_device_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal

(the same traceback is printed six more times, once for each of the other failing worker processes)

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 722 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 723) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
cn_clip/training/main.py FAILED
------------------------------------------------------------
Failures:
  [1]:
    time       : 2023-02-21_09:58:00
    host       : clip-test-d9cd48656-q2zbl
    rank       : 2 (local_rank: 2)
    exitcode   : 1 (pid: 724)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [2]-[6]: ranks 3-7 (local_rank 3-7), pids 725-729, same time, host, exit code, and traceback note as [1]
------------------------------------------------------------
Root Cause (first observed failure):
  [0]:
    time       : 2023-02-21_09:58:00
    host       : clip-test-d9cd48656-q2zbl
    rank       : 1 (local_rank: 1)
    exitcode   : 1 (pid: 723)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

Can you tell what is causing this?

DtYXs commented 1 year ago

Hi, please check whether the number of GPUs you are actually using matches the GPUS_PER_NODE setting in the shell script. For example, if you want to run on 8 GPUs, set GPUS_PER_NODE=8 in the script, run `export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`, and then launch the script again to see whether training starts normally. A mismatch between the installed PyTorch version and your CUDA version is another possible cause, so please check that as well.
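A minimal sketch of that check and fix, assuming an 8-GPU node and the GPUS_PER_NODE variable inside run_scripts/muge_finetune_vit-b-16_rbt-base.sh (adjust the device list and count to what your container actually has):

```bash
# 1. See how many GPUs are actually visible inside the container.
nvidia-smi --list-gpus
python3 -c "import torch; print(torch.cuda.device_count())"

# 2. Expose exactly the GPUs you intend to train on (example: all 8).
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# 3. Make sure GPUS_PER_NODE in the script equals that number
#    (edit run_scripts/muge_finetune_vit-b-16_rbt-base.sh so GPUS_PER_NODE=8,
#    or lower it to match the GPUs you really have).
cd Chinese-CLIP/
bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}
```

If `torch.cuda.device_count()` reports fewer devices than GPUS_PER_NODE, the launcher spawns workers whose local rank has no matching GPU, and the call to `torch.cuda.set_device(args.local_device_rank)` fails with exactly the "invalid device ordinal" error shown in the log above.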