THUDM / VisualGLM-6B

Chinese and English multimodal conversational language model
Apache License 2.0

I ran into a problem when using multiple GPUs #274

Open 666webai opened 11 months ago

666webai commented 11 months ago

Ubuntu 22.04, environment managed with Anaconda, Python 3.10, CUDA 11.8, hardware: 4x P104-100 (8 GB each). The model has already been downloaded locally.

Referring to https://github.com/THUDM/VisualGLM-6B/issues/102 — can VisualGLM-6B be deployed across multiple GPUs?

Because the cards have limited VRAM, I cannot run the unquantized model directly. In the /VisualGLM-6B directory, python web_demo_hf.py --quant 4 --share runs fine — up to this point there is no problem and the web page opens.

But when I tried to deploy on multiple GPUs, I hit a problem:

torchrun --nnode 1 --nproc_per_node= 4 web_demo_hf.py --quant 4 --share

Traceback (most recent call last):
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 632, in determine_local_world_size
    return int(nproc_per_node)
ValueError: invalid literal for int() with base 10: ''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/wtchen/anaconda3/envs/v6b2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 784, in run
    config, cmd, cmd_args = config_from_args(args)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 691, in config_from_args
    nproc_per_node = determine_local_world_size(args.nproc_per_node)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 650, in determine_local_world_size
    raise ValueError(f"Unsupported nproc_per_node value: {nproc_per_node}") from e
ValueError: Unsupported nproc_per_node value:

1049451037 commented 11 months ago

There is an extra space between the equals sign and the 4, I think.

666webai commented 11 months ago

OK, I'll give that a try.

666webai commented 11 months ago

torchrun --nnode 1 --nproc_per_node=4 web_demo_hf.py --quant 4 --share

WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[2023-09-20 02:54:07,442] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 02:54:07,442] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 02:54:07,443] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 02:54:07,443] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 02:54:16,232] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-09-20 02:54:16,283] [INFO] [RANK 0] You are using model-only mode. For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.

/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
Loading checkpoint shards:  60%|████████████████████████████████████████████████████▏ | 3/5 [01:33<01:05, 32.98s/it]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 913 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 914 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 915 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 916) of binary: /home/wtchen/anaconda3/envs/v6b2/bin/python
Traceback (most recent call last):
  File "/home/wtchen/anaconda3/envs/v6b2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

web_demo_hf.py FAILED

Failures:

----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-20_02:56:44
  host      : wtcai4x104
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 916)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 916

666webai commented 11 months ago

The RAM fills up; there is still plenty of swap space left, but the process gets killed anyway.
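
For what it's worth, exit code -9 is SIGKILL, which on Linux usually means the kernel OOM killer; it can fire even with swap free if pages cannot be reclaimed fast enough or a cgroup limit is hit. The log above says "initializing model parallel with size 1" and "model-only mode", so torchrun is simply starting four independent copies of the demo, and each copy loads the full checkpoint shards into host RAM before quantizing. A rough sanity check in Python — a sketch only; the ~16 GB shard size and the psutil dependency are assumptions, not numbers from this thread:

# Rough sanity check (hypothetical, not from the repo): will N independent copies of the
# demo fit in host RAM while the fp16 checkpoint shards are being loaded?
import psutil  # assumption: psutil is installed (pip install psutil)

CHECKPOINT_BYTES = 16 * 1024**3   # assumption: roughly 16 GB of fp16 shards for visualglm-6b
NPROC = 4                         # one copy per process started by torchrun --nproc_per_node=4

available = psutil.virtual_memory().available
needed = CHECKPOINT_BYTES * NPROC  # every process loads the full shards before quantizing
print(f"available RAM: {available / 1024**3:.1f} GiB, rough peak need: {needed / 1024**3:.1f} GiB")
if needed > available:
    print("loading will likely trigger the OOM killer even though swap looks free")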

666webai commented 11 months ago

Removing the space did indeed solve part of the problem. I then mounted a larger swap file, which got past the kill, but it still fails. I'll try again later.

torchrun --nnode 1 --nproc_per_node=4 web_demo_hf.py --quant 4 --share

WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[2023-09-20 04:27:47,668] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 04:27:47,668] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 04:27:47,668] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 04:27:47,668] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 04:27:56,387] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-09-20 04:27:56,431] [INFO] [RANK 0] You are using model-only mode. For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 5/5 [06:11<00:00, 74.26s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 5/5 [06:11<00:00, 74.28s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 5/5 [06:12<00:00, 74.41s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 5/5 [06:12<00:00, 74.43s/it]

/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:104: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
  image_path = gr.Image(type="filepath", label="Image Prompt", value=None).style(height=504)
/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:106: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
  chatbot = gr.Chatbot().style(height=480)
/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:116: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
  user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=4).style(
3.44.4
Running on local URL: http://0.0.0.0:9088
/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:104: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
  image_path = gr.Image(type="filepath", label="Image Prompt", value=None).style(height=504)
/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:106: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
  chatbot = gr.Chatbot().style(height=480)
/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:116: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
  user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=4).style(
3.44.4
Traceback (most recent call last):
  File "/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py", line 143, in <module>
    main(args)
  File "/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py", line 135, in main
    demo.queue().launch(share=args.share, inbrowser=True, server_name='0.0.0.0', server_port=9088)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/gradio/blocks.py", line 1907, in launch
    ) = networking.start_server(
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/gradio/networking.py", line 207, in start_server
    raise OSError(
OSError: Cannot find empty port in range: 9088-9088. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the server_port parameter to launch().
[the same GradioDeprecationWarning lines and the same OSError traceback are printed twice more, once by each of the two remaining processes]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1657 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1655) of binary: /home/wtchen/anaconda3/envs/v6b2/bin/python
Traceback (most recent call last):
  File "/home/wtchen/anaconda3/envs/v6b2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

web_demo_hf.py FAILED

Failures:
[1]:
  time      : 2023-09-20_04:43:15
  host      : wtcai4x104
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1656)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-09-20_04:43:15
  host      : wtcai4x104
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1658)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2023-09-20_04:43:15
  host      : wtcai4x104
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1655)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
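
Side note on what this second run shows: "initializing model parallel with size 1" and "model-only mode" mean no model parallelism is actually happening. torchrun just runs four full copies of web_demo_hf.py; one of them binds port 9088 first ("Running on local URL: http://0.0.0.0:9088") and the other three crash with "OSError: Cannot find empty port in range: 9088-9088". If the goal were only to stop that crash (it still leaves four independent copies of the model, one per GPU), a hypothetical tweak would be to offset the hard-coded port by the rank that torchrun exposes in the LOCAL_RANK environment variable, roughly:

# Hypothetical change around web_demo_hf.py line 135 (demo and args come from that script;
# this is a sketch, not the repo's code):
import os

local_rank = int(os.environ.get("LOCAL_RANK", "0"))   # torchrun sets LOCAL_RANK to 0..3
demo.queue().launch(share=args.share, inbrowser=True,
                    server_name='0.0.0.0',
                    server_port=9088 + local_rank)     # ports 9088, 9089, 9090, 9091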

corkiyao commented 2 weeks ago

The RAM fills up; there is still plenty of swap space left, but the process gets killed anyway.

Hey, have you managed to get multi-GPU deployment and inference working by now? I'm hitting the same problem and can't solve it.
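
Not an official answer, but one way to spread a single copy of the model over the four 8 GB cards, instead of launching four copies with torchrun, is the Hugging Face accelerate device_map path. This is only a sketch under assumptions not confirmed in this thread: that transformers and accelerate are installed, and that VisualGLM-6B's trust_remote_code modeling works correctly when its layers are dispatched across GPUs (issue #102 discusses the same question):

# Minimal sketch, not the repo's documented recipe: shard the fp16 HF checkpoint over all
# visible GPUs. Four 8 GB cards may be enough to skip --quant 4 entirely, but that is untested here.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "THUDM/visualglm-6b",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",    # requires accelerate; places layers on cuda:0..cuda:3 automatically
).eval()

If that loads, the demo would then be run as a single plain python process (no torchrun), so only one Gradio server ever binds the port.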