THUDM / VisualGLM-6B

Chinese and English multimodal conversational language model
Apache License 2.0

I ran into a problem when using multiple GPUs #274

Open 666webai opened 11 months ago

666webai commented 11 months ago

Ubuntu 22.04, environment managed with Anaconda, Python 3.10, CUDA 11.8, hardware: 4x P104-100 (8 GB each). The model has already been downloaded locally.

Referring to https://github.com/THUDM/VisualGLM-6B/issues/102 — can VisualGLM-6B be deployed across multiple GPUs?

Because the cards have limited VRAM, I cannot run the unquantized model directly. In the /VisualGLM-6B directory, python web_demo_hf.py --quant 4 --share runs fine — up to this point there is no problem and the web page opens.

But when I tried to deploy on multiple GPUs, I hit a problem:

torchrun --nnode 1 --nproc_per_node= 4 web_demo_hf.py --quant 4 --share

Traceback (most recent call last):
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 632, in determine_local_world_size
    return int(nproc_per_node)
ValueError: invalid literal for int() with base 10: ''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/wtchen/anaconda3/envs/v6b2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 784, in run
    config, cmd, cmd_args = config_from_args(args)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 691, in config_from_args
    nproc_per_node = determine_local_world_size(args.nproc_per_node)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 650, in determine_local_world_size
    raise ValueError(f"Unsupported nproc_per_node value: {nproc_per_node}") from e
ValueError: Unsupported nproc_per_node value:

1049451037 commented 11 months ago

There is an extra space between the equals sign and the 4, I think.

666webai commented 11 months ago

OK, I'll give that a try.

666webai commented 11 months ago

torchrun --nnode 1 --nproc_per_node=4 web_demo_hf.py --quant 4 --share

WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[2023-09-20 02:54:07,442] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 02:54:07,442] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 02:54:07,443] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 02:54:07,443] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 02:54:16,232] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-09-20 02:54:16,283] [INFO] [RANK 0] You are using model-only mode. For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.

/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
Loading checkpoint shards:  60%|████████████████████████████████████████████████████▏ | 3/5 [01:33<01:05, 32.98s/it]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 913 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 914 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 915 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 916) of binary: /home/wtchen/anaconda3/envs/v6b2/bin/python
Traceback (most recent call last):
  File "/home/wtchen/anaconda3/envs/v6b2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

web_demo_hf.py FAILED

Failures:

----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-20_02:56:44
  host      : wtcai4x104
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 916)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 916

666webai commented 11 months ago

The RAM fills up; there is still plenty of swap space left, but the process gets killed anyway.
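
For what it's worth, exit code -9 is SIGKILL, which on Linux usually means the kernel OOM killer; it can fire even with swap free if pages cannot be reclaimed fast enough or a cgroup limit is hit. The log above says "initializing model parallel with size 1" and "model-only mode", so torchrun is simply starting four independent copies of the demo, and each copy loads the full checkpoint shards into host RAM before quantizing. A rough sanity check in Python — a sketch only; the ~16 GB shard size and the psutil dependency are assumptions, not numbers from this thread:

# Rough sanity check (hypothetical, not from the repo): will N independent copies of the
# demo fit in host RAM while the fp16 checkpoint shards are being loaded?
import psutil  # assumption: psutil is installed (pip install psutil)

CHECKPOINT_BYTES = 16 * 1024**3   # assumption: roughly 16 GB of fp16 shards for visualglm-6b
NPROC = 4                         # one copy per process started by torchrun --nproc_per_node=4

available = psutil.virtual_memory().available
needed = CHECKPOINT_BYTES * NPROC  # every process loads the full shards before quantizing
print(f"available RAM: {available / 1024**3:.1f} GiB, rough peak need: {needed / 1024**3:.1f} GiB")
if needed > available:
    print("loading will likely trigger the OOM killer even though swap looks free")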

666webai commented 11 months ago

Removing the space did indeed solve part of the problem. I then mounted a larger swap file, which got past the kill, but it still fails. I'll try again later.

torchrun --nnode 1 --nproc_per_node=4 web_demo_hf.py --quant 4 --share

WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[2023-09-20 04:27:47,668] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 04:27:47,668] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 04:27:47,668] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 04:27:47,668] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 04:27:56,387] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-09-20 04:27:56,431] [INFO] [RANK 0] You are using model-only mode. For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 5/5 [06:11<00:00, 74.26s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 5/5 [06:11<00:00, 74.28s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 5/5 [06:12<00:00, 74.41s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 5/5 [06:12<00:00, 74.43s/it]

/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:104: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
  image_path = gr.Image(type="filepath", label="Image Prompt", value=None).style(height=504)
/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:106: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
  chatbot = gr.Chatbot().style(height=480)
/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:116: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
  user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=4).style(
3.44.4
Running on local URL: http://0.0.0.0:9088
/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:104: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
  image_path = gr.Image(type="filepath", label="Image Prompt", value=None).style(height=504)
/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:106: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
  chatbot = gr.Chatbot().style(height=480)
/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:116: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
  user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=4).style(
3.44.4
Traceback (most recent call last):
  File "/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py", line 143, in <module>
    main(args)
  File "/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py", line 135, in main
    demo.queue().launch(share=args.share, inbrowser=True, server_name='0.0.0.0', server_port=9088)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/gradio/blocks.py", line 1907, in launch
    ) = networking.start_server(
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/gradio/networking.py", line 207, in start_server
    raise OSError(
OSError: Cannot find empty port in range: 9088-9088. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the server_port parameter to launch().
[the same GradioDeprecationWarning lines and the same OSError traceback are printed twice more, once by each of the two remaining processes]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1657 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1655) of binary: /home/wtchen/anaconda3/envs/v6b2/bin/python
Traceback (most recent call last):
  File "/home/wtchen/anaconda3/envs/v6b2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

web_demo_hf.py FAILED

Failures:
[1]:
  time      : 2023-09-20_04:43:15
  host      : wtcai4x104
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1656)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-09-20_04:43:15
  host      : wtcai4x104
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1658)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2023-09-20_04:43:15
  host      : wtcai4x104
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1655)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
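
Side note on what this second run shows: "initializing model parallel with size 1" and "model-only mode" mean no model parallelism is actually happening. torchrun just runs four full copies of web_demo_hf.py; one of them binds port 9088 first ("Running on local URL: http://0.0.0.0:9088") and the other three crash with "OSError: Cannot find empty port in range: 9088-9088". If the goal were only to stop that crash (it still leaves four independent copies of the model, one per GPU), a hypothetical tweak would be to offset the hard-coded port by the rank that torchrun exposes in the LOCAL_RANK environment variable, roughly:

# Hypothetical change around web_demo_hf.py line 135 (demo and args come from that script;
# this is a sketch, not the repo's code):
import os

local_rank = int(os.environ.get("LOCAL_RANK", "0"))   # torchrun sets LOCAL_RANK to 0..3
demo.queue().launch(share=args.share, inbrowser=True,
                    server_name='0.0.0.0',
                    server_port=9088 + local_rank)     # ports 9088, 9089, 9090, 9091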

corkiyao commented 2 weeks ago

The RAM fills up; there is still plenty of swap space left, but the process gets killed anyway.

Hey, have you managed to get multi-GPU deployment and inference working by now? I'm hitting the same problem and can't solve it.
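
Not an official answer, but one way to spread a single copy of the model over the four 8 GB cards, instead of launching four copies with torchrun, is the Hugging Face accelerate device_map path. This is only a sketch under assumptions not confirmed in this thread: that transformers and accelerate are installed, and that VisualGLM-6B's trust_remote_code modeling works correctly when its layers are dispatched across GPUs (issue #102 discusses the same question):

# Minimal sketch, not the repo's documented recipe: shard the fp16 HF checkpoint over all
# visible GPUs. Four 8 GB cards may be enough to skip --quant 4 entirely, but that is untested here.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "THUDM/visualglm-6b",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",    # requires accelerate; places layers on cuda:0..cuda:3 automatically
).eval()

If that loads, the demo would then be run as a single plain python process (no torchrun), so only one Gradio server ever binds the port.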