WisdomShell / codeshell-vscode

An intelligent coding assistant plugin for Visual Studio Code, developed based on CodeShell
Apache License 2.0

Is dual-GPU deployment supported? #37

Closed firslov closed 8 months ago

firslov commented 8 months ago
sudo docker run --gpus 'all' --shm-size 1g -p 9090:80 -v /home/llh/model_hub/WisdomShell_CodeShell-7B-Chat:/data --env LOG_LEVEL="info,text_generation_router=debug" ghcr.nju.edu.cn/huggingface/text-generation-inference:1.0.3 --model-id /data --num-shard 2 --max-total-tokens 5000 --max-input-length 4096 --max-stop-sequences 12 --trust-remote-code

The machine has two RTX 6000 GPUs and CUDA 12.2. Running the command fails with the following error:

2023-10-25T03:43:06.938048Z  INFO text_generation_launcher: Args { model_id: "/data", revision: None, validation_workers: 2, sharded: None, num_shard: Some(2), quantize: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 12, max_top_n_tokens: 5, max_input_length: 4096, max_total_tokens: 5000, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "02da084c587e", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-10-25T03:43:06.938115Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `/data` do not contain malicious code.
2023-10-25T03:43:06.938126Z  INFO text_generation_launcher: Sharding model on 2 processes
2023-10-25T03:43:06.938328Z  INFO download: text_generation_launcher: Starting download process.
2023-10-25T03:43:09.670454Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-10-25T03:43:10.042577Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-10-25T03:43:10.042982Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-25T03:43:10.043031Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2023-10-25T03:43:12.861796Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-10-25T03:43:12.881244Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-10-25T03:43:12.933483Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 266, in get_model
    raise ValueError("sharded is not supported for AutoModel")
ValueError: sharded is not supported for AutoModel

2023-10-25T03:43:12.952449Z ERROR text_generation_launcher: Error when initializing model
(identical traceback to the one above, ending in ValueError: sharded is not supported for AutoModel)

2023-10-25T03:43:13.348876Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 266, in get_model
    raise ValueError("sharded is not supported for AutoModel")

ValueError: sharded is not supported for AutoModel
 rank=0
2023-10-25T03:43:13.446787Z ERROR text_generation_launcher: Shard 0 failed to start
2023-10-25T03:43:13.446824Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
2023-10-25T03:43:13.448962Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
(identical traceback to rank=0 above) rank=1
pk49800 commented 8 months ago

Will deployment on a single AMD GPU produce errors?

ZZR0 commented 8 months ago

Hi, thank you very much for your support of the CodeShell project. At present, the official TGI project does not yet natively support the CodeShell model, so multi-GPU deployment of CodeShell is not possible with the official image.

If you need to deploy the CodeShell model across multiple GPUs, we invite you to try our TGI-CodeShell branch, which natively integrates the CodeShell model with the TGI inference framework to support multi-GPU deployment.

Following the TGI-CodeShell documentation, you can build the TGI environment locally or build the TGI-CodeShell Docker image yourself. We also provide a prebuilt TGI-CodeShell image, zzr0/text-generation-inference:codeshell-1.1.1, which you can use directly if your environment meets the requirements.
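
For reference, a multi-GPU launch with the prebuilt image could look like the following sketch; the host path, port, and shard count are examples taken from this thread and should be adapted to your setup:

# port, model path, and --num-shard below are examples; adjust for your environment
sudo docker run --gpus all --shm-size 1g -p 6668:80 -v /path/to/CodeShell-7B-Chat:/data zzr0/text-generation-inference:codeshell-1.1.1 --model-id /data --num-shard 2 --trust-remote-code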

firslov commented 8 months ago

Will deployment on a single AMD GPU produce errors?

I haven't tested AMD cards; a single Nvidia card works fine.

ZZR0 commented 8 months ago

Will deployment on a single AMD GPU produce errors?

Hi, according to this issue in the TGI repository, TGI does not appear to support AMD devices yet.

firslov commented 8 months ago

[quoting ZZR0's reply above about the TGI-CodeShell branch and the prebuilt zzr0/text-generation-inference:codeshell-1.1.1 image]

sudo docker run --gpus all --shm-size 1g -p 6668:80 -v /home/llh/model_hub/WisdomShell_CodeShell-7B-Chat:/data ghcr.io/zzr0/text-generation-inference:codeshell-1.1.1 --model-id WisdomShell/CodeShell-7B-Chat --trust-remote-code
Unable to find image 'ghcr.io/zzr0/text-generation-inference:codeshell-1.1.1' locally
docker: Error response from daemon: Head "https://ghcr.io/v2/zzr0/text-generation-inference/manifests/codeshell-1.1.1": denied.

Does the Docker image you mentioned have to be built manually? I couldn't find a corresponding Dockerfile in that repository.

ZZR0 commented 8 months ago

[quoting firslov's comment above: pulling ghcr.io/zzr0/text-generation-inference:codeshell-1.1.1 failed with "denied", and asking whether the image must be built manually]

Sorry, please use the zzr0/text-generation-inference:codeshell-1.1.1 image directly instead of ghcr.io/zzr0/text-generation-inference:codeshell-1.1.1.
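
In other words, pull the image from Docker Hub rather than GHCR, for example:

sudo docker pull zzr0/text-generation-inference:codeshell-1.1.1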

firslov commented 8 months ago

[quoting the exchange above: use the zzr0/text-generation-inference:codeshell-1.1.1 image from Docker Hub instead of ghcr.io]

sudo docker run --gpus all --shm-size 1g -p 6668:80 -v /home/llh/model_hub/WisdomShell_CodeShell-7B-Chat:/data zzr0/text-generation-inference:codeshell-1.1.1 --model-id /data --num-shard 2 --trust-remote-code

error message:
...
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 72, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 674, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

2023-11-02T12:41:07.657550Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
Error: Warmup(Generation("Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`"))
2023-11-02T12:41:07.755521Z ERROR text_generation_launcher: Webserver Crashed
2023-11-02T12:41:07.755558Z  INFO text_generation_launcher: Shutting down shards
2023-11-02T12:41:08.022896Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=1
Error: WebserverFailed
2023-11-02T12:41:08.080815Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0

When I run this image on two GPUs, it fails with "not enough memory", but both cards have 24 GB, and monitoring shows the error occurs when VRAM usage is only around 8 GB.

ZZR0 commented 8 months ago

[quoting firslov's comment above: the dual-GPU run fails during warmup with "Not enough memory to handle 4096 prefill tokens" even though both cards have 24 GB]

Sorry about that. Could you pull the latest zzr0/text-generation-inference:codeshell-1.1.1 image and try again? The previous image still had an issue with multi-GPU deployment, which we have fixed in the latest image (sha256:e5d5e1fd...).
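
As an aside, if memory is still tight after updating, the warmup error itself points at the --max-batch-prefill-tokens launcher flag (it defaults to 4096 in the startup args logged earlier); lowering it reduces the prefill memory budget, roughly like this:

# 2048 is only an illustrative value; tune it for your GPUs
sudo docker run --gpus all --shm-size 1g -p 6668:80 -v /home/llh/model_hub/WisdomShell_CodeShell-7B-Chat:/data zzr0/text-generation-inference:codeshell-1.1.1 --model-id /data --num-shard 2 --max-batch-prefill-tokens 2048 --trust-remote-code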

firslov commented 8 months ago

[quoting ZZR0's reply above: pull the latest zzr0/text-generation-inference:codeshell-1.1.1 image, which fixes the multi-GPU issue]

The new image runs now, but each of the two GPUs uses about 19 GB of VRAM. Is this data parallelism? Is model parallelism supported?

ZZR0 commented 8 months ago

[quoting firslov's comment above: the new image runs, but each GPU uses about 19 GB of VRAM]

Hi, CodeShell's current multi-GPU inference scheme is already model parallelism. The high VRAM usage is because TGI pre-allocates nearly all available GPU memory at startup to keep inference stable.
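
If you would rather not let TGI claim almost all of the VRAM, the launcher also exposes a --cuda-memory-fraction option (it appears as cuda_memory_fraction: 1.0 in the startup args logged earlier); a sketch with an illustrative 60% cap per GPU:

# 0.6 is only an example cap; tune it for your workload
sudo docker run --gpus all --shm-size 1g -p 6668:80 -v /home/llh/model_hub/WisdomShell_CodeShell-7B-Chat:/data zzr0/text-generation-inference:codeshell-1.1.1 --model-id /data --num-shard 2 --cuda-memory-fraction 0.6 --trust-remote-code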

ai907303458 commented 3 weeks ago

Hi, is there any restriction on downloading the latest zzr0/text-generation-inference:shell-1.4.0 image? I get the error: error pulling image configuration: download failed after attempts=6: dial tcp 108.160.169.185:443: i/o timeout
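
For what it's worth, that i/o timeout usually means Docker Hub is unreachable from your network rather than a restriction on the image itself. A common workaround is to configure a Docker Hub registry mirror (the mirror URL below is a placeholder; substitute one reachable from your network) and restart the daemon:

# contents of /etc/docker/daemon.json (mirror URL is a placeholder)
{
  "registry-mirrors": ["https://<your-mirror>"]
}

# then restart the Docker daemon
sudo systemctl restart docker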