WisdomShell / codeshell-vscode

An intelligent coding assistant plugin for Visual Studio Code, developed based on CodeShell
Apache License 2.0

Is dual-GPU deployment supported? #37

Closed firslov closed 8 months ago

firslov commented 8 months ago
sudo docker run --gpus 'all' --shm-size 1g -p 9090:80 -v /home/llh/model_hub/WisdomShell_CodeShell-7B-Chat:/data --env LOG_LEVEL="info,text_generation_router=debug" ghcr.nju.edu.cn/huggingface/text-generation-inference:1.0.3 --model-id /data --num-shard 2 --max-total-tokens 5000 --max-input-length 4096 --max-stop-sequences 12 --trust-remote-code

The machine has two RTX 6000 GPUs and CUDA 12.2. Running the command fails with the following error:

2023-10-25T03:43:06.938048Z  INFO text_generation_launcher: Args { model_id: "/data", revision: None, validation_workers: 2, sharded: None, num_shard: Some(2), quantize: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 12, max_top_n_tokens: 5, max_input_length: 4096, max_total_tokens: 5000, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "02da084c587e", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-10-25T03:43:06.938115Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `/data` do not contain malicious code.
2023-10-25T03:43:06.938126Z  INFO text_generation_launcher: Sharding model on 2 processes
2023-10-25T03:43:06.938328Z  INFO download: text_generation_launcher: Starting download process.
2023-10-25T03:43:09.670454Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-10-25T03:43:10.042577Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-10-25T03:43:10.042982Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-25T03:43:10.043031Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2023-10-25T03:43:12.861796Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-10-25T03:43:12.881244Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-10-25T03:43:12.933483Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 266, in get_model
    raise ValueError("sharded is not supported for AutoModel")
ValueError: sharded is not supported for AutoModel

2023-10-25T03:43:12.952449Z ERROR text_generation_launcher: Error when initializing model
(identical traceback to the one above, ending in ValueError: sharded is not supported for AutoModel)

2023-10-25T03:43:13.348876Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 266, in get_model
    raise ValueError("sharded is not supported for AutoModel")

ValueError: sharded is not supported for AutoModel
 rank=0
2023-10-25T03:43:13.446787Z ERROR text_generation_launcher: Shard 0 failed to start
2023-10-25T03:43:13.446824Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
2023-10-25T03:43:13.448962Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
(identical traceback to rank=0 above) rank=1
pk49800 commented 8 months ago

Will deployment on a single AMD GPU produce errors?

ZZR0 commented 8 months ago

Hi, thank you very much for your support of the CodeShell project. At present, the official TGI project does not yet natively support the CodeShell model, so multi-GPU deployment of CodeShell is not possible with the official image.

If you need to deploy the CodeShell model across multiple GPUs, we invite you to try our TGI-CodeShell branch, which natively integrates the CodeShell model with the TGI inference framework to support multi-GPU deployment.

Following the TGI-CodeShell documentation, you can build the TGI environment locally or build the TGI-CodeShell Docker image yourself. We also provide a prebuilt TGI-CodeShell image, zzr0/text-generation-inference:codeshell-1.1.1, which you can use directly if your environment meets the requirements.
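
For reference, a multi-GPU launch with the prebuilt image could look like the following sketch; the host path, port, and shard count are examples taken from this thread and should be adapted to your setup:

# port, model path, and --num-shard below are examples; adjust for your environment
sudo docker run --gpus all --shm-size 1g -p 6668:80 -v /path/to/CodeShell-7B-Chat:/data zzr0/text-generation-inference:codeshell-1.1.1 --model-id /data --num-shard 2 --trust-remote-code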

firslov commented 8 months ago

Will deployment on a single AMD GPU produce errors?

I haven't tested AMD cards; a single Nvidia card works fine.

ZZR0 commented 8 months ago

Will deployment on a single AMD GPU produce errors?

Hi, according to this issue in the TGI repository, TGI does not appear to support AMD devices yet.

firslov commented 8 months ago

[quoting ZZR0's reply above about the TGI-CodeShell branch and the prebuilt zzr0/text-generation-inference:codeshell-1.1.1 image]

sudo docker run --gpus all --shm-size 1g -p 6668:80 -v /home/llh/model_hub/WisdomShell_CodeShell-7B-Chat:/data ghcr.io/zzr0/text-generation-inference:codeshell-1.1.1 --model-id WisdomShell/CodeShell-7B-Chat --trust-remote-code
Unable to find image 'ghcr.io/zzr0/text-generation-inference:codeshell-1.1.1' locally
docker: Error response from daemon: Head "https://ghcr.io/v2/zzr0/text-generation-inference/manifests/codeshell-1.1.1": denied.

Does the Docker image you mentioned have to be built manually? I couldn't find a corresponding Dockerfile in that repository.

ZZR0 commented 8 months ago

[quoting firslov's comment above: pulling ghcr.io/zzr0/text-generation-inference:codeshell-1.1.1 failed with "denied", and asking whether the image must be built manually]

Sorry, please use the zzr0/text-generation-inference:codeshell-1.1.1 image directly instead of ghcr.io/zzr0/text-generation-inference:codeshell-1.1.1.
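
In other words, pull the image from Docker Hub rather than GHCR, for example:

sudo docker pull zzr0/text-generation-inference:codeshell-1.1.1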

firslov commented 8 months ago

[quoting the exchange above: use the zzr0/text-generation-inference:codeshell-1.1.1 image from Docker Hub instead of ghcr.io]

sudo docker run --gpus all --shm-size 1g -p 6668:80 -v /home/llh/model_hub/WisdomShell_CodeShell-7B-Chat:/data zzr0/text-generation-inference:codeshell-1.1.1 --model-id /data --num-shard 2 --trust-remote-code

error message:
...
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 72, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 674, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

2023-11-02T12:41:07.657550Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
Error: Warmup(Generation("Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`"))
2023-11-02T12:41:07.755521Z ERROR text_generation_launcher: Webserver Crashed
2023-11-02T12:41:07.755558Z  INFO text_generation_launcher: Shutting down shards
2023-11-02T12:41:08.022896Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=1
Error: WebserverFailed
2023-11-02T12:41:08.080815Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0

When I run this image on two GPUs, it fails with "not enough memory", but both cards have 24 GB, and monitoring shows the error occurs when VRAM usage is only around 8 GB.

ZZR0 commented 8 months ago

[quoting firslov's comment above: the dual-GPU run fails during warmup with "Not enough memory to handle 4096 prefill tokens" even though both cards have 24 GB]

Sorry about that. Could you pull the latest zzr0/text-generation-inference:codeshell-1.1.1 image and try again? The previous image still had an issue with multi-GPU deployment, which we have fixed in the latest image (sha256:e5d5e1fd...).
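
As an aside, if memory is still tight after updating, the warmup error itself points at the --max-batch-prefill-tokens launcher flag (it defaults to 4096 in the startup args logged earlier); lowering it reduces the prefill memory budget, roughly like this:

# 2048 is only an illustrative value; tune it for your GPUs
sudo docker run --gpus all --shm-size 1g -p 6668:80 -v /home/llh/model_hub/WisdomShell_CodeShell-7B-Chat:/data zzr0/text-generation-inference:codeshell-1.1.1 --model-id /data --num-shard 2 --max-batch-prefill-tokens 2048 --trust-remote-code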

firslov commented 8 months ago

[quoting ZZR0's reply above: pull the latest zzr0/text-generation-inference:codeshell-1.1.1 image, which fixes the multi-GPU issue]

The new image runs now, but each of the two GPUs uses about 19 GB of VRAM. Is this data parallelism? Is model parallelism supported?

ZZR0 commented 8 months ago

[quoting firslov's comment above: the new image runs, but each GPU uses about 19 GB of VRAM]

Hi, CodeShell's current multi-GPU inference scheme is already model parallelism. The high VRAM usage is because TGI pre-allocates nearly all available GPU memory at startup to keep inference stable.
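
If you would rather not let TGI claim almost all of the VRAM, the launcher also exposes a --cuda-memory-fraction option (it appears as cuda_memory_fraction: 1.0 in the startup args logged earlier); a sketch with an illustrative 60% cap per GPU:

# 0.6 is only an example cap; tune it for your workload
sudo docker run --gpus all --shm-size 1g -p 6668:80 -v /home/llh/model_hub/WisdomShell_CodeShell-7B-Chat:/data zzr0/text-generation-inference:codeshell-1.1.1 --model-id /data --num-shard 2 --cuda-memory-fraction 0.6 --trust-remote-code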

ai907303458 commented 3 weeks ago

Hi, is there any restriction on downloading the latest zzr0/text-generation-inference:shell-1.4.0 image? I get the error: error pulling image configuration: download failed after attempts=6: dial tcp 108.160.169.185:443: i/o timeout
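
For what it's worth, that i/o timeout usually means Docker Hub is unreachable from your network rather than a restriction on the image itself. A common workaround is to configure a Docker Hub registry mirror (the mirror URL below is a placeholder; substitute one reachable from your network) and restart the daemon:

# contents of /etc/docker/daemon.json (mirror URL is a placeholder)
{
  "registry-mirrors": ["https://<your-mirror>"]
}

# then restart the Docker daemon
sudo systemctl restart docker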