OrionStarAI / Orion

Orion-14B is a family of models built around a 14B-parameter multilingual foundation LLM, together with a series of derived models: a chat model, a long-context model, quantized models, a RAG fine-tuned model, and an agent fine-tuned model.
Apache License 2.0

Using text-generation-inference does not work #10

Closed · ychy00001 closed this issue 9 months ago

ychy00001 commented 9 months ago

I tried to use Text-Generation-Inference to start an inference server, but it fails with the error below (model: Orion-14B-Chat):

inference-server_1  | 2024-01-22T08:28:44.914823Z  INFO text_generation_launcher: Args { model_id: "/model", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 4096, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "89adb99ea1ea", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
inference-server_1  | 2024-01-22T08:28:44.914899Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `/model` do not contain malicious code.
inference-server_1  | 2024-01-22T08:28:44.914912Z  INFO text_generation_launcher: Sharding model on 4 processes
inference-server_1  | 2024-01-22T08:28:44.915120Z  INFO download: text_generation_launcher: Starting download process.
inference-server_1  | 2024-01-22T08:28:52.352531Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
inference-server_1  |
inference-server_1  | 2024-01-22T08:28:53.237442Z  INFO download: text_generation_launcher: Successfully downloaded weights.
inference-server_1  | 2024-01-22T08:28:53.237923Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
inference-server_1  | 2024-01-22T08:28:53.237926Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
inference-server_1  | 2024-01-22T08:28:53.237972Z  INFO shard-manager: text_generation_launcher: Starting shard rank=3
inference-server_1  | 2024-01-22T08:28:53.237974Z  INFO shard-manager: text_generation_launcher: Starting shard rank=2
inference-server_1  | 2024-01-22T08:29:01.310087Z  WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding
inference-server_1  |
inference-server_1  | 2024-01-22T08:29:01.800102Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2
inference-server_1  |
inference-server_1  | 2024-01-22T08:29:01.969078Z  WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding
inference-server_1  |
inference-server_1  | 2024-01-22T08:29:02.113851Z ERROR text_generation_launcher: Error when initializing model
inference-server_1  | Traceback (most recent call last):
inference-server_1  |   File "/opt/conda/bin/text-generation-server", line 8, in <module>
inference-server_1  |     sys.exit(app())
inference-server_1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
inference-server_1  |     return get_command(self)(*args, **kwargs)
inference-server_1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
inference-server_1  |     return self.main(*args, **kwargs)
inference-server_1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
inference-server_1  |     return _main(
inference-server_1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
inference-server_1  |     rv = self.invoke(ctx)
inference-server_1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
inference-server_1  |     return _process_result(sub_ctx.command.invoke(sub_ctx))
inference-server_1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
inference-server_1  |     return ctx.invoke(self.callback, **ctx.params)
inference-server_1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
inference-server_1  |     return __callback(*args, **kwargs)
inference-server_1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
inference-server_1  |     return callback(**use_params)  # type: ignore
inference-server_1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
inference-server_1  |     server.serve(
inference-server_1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve
inference-server_1  |     asyncio.run(
inference-server_1  |   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
inference-server_1  |     return loop.run_until_complete(main)
inference-server_1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
inference-server_1  |     self.run_forever()
inference-server_1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
inference-server_1  |     self._run_once()
inference-server_1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
inference-server_1  |     handle._run()
inference-server_1  |   File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
inference-server_1  |     self._context.run(self._callback, *self._args)
inference-server_1  | > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner
inference-server_1  |     model = get_model(
inference-server_1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 339, in get_model
inference-server_1  |     raise NotImplementedError("sharded is not supported for AutoModel")
inference-server_1  | NotImplementedError: sharded is not supported for AutoModel

... ...

inference-server_1  |  rank=0
inference-server_1  | 2024-01-22T08:29:03.383049Z ERROR text_generation_launcher: Shard 0 failed to start
inference-server_1  | 2024-01-22T08:29:03.383086Z  INFO text_generation_launcher: Shutting down shards
inference-server_1  | 2024-01-22T08:29:03.595368Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=2
inference-server_1  | 2024-01-22T08:29:03.607757Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=1
inference-server_1  | 2024-01-22T08:29:03.678409Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=3
inference-server_1  | Error: ShardCannotStart