NVIDIA / nim-deploy

A collection of YAML files, Helm Charts, Operator code, and guides to act as an example reference implementation for NVIDIA NIM deployment.
https://build.nvidia.com/
Apache License 2.0

Unknown RoPE scaling type {scaling_type} #60

Closed: test-1pro closed this issue 2 weeks ago

test-1pro commented 1 month ago

I tried using the specs below:

But this error occurred:

```
== NVIDIA Inference Microservice LLM NIM ==

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama-3_1-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/#:~:text=This%20license%20agreement%20(%E2%80%9CAgreement%E2%80%9D,algorithms%2C%20parameters%2C%20configuration%20files%2C).

ADDITIONAL INFORMATION: Llama 3.1 Community License Agreement, Built with Llama.

WARNING 07-31 06:02:32.825 caches.py:30] /mnt/models/cache is read-only, application may fail if model is not already present in cache
INFO 07-31 06:02:40.10 ngc_profile.py:222] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 07-31 06:02:40.10 ngc_profile.py:224] Detected 1 compatible profile(s).
INFO 07-31 06:02:40.11 ngc_injector.py:120] Valid profile: 5d32170c5db4f5df4ed38a91179afda2396797c1a6e62474318e1df405ea53ce (vllm-fp16-tp1) on GPUs [0]
INFO 07-31 06:02:40.11 ngc_injector.py:174] Selected profile: 5d32170c5db4f5df4ed38a91179afda2396797c1a6e62474318e1df405ea53ce (vllm-fp16-tp1)
INFO 07-31 06:02:40.26 ngc_injector.py:179] Profile metadata: feat_lora: false
INFO 07-31 06:02:40.26 ngc_injector.py:179] Profile metadata: llm_engine: vllm
INFO 07-31 06:02:40.26 ngc_injector.py:179] Profile metadata: precision: fp16
INFO 07-31 06:02:40.26 ngc_injector.py:179] Profile metadata: tp: 1
INFO 07-31 06:02:40.26 ngc_injector.py:199] Preparing model workspace. This step might download additional files to run the model.
INFO 07-31 06:02:40.229 ngc_injector.py:214] Model workspace is now ready. It took 0.203 seconds
INFO 07-31 06:02:40.230 launch.py:46] engine_world_size=1
INFO 07-31 06:02:40.230 launch.py:92] running command ['/opt/nim/llm/.venv/bin/python3', '-m', 'vllm_nvext.entrypoints.openai.api_server', '--served-model-name', 'meta/llama-3_1-8b-instruct', '--async-engine-args', '{"model": "/tmp/meta--llama-3_1-8b-instruct-ue01g3m7", "served_model_name": null, "tokenizer": "/tmp/meta--llama-3_1-8b-instruct-ue01g3m7", "skip_tokenizer_init": false, "tokenizer_mode": "auto", "trust_remote_code": false, "download_dir": null, "load_format": "auto", "dtype": "auto", "kv_cache_dtype": "auto", "quantization_param_path": null, "seed": 0, "max_model_len": null, "worker_use_ray": false, "distributed_executor_backend": "ray", "pipeline_parallel_size": 1, "tensor_parallel_size": 1, "max_parallel_loading_workers": null, "block_size": 16, "enable_prefix_caching": false, "disable_sliding_window": false, "use_v2_block_manager": false, "swap_space": 4, "gpu_memory_utilization": 0.9, "max_num_batched_tokens": null, "max_num_seqs": 256, "max_logprobs": 20, "disable_log_stats": false, "revision": null, "code_revision": null, "rope_scaling": null, "rope_theta": null, "tokenizer_revision": null, "quantization": null, "enforce_eager": false, "max_context_len_to_capture": null, "max_seq_len_to_capture": 8192, "disable_custom_all_reduce": false, "tokenizer_pool_size": 0, "tokenizer_pool_type": "ray", "tokenizer_pool_extra_config": null, "enable_lora": false, "max_loras": 8, "max_lora_rank": 32, "fully_sharded_loras": false, "lora_extra_vocab_size": 256, "long_lora_scaling_factors": null, "lora_dtype": "auto", "max_cpu_loras": 16, "peft_source": null, "peft_refresh_interval": null, "device": "auto", "ray_workers_use_nsight": false, "num_gpu_blocks_override": null, "num_lookahead_slots": 0, "model_loader_extra_config": null, "preemption_mode": null, "image_input_type": null, "image_token_id": null, "image_input_shape": null, "image_feature_size": null, "image_processor": null, "image_processor_revision": null, "disable_image_processor": false, "scheduler_delay_factor": 0.0, "enable_chunked_prefill": false, "guided_decoding_backend": "lm-format-enforcer", "speculative_model": null, "speculative_draft_tensor_parallel_size": null, "num_speculative_tokens": null, "speculative_max_model_len": null, "speculative_disable_by_batch_size": null, "ngram_prompt_lookup_max": null, "ngram_prompt_lookup_min": null, "qlora_adapter_name_or_path": null, "otlp_traces_endpoint": null, "engine_use_ray": false, "disable_log_requests": true, "max_log_len": null, "selected_gpus": [{"name": "NVIDIA L40", "device_index": 0, "device_id": "26b5:10de", "total_memory": 51527024640, "free_memory": 50784501760, "used_memory": 7798784, "reserved_memory": 734724096, "family": "L40S"}]}']
[1722405763.279727] [llama-3-1-8b-predictor-00001-deployment-749cc774d-gmvgs:49 :0] parser.c:2305 UCX WARN unused environment variables: UCX_HOME; UCX_DIR (maybe: UCX_TLS?)
[1722405763.279727] [llama-3-1-8b-predictor-00001-deployment-749cc774d-gmvgs:49 :0] parser.c:2305 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2024-07-31 06:02:49,171 [INFO] PyTorch version 2.3.0 available.
2024-07-31 06:02:52,439 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-07-31 06:02:52,439 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-07-31 06:02:52,447 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.11.1.dev20240721
INFO 07-31 06:02:52.492 api_server.py:625] NIM LLM API version 1.0.0
2024-07-31 06:02:56,089 INFO worker.py:1749 -- Started a local Ray instance.
INFO 07-31 06:02:57.440 llm_engine.py:164] Initializing an LLM engine (v0.5.0.post1) with config: model='/tmp/meta--llama-3_1-8b-instruct-ue01g3m7', speculative_config=None, tokenizer='/tmp/meta--llama-3_1-8b-instruct-ue01g3m7', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='lm-format-enforcer'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/tmp/meta--llama-3_1-8b-instruct-ue01g3m7)
WARNING 07-31 06:02:57.842 logging.py:313] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ERROR 07-31 06:03:02.136 worker_base.py:325] Error executing method load_model. This might cause deadlock in distributed execution.
Traceback (most recent call last):
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 317, in execute_method
    return executor(*args, **kwargs)
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 131, in load_model
    self.model_runner.load_model()
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 208, in load_model
    self.model = get_model(
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 278, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 113, in _initialize_model
    return model_class(config=model_config.hf_config,
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 345, in __init__
    self.model = LlamaModel(config,
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 264, in __init__
    self.layers = nn.ModuleList([
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 265, in <listcomp>
    LlamaDecoderLayer(config=config,
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 190, in __init__
    self.self_attn = LlamaAttention(
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 139, in __init__
    self.rotary_emb = get_rope(
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding.py", line 860, in get_rope
    raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
ValueError: Unknown RoPE scaling type extended
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/opt/nim/llm/vllm_nvext/entrypoints/openai/api_server.py", line 702, in <module>
[rank0]:     engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
[rank0]:   File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine_factory.py", line 33, in from_engine_args
[rank0]:     engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 429, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 363, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 505, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 233, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 315, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 188, in _init_workers_ray
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 261, in _run_workers
[rank0]:     driver_worker_output = self.driver_worker.execute_method(
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 326, in execute_method
[rank0]:     raise e
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 317, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 131, in load_model
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 208, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 278, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 113, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 345, in __init__
[rank0]:     self.model = LlamaModel(config,
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 264, in __init__
[rank0]:     self.layers = nn.ModuleList([
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 265, in <listcomp>
[rank0]:     LlamaDecoderLayer(config=config,
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 190, in __init__
[rank0]:     self.self_attn = LlamaAttention(
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 139, in __init__
[rank0]:     self.rotary_emb = get_rope(
[rank0]:   File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding.py", line 860, in get_rope
[rank0]:     raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
[rank0]: ValueError: Unknown RoPE scaling type extended
```
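
For context, the `ValueError` at the bottom of both tracebacks comes from vLLM's RoPE factory: `get_rope()` dispatches on the `type` field of the model's `rope_scaling` configuration and raises for any value it does not recognize. Llama 3.1 introduced a new RoPE scaling scheme (published upstream as `rope_type: "llama3"`; it surfaces in this log as `extended`), which the vLLM 0.5.0.post1 build inside the 1.0.0 image predates. A minimal sketch of that dispatch, simplified for illustration rather than copied from the vLLM source:

```python
# Simplified sketch of how vLLM's get_rope() rejects unknown rope_scaling
# types (illustrative only; the real code lives in
# vllm/model_executor/layers/rotary_embedding.py, and the supported set
# below is an assumption about an older build).

KNOWN_SCALING_TYPES = {"linear", "dynamic", "yarn"}

def get_rope(rope_scaling: dict | None) -> str:
    if rope_scaling is None:
        return "standard rotary embedding"
    scaling_type = rope_scaling["type"]
    if scaling_type not in KNOWN_SCALING_TYPES:
        # This is the exact failure in the log: the Llama 3.1 config declares
        # a scaling type this engine build has never heard of.
        raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
    return f"{scaling_type} rotary embedding"

get_rope({"type": "extended"})  # raises: Unknown RoPE scaling type extended
```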

supertetelman commented 1 month ago

Hello, thanks for opening this issue and bringing it to our attention. This is an issue with the vLLM profile of the Llama 3.1 8B LLM NIM.

We've noted this on the official release notes page (https://docs.nvidia.com/nim/large-language-models/latest/release-notes.html#known-issues): "vLLM is not currently supported on Llama 3.1 models." Please stay tuned for updates.
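
Until a fixed profile ships, one way to catch this class of engine/checkpoint mismatch before rollout is to inspect the packaged model config and compare its `rope_scaling` type against what the engine build accepts. A hypothetical preflight sketch (the supported set and the config path are assumptions for illustration, not part of NIM):

```python
import json
from pathlib import Path

# Scaling types an older vLLM build is assumed to accept (illustrative only).
SUPPORTED_SCALING_TYPES = {"linear", "dynamic", "yarn"}

def check_rope_scaling(config_path: str) -> None:
    """Fail fast if the model's rope_scaling type would be rejected at load time."""
    config = json.loads(Path(config_path).read_text())
    rope_scaling = config.get("rope_scaling")
    if rope_scaling is None:
        return  # no scaling configured, nothing to check
    # Newer Hugging Face configs use "rope_type"; older ones use "type".
    scaling_type = rope_scaling.get("rope_type") or rope_scaling.get("type")
    if scaling_type not in SUPPORTED_SCALING_TYPES:
        raise SystemExit(
            f"rope_scaling type {scaling_type!r} is not supported by this engine "
            "build; select a non-vLLM profile or wait for an updated image."
        )

# Example usage against the extracted model workspace seen in the log:
# check_rope_scaling("/tmp/meta--llama-3_1-8b-instruct-ue01g3m7/config.json")
```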