bentoml / OpenLLM

Run any open-source LLM, such as Llama or Gemma, as an OpenAI-compatible API endpoint in the cloud.
https://bentoml.com
Apache License 2.0

RunnerService: MAX_MODEL_LEN is not reflected in llm._max_model_len #902

Closed hahmad2008 closed 9 months ago

hahmad2008 commented 9 months ago

Describe the bug

Problem: I can't pass these values into the service. Even the environment variable MAX_MODEL_LEN is not reflected in llm._max_model_len. I also tried editing the bento.yaml file (shown below) and then running bentoml serve on the service, but the problem remains: the value still does not reach llm._max_model_len.

service: generated_mistral_service:svc
name: mymodel-service
version: 12345
bentoml_version: 1.1.11
creation_time: '2024-02-12T13:11:19.273169+00:00'
labels:
  configuration: '{"generation_config":{"max_new_tokens":256,"min_length":0,"early_stopping":false,"num_beams":1,"num_beam_groups":1,"use_cache":true,"temperature":0.7,"top_k":40,"top_p":0.95,"typical_p":1.0,"epsilon_cutoff":0.0,"eta_cutoff"
  model_ids: '["HuggingFaceH4/zephyr-7b-alpha","HuggingFaceH4/zephyr-7b-beta","mistralai/Mistral-7B-Instruct-v0.2","mistralai/Mistral-7B-Instruct-v0.1","mistralai/Mistral-7B-v0.1"]'
  model_id: /root/OpenLLM/mymodel
  _type: mymodel
  _framework: vllm
  start_name: mistral
  base_name_or_path: /root/OpenLLM/mymodel
  bundler: openllm.bundle
  openllm_client_version: 0.4.45.dev2
  openllm_core_version: 0.4.45.dev2
  openllm_version: 0.4.45.dev2
models:
- tag: vllm-mymodel:12345
  module: openllm.serialisation.transformers
  creation_time: '2024-02-12T13:01:50.059463+00:00'
  alias: vllm-mymodel
runners:
- name: llm-mistral-runner
  runnable_type: vLLMRunnable
  embedded: false
  models:
  - vllm-mymodel:12345
  resource_config: null
apis:
- name: generate_v1
  input_type: JSON
  output_type: JSON
- name: generate_stream_v1
  input_type: JSON
  output_type: Text
- name: metadata_v1
  input_type: Text
  output_type: JSON
- name: helpers_messages_v1
  input_type: JSON
  output_type: Text
docker:
  distro: debian
  python_version: '3.11'
  cuda_version: null
  env:
    BENTOML_CONFIG_OPTIONS: tracing.sample_rate=1.0 api_server.max_runner_connections=25
      runners."llm-mistral-runner".batching.max_batch_size=128 api_server.traffic.timeout=36000000
      runners."llm-mistral-runner".traffic.timeout=36000000 runners."llm-mistral-runner".workers_per_resource=0.5
      api_server.http.cors.enabled=true api_server.http.cors.access_control_allow_origins="*"
      api_server.http.cors.access_control_allow_methods[0]="GET" api_server.http.cors.access_control_allow_methods[1]="OPTIONS"
      api_server.http.cors.access_control_allow_methods[2]="POST" api_server.http.cors.access_control_allow_methods[3]="HEAD"
      api_server.http.cors.access_control_allow_methods[4]="PUT"
    OPENLLM_MODEL_ID: /root/OpenLLM/mymodel
    BENTOML_DEBUG: 'False'
    OPENLLM_ADAPTER_MAP: 'null'
    OPENLLM_SERIALIZATION: safetensors
    OPENLLM_CONFIG: '''{"max_new_tokens":256,"min_length":0,"early_stopping":false,"num_beams":1,"num_beam_groups":1,"use_cache":true,"temperature":0.7,"top_k":40,"top_p":0.95,"typical_p":1.0,"epsilon_cutoff":0.0,"eta_cutoff":0.0,"diversity_
    BACKEND: vllm
    DTYPE: float16
    TRUST_REMOTE_CODE: 'False'
    MAX_MODEL_LEN: '1024'
    GPU_MEMORY_UTILIZATION: '0.95'
    NVIDIA_DRIVER_CAPABILITIES: compute,utility
  system_packages: null
  setup_script: null
  base_image: null
  dockerfile_template: null
python:
  requirements_txt: null
  packages:
  - scipy
  - bentoml[tracing]>=1.1.11,<1.2
  - openllm[vllm]>=0.4.44
  lock_packages: false
  index_url: null
  no_index: null
  trusted_host: null
  find_links: null
  extra_index_url: null
  pip_args: null
  wheels: null
conda:
  environment_yml: null
  channels: null
  dependencies: null
  pip: null
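
For reference, the value in question ultimately has to reach the vLLM engine: llm._max_model_len corresponds to vLLM's own max_model_len engine argument. Below is a minimal sketch of the expected end state, calling vLLM directly rather than through OpenLLM, with the model path and defaults copied from the bento above (illustrative only, not how OpenLLM wires it internally):

import os
from vllm import LLM

# Read the same environment variables the bento's docker.env section sets,
# then pass them to vLLM; this is the value llm._max_model_len should report.
max_model_len = int(os.environ.get("MAX_MODEL_LEN", "1024"))
gpu_mem_util = float(os.environ.get("GPU_MEMORY_UTILIZATION", "0.95"))

llm = LLM(
    model="/root/OpenLLM/mymodel",   # local model path from the bento labels
    max_model_len=max_model_len,
    gpu_memory_utilization=gpu_mem_util,
    dtype="float16",
)

# Attribute path may vary across vLLM versions; shown here only to illustrate
# where the configured context length ends up.
print(llm.llm_engine.model_config.max_model_len)  # expected: 1024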

To reproduce

No response

Logs

No response

Environment

$ bentoml -v
bentoml, version 1.1.11

$ openllm -v
openllm, 0.4.45.dev2 (compiled: False)
Python (CPython) 3.11.7

System information (Optional)

No response

hahmad2008 commented 9 months ago

@aarnphm could you please check?

hahmad2008 commented 9 months ago

It seems MAX_MODEL_LEN should be set at openllm build time.
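
If that is the case, a possible workaround (untested, and assuming the environment variables present at build time are captured into the bento's docker.env as shown in the bento.yaml above; the exact build flags may differ between OpenLLM versions) would be to export the value before building:

$ MAX_MODEL_LEN=1024 openllm build mistral --model-id /root/OpenLLM/mymodel --backend vllm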