bentoml / OpenLLM

Run any open-source LLM, such as Llama or Gemma, as an OpenAI-compatible API endpoint in the cloud.
https://bentoml.com
Apache License 2.0

RunnerService: MAX_MODEL_LEN is not reflected in llm._max_model_len #902

Closed hahmad2008 closed 9 months ago

hahmad2008 commented 9 months ago

Describe the bug

Problem: I can't pass these values into the service. Even the environment variable MAX_MODEL_LEN is not reflected in llm._max_model_len. I also tried editing the bento.yaml file (shown below) and then running bentoml serve on the service, but the problem remains: the value still does not reach llm._max_model_len.

service: generated_mistral_service:svc
name: mymodel-service
version: 12345
bentoml_version: 1.1.11
creation_time: '2024-02-12T13:11:19.273169+00:00'
labels:
  configuration: '{"generation_config":{"max_new_tokens":256,"min_length":0,"early_stopping":false,"num_beams":1,"num_beam_groups":1,"use_cache":true,"temperature":0.7,"top_k":40,"top_p":0.95,"typical_p":1.0,"epsilon_cutoff":0.0,"eta_cutoff"
  model_ids: '["HuggingFaceH4/zephyr-7b-alpha","HuggingFaceH4/zephyr-7b-beta","mistralai/Mistral-7B-Instruct-v0.2","mistralai/Mistral-7B-Instruct-v0.1","mistralai/Mistral-7B-v0.1"]'
  model_id: /root/OpenLLM/mymodel
  _type: mymodel
  _framework: vllm
  start_name: mistral
  base_name_or_path: /root/OpenLLM/mymodel
  bundler: openllm.bundle
  openllm_client_version: 0.4.45.dev2
  openllm_core_version: 0.4.45.dev2
  openllm_version: 0.4.45.dev2
models:
- tag: vllm-mymodel:12345
  module: openllm.serialisation.transformers
  creation_time: '2024-02-12T13:01:50.059463+00:00'
  alias: vllm-mymodel
runners:
- name: llm-mistral-runner
  runnable_type: vLLMRunnable
  embedded: false
  models:
  - vllm-mymodel:12345
  resource_config: null
apis:
- name: generate_v1
  input_type: JSON
  output_type: JSON
- name: generate_stream_v1
  input_type: JSON
  output_type: Text
- name: metadata_v1
  input_type: Text
  output_type: JSON
- name: helpers_messages_v1
  input_type: JSON
  output_type: Text
docker:
  distro: debian
  python_version: '3.11'
  cuda_version: null
  env:
    BENTOML_CONFIG_OPTIONS: tracing.sample_rate=1.0 api_server.max_runner_connections=25
      runners."llm-mistral-runner".batching.max_batch_size=128 api_server.traffic.timeout=36000000
      runners."llm-mistral-runner".traffic.timeout=36000000 runners."llm-mistral-runner".workers_per_resource=0.5
      api_server.http.cors.enabled=true api_server.http.cors.access_control_allow_origins="*"
      api_server.http.cors.access_control_allow_methods[0]="GET" api_server.http.cors.access_control_allow_methods[1]="OPTIONS"
      api_server.http.cors.access_control_allow_methods[2]="POST" api_server.http.cors.access_control_allow_methods[3]="HEAD"
      api_server.http.cors.access_control_allow_methods[4]="PUT"
    OPENLLM_MODEL_ID: /root/OpenLLM/mymodel
    BENTOML_DEBUG: 'False'
    OPENLLM_ADAPTER_MAP: 'null'
    OPENLLM_SERIALIZATION: safetensors
    OPENLLM_CONFIG: '''{"max_new_tokens":256,"min_length":0,"early_stopping":false,"num_beams":1,"num_beam_groups":1,"use_cache":true,"temperature":0.7,"top_k":40,"top_p":0.95,"typical_p":1.0,"epsilon_cutoff":0.0,"eta_cutoff":0.0,"diversity_
    BACKEND: vllm
    DTYPE: float16
    TRUST_REMOTE_CODE: 'False'
    MAX_MODEL_LEN: '1024'
    GPU_MEMORY_UTILIZATION: '0.95'
    NVIDIA_DRIVER_CAPABILITIES: compute,utility
  system_packages: null
  setup_script: null
  base_image: null
  dockerfile_template: null
python:
  requirements_txt: null
  packages:
  - scipy
  - bentoml[tracing]>=1.1.11,<1.2
  - openllm[vllm]>=0.4.44
  lock_packages: false
  index_url: null
  no_index: null
  trusted_host: null
  find_links: null
  extra_index_url: null
  pip_args: null
  wheels: null
conda:
  environment_yml: null
  channels: null
  dependencies: null
  pip: null
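
For reference, the value in question ultimately has to reach the vLLM engine: llm._max_model_len corresponds to vLLM's own max_model_len engine argument. Below is a minimal sketch of the expected end state, calling vLLM directly rather than through OpenLLM, with the model path and defaults copied from the bento above (illustrative only, not how OpenLLM wires it internally):

import os
from vllm import LLM

# Read the same environment variables the bento's docker.env section sets,
# then pass them to vLLM; this is the value llm._max_model_len should report.
max_model_len = int(os.environ.get("MAX_MODEL_LEN", "1024"))
gpu_mem_util = float(os.environ.get("GPU_MEMORY_UTILIZATION", "0.95"))

llm = LLM(
    model="/root/OpenLLM/mymodel",   # local model path from the bento labels
    max_model_len=max_model_len,
    gpu_memory_utilization=gpu_mem_util,
    dtype="float16",
)

# Attribute path may vary across vLLM versions; shown here only to illustrate
# where the configured context length ends up.
print(llm.llm_engine.model_config.max_model_len)  # expected: 1024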

To reproduce

No response

Logs

No response

Environment

$ bentoml -v
bentoml, version 1.1.11

$ openllm -v
openllm, 0.4.45.dev2 (compiled: False)
Python (CPython) 3.11.7

System information (Optional)

No response

hahmad2008 commented 9 months ago

@aarnphm could you please check?

hahmad2008 commented 9 months ago

It seems MAX_MODEL_LEN should be set at openllm build time.
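
If that is the case, a possible workaround (untested, and assuming the environment variables present at build time are captured into the bento's docker.env as shown in the bento.yaml above; the exact build flags may differ between OpenLLM versions) would be to export the value before building:

$ MAX_MODEL_LEN=1024 openllm build mistral --model-id /root/OpenLLM/mymodel --backend vllm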