defenseunicorns / leapfrogai

Production-ready Generative AI for local, cloud native, airgap, and edge deployments.
https://leapfrog.ai
Apache License 2.0

Changing models is a huge pain #1082

Open vanakema opened 3 weeks ago

vanakema commented 3 weeks ago

Describe what should be investigated or refactored

Currently, the data rate for injecting a model into the model PVC is very slow (in relative terms). I observed about 66 MB/s, despite all of this network traffic occurring within the same machine.

I know that the gzip thread is completely pinned while the data injection step is happening, which indicates a bottleneck to me (the official gzip binary also has no multithreading). @Racer159 told me you can turn off gzip compression for the data injection, so that may be a good first step toward at least increasing the data transfer speed.
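For context, Zarf data injections are declared per component in zarf.yaml, and if memory serves each entry accepts a compress flag. The sketch below only illustrates the idea of turning compression off; the component name, selector, container, and paths are hypothetical and do not mirror the actual LeapfrogAI package definition.

```yaml
# Hypothetical zarf.yaml fragment: disable gzip for the model data injection.
components:
  - name: vllm-model                 # hypothetical component name
    dataInjections:
      - source: .model/              # local directory holding the model files
        target:
          namespace: leapfrogai      # hypothetical namespace
          selector: app=lfai-vllm    # hypothetical pod selector
          container: data-loader     # hypothetical container name
          path: /data/.model         # where the files land in the pod
        compress: false              # skip gzip so the single-threaded compressor is not the bottleneck
```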

Ideally in the future, there would be an opinionated, easy, reliable, fast workflow for uploading a new model to the system.

jxtngx commented 2 weeks ago

Is it possible to deploy LFAI with several models?

Say, for instance, I want to have Phi, Llama, and Minitron (NVIDIA) variants available as options and be able to switch between any of the models at will.

How might I do this?

gphorvath commented 2 weeks ago

@jxtngx there is not a great way today. You could configure and run multiple vLLM backends with different models, as long as they have different names so that the API config listener registers two different models. This would still require fairly intimate knowledge of the LeapfrogAI backend to accomplish, though.
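As a rough illustration of the "different names" point, and assuming the zarf-config.yaml format shown further down this thread, you could create the vLLM package twice with different name_override and model_repo_id values so that each backend registers under its own model name. The repo IDs and names below are hypothetical.

```yaml
# Hypothetical sketch: two separate zarf-config.yaml files (shown here as two
# YAML documents) used for two independent package creates/deploys.
package:
  create:
    set:
      model_repo_id: "microsoft/Phi-3-mini-4k-instruct"    # hypothetical first model
      name_override: "vllm-phi"                            # unique backend/model name
---
package:
  create:
    set:
      model_repo_id: "meta-llama/Meta-Llama-3-8B-Instruct" # hypothetical second model
      name_override: "vllm-llama"                          # unique backend/model name
```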

justinthelaw commented 2 weeks ago

Currently, the "easiest way" to use different models exists in this issue's branch: #835. There is not yet a hot-swap capability.

jxtngx commented 2 weeks ago

And if I simply want to use a model other than the default – do I only need to update the zarf.yaml for either vLLM or llama-cpp-python? And then redeploy the cluster & bundle?

justinthelaw commented 2 weeks ago

> And if I simply want to use a model other than the default – do I only need to update the zarf.yaml for either vLLM or llama-cpp-python? And then redeploy the cluster & bundle?

In #854 you only need to modify the zarf-config.yaml and then perform a package create and package install.

Here is an example, with the things you might change marked in comments. You can research other params using the vLLM docs or on HuggingFace! The params below are not a comprehensive, 1:1 mapping of what is fully available in vLLM or on HuggingFace; we are working on building that out.

```yaml
package:
  create:
    set:
      # x-release-please-start-version
      image_version: "0.13.0"
      # x-release-please-end

      model_repo_id: "TheBloke/Synthia-7B-v2.0-GPTQ" # change this
      model_revision: "gptq-4bit-32g-actorder_True" # change this
      model_path: "/data/.model/"
      name_override: "vllm"
  deploy:
    set:
      # vLLM runtime configuration (usually influenced by .env in local development)
      trust_remote_code: "True" # change this (based on whether there is custom layer, config or inferencing code)
      tensor_parallel_size: "1" # change this (based on # of GPU instances for sharding, attn_heads, etc.)
      enforce_eager: "False"
      gpu_memory_utilization: "0.90"
      worker_use_ray: "True"
      engine_use_ray: "True"
      quantization: "None"
      load_format: "auto"
      # LeapfrogAI SDK runtime configuration (usually influenced by config.yaml in development)
      max_context_length: "32768" # change this
      stop_tokens: "</s>, <|im_end|>, <|endoftext|>" # change this
      prompt_format_chat_system: "SYSTEM: {}\n" # change this
      prompt_format_chat_user: "USER: {}\n" # change this
      prompt_format_chat_assistant: "ASSISTANT: {}\n" # change this
      temperature: "0.1"
      top_p: "1.0"
      top_k: "0"
      repetition_penalty: "1.0"
      max_new_tokens: "8192"
      # Pod deployment configuration
      gpu_limit: "1" # change this (based on tensor parallel size, and # of GPU instances)
      gpu_runtime: "nvidia"
      pvc_size: "15Gi" # change this (if model is larger or smaller)
      pvc_access_mode: "ReadWriteOnce"
      pvc_storage_class: "local-path"
```

jxtngx commented 2 weeks ago

@justinthelaw I am on CPU and am using llama-cpp-python.

Should there also be a zarf-config.yaml added to llama-cpp-python in #854's dev branch so that the two inference packages mirror each other?

justinthelaw commented 2 weeks ago

Unfortunately, llama.cpp has not been refactored to reflect the changes made in the vLLM package yet. See issue #1098.

In llama.cpp's case, you will need to modify the config.yaml and zarf.yaml.
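
As a very rough sketch of what that involves: the config.yaml keys below mirror the SDK runtime parameters listed for vLLM above, but the exact names and layout are assumptions, so treat the llama-cpp-python package in the repo as the source of truth. In zarf.yaml, the part that changes is the reference to the GGUF model artifact that gets pulled into the model volume.

```yaml
# Hypothetical llama-cpp-python config.yaml sketch; key names are assumptions
# based on the SDK runtime parameters shown for vLLM above.
model:
  source: ".model/model.gguf"       # path to the GGUF file the backend loads
max_context_length: 8192            # change this to match the new model
stop_tokens:
  - "</s>"                          # change this to the model's stop token(s)
prompt_format:
  chat:
    system: "SYSTEM: {}\n"          # change these to the model's chat template
    user: "USER: {}\n"
    assistant: "ASSISTANT: {}\n"
defaults:
  temperature: 0.1
  top_p: 1.0
```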